Alternative Data Models and ROOT - Interfacing to a Database Data Model

Vicky White, Herb Greenlee, Jim Kowalkowski, Marjorie Shapiro
September 24, 1998

Report on preliminary investigations of how the ROOT data model maps onto persistent storage in a relational database

Goals of the Investigation and Areas of Concern

Transparent storage/retrieval - Database just another 'file format'

One goal of interfacing a database data model to ROOT is to provide an easy, and largely transparent, way to store data objects persistently in relational database tables. This is a goal of the D0 object persistency mechanism.

It is not at all clear that there is a hard REQUIREMENT to provide this type of interface for ROOT.

Whether it would be needed or not depends on just how much and what type of analysis we intend to do with ROOT. For example, do we need Calibration and Geometry Data in order to carry out some ROOT analyses?
How likely is it that parts of the Event Data could be profitably stored in a relational database, in order to provide sophisticated query and reporting facilities not supported in other persistent file formats?

However, if the data models for D0OM and ROOT are merged, as discussed in the other Data Model report, then it would be unnatural to provide mechanisms for reading C++ objects from a database using the D0OM package, but not to provide that functionality somehow through ROOT.

Mapping data between C++ objects and rows in database tables

Everyone seems to agree that mapping data between C++ objects and their representation as rows in database tables is not a trivial issue, and that it requires more information than is present within a C++ header file (i.e. than is described by the ROOT data model).

Herb Greenlee has managed to make a mapping between in-memory C++ objects and rows in database tables by driving the structure of the database, in a fixed mapping, from the structure of the C++ class. He also makes other simplifying assumptions which eliminate ambiguities about how objects are to be uniquely identified and how base classes and inherited classes are to be interpreted. Although such a mapping is therefore possible, there are numerous drawbacks to this approach and, in reality, the database ‘data model’ is normally driven not from the C++ class description, but from practical database design, access and performance considerations.
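
To make the idea of a fixed, class-driven mapping concrete, here is a minimal sketch. It is not Herb Greenlee's implementation and not ROOT or D0OM code: it is written in Python with the standard sqlite3 module purely for brevity, and the names Electron, SQL_TYPES, create_table_sql, write and read, as well as the database-assigned oid column, are all assumptions made for illustration. The table layout is derived from the class definition alone, which is exactly the simplification discussed above.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

# Hypothetical fixed type mapping: member type -> SQL column type.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}


@dataclass
class Electron:
    run: int
    energy: float
    label: str


def create_table_sql(cls):
    """Derive a CREATE TABLE statement purely from the class definition."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} (oid INTEGER PRIMARY KEY, {cols})"


def write(conn, obj):
    """Insert one object as one row; the database assigns the oid."""
    names = [f.name for f in fields(obj)]
    marks = ", ".join("?" for _ in names)
    sql = f"INSERT INTO {type(obj).__name__} ({', '.join(names)}) VALUES ({marks})"
    return conn.execute(sql, astuple(obj)).lastrowid


def read(conn, cls, oid):
    """Rebuild an object of class cls from the row with the given oid."""
    names = ", ".join(f.name for f in fields(cls))
    sql = f"SELECT {names} FROM {cls.__name__} WHERE oid = ?"
    return cls(*conn.execute(sql, (oid,)).fetchone())


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(Electron))
oid = write(conn, Electron(run=1, energy=45.5, label="e1"))
restored = read(conn, Electron, oid)
```

The sketch only works because it assumes flat classes, one table per class and a database-generated identifier. Nothing in it lets a database designer choose a different table layout, which is the drawback noted above.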

Other major differences between a complex/composite object in memory and its representation in a database relate to:
    a) Whether inheritance is understood in the database representation, or whether only final ‘leaf’ classes are known and can be read, written or queried. Example: base class Particle, derived classes Electron and Muon. Are there 2 tables in the database for all Particles? Could I instantiate a Particle instance? Supporting inheritance fully in a database involves much complexity and requires some clear rules. Fully supporting inheritance is almost certainly not desirable.
    b) Database objects must have unique persistent identifiers, enabling relationships between them to exist persistently. The generation of unique keys or IDs is normally the purview of the database, and MUST be in cases where the data members themselves cannot be combined to form a unique key. That unique ID is not known until a database write transaction has completed. Correctly referencing unique persistent IDs when writing out complex trees of objects therefore does not fit a simple streaming model of writing data.
    c) The behavior desired when writing out a complex tree-structured object of class A (A1) which references an instance of an object of class B (B1) must be specified. How is the distinction made between a reference (or relationship) to an already existing persistent object B1, and a reference to another instance of class B which is contained in, or is a part of, the initial object A1? What syntax in the ROOT data model allows that distinction to be made? Without a distinction between the relationships ‘refers to’ and ‘contains a’ in the syntax of your data model you cannot decide how to write out data. A C++ pointer or smart pointer is normally considered to mean ‘contains a’ when dealing with Event data. When dealing with non-event data, or references to non-event data within an event (if this should be allowed), these issues become important and need clarification.
    d) Containers which contain references (or keys) to persistent objects are different from Containers which contain the actual persistent objects. Containers which contain neither the data members themselves nor pointers to them (in a C++ data model sense), but which must be filled with data members from a database when instantiated, are yet another concept. This latter type of container is the most natural approach for data which is stored in a database. No ROOT (or D0OM) data model syntax exists to describe a Container which 'builds itself' upon instantiation, although in the case of D0OM we have discussed dealing with this case via specialized types of smart pointers, or possibly specialized types of containers.
    e) When you write an object to a database and find that there is already an object with that unique ID (assuming you know how to give it a unique ID), what do you want to happen: an error, or an update which overwrites the pre-existing object? No ROOT (or D0OM) data model syntax exists to map an object onto a unique persistent ID, or to describe whether a write is a 'write-new' or an 'update existing'.
    f) When you write objects to a database in a multi-user environment, issues of locking of database rows and/or tables must be handled to ensure the integrity of the database. Additionally there are user/password and connection issues, which govern access rights to the database and which must be handled somewhere in the i/o interface.
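
Items b) and c) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the classes Event, Hit and Calibration, the table layout, and the rule that a Calibration with no oid has not yet been written. It uses Python and sqlite3 only as a stand-in for a real C++ persistency layer and database server. The point is that the writer has to know which relationship is ‘contains a’ and which is ‘refers to’, and has to obtain the parent's database-generated key before any child row can reference it.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Calibration:               # shared, non-event object (the "B" of item c)
    gain: float
    oid: Optional[int] = None    # unknown until the database has written the row


@dataclass
class Hit:                       # owned by exactly one Event ("contains a")
    adc: int


@dataclass
class Event:                     # the "A" of item c
    number: int
    calib: Calibration           # "refers to": may already be persistent
    hits: List[Hit] = field(default_factory=list)


SCHEMA = """
CREATE TABLE calibration (oid INTEGER PRIMARY KEY, gain REAL);
CREATE TABLE event (oid INTEGER PRIMARY KEY, number INTEGER,
                    calib_oid INTEGER REFERENCES calibration(oid));
CREATE TABLE hit (oid INTEGER PRIMARY KEY, adc INTEGER,
                  event_oid INTEGER REFERENCES event(oid));
"""


def write_event(conn, ev):
    with conn:  # one transaction: commit on success, roll back on error
        # "refers to": write the target only if it is not yet persistent.
        if ev.calib.oid is None:
            cur = conn.execute("INSERT INTO calibration (gain) VALUES (?)",
                               (ev.calib.gain,))
            ev.calib.oid = cur.lastrowid
        # The parent row must exist before children can reference its key.
        cur = conn.execute("INSERT INTO event (number, calib_oid) VALUES (?, ?)",
                           (ev.number, ev.calib.oid))
        event_oid = cur.lastrowid
        # "contains a": owned objects are always written as new rows.
        conn.executemany("INSERT INTO hit (adc, event_oid) VALUES (?, ?)",
                         [(h.adc, event_oid) for h in ev.hits])
    return event_oid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
calib = Calibration(gain=1.02)
first = write_event(conn, Event(1, calib, [Hit(10), Hit(20)]))
second = write_event(conn, Event(2, calib, [Hit(30)]))  # reuses the stored calib
```

Writing the two events stores the shared Calibration once and the three Hits separately. A plain stream of the two Event objects could not make that distinction without extra information of the kind discussed in c).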

Extensions to the Data Model to support Database specifics

Because of the above ambiguities (and others; we will not be exhaustive), it is clear to several of us that some sort of Private Data Definition Language (PDDL) is needed to describe the data model of any data which is to be stored persistently in a database. C++ header files in the ROOT, or D0OM, data model could be produced from such a PDDL, but the PDDL would also contain additional information pertinent only to database issues and database schema. It would not support the full generality of an object database, but would contain sufficient mapping concepts for all of OUR needs.
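
As a rough illustration of what such a description might carry, the sketch below uses a Python dictionary as a stand-in for one PDDL entry. No PDDL syntax has been defined, so every key name here (class, table, key, on_duplicate, members) and both generator functions are purely hypothetical. The sketch shows a single description producing both a C++ class declaration and a table definition, while also holding database-only information that never appears in the header.

```python
# Hypothetical description of one persistent class. Every key name is
# invented for illustration and is not an existing PDDL syntax.
PARTICLE = {
    "class": "Particle",
    "table": "particles",
    "key": "particle_id",        # database-only: unique persistent identifier
    "on_duplicate": "error",     # database-only: write policy
    "members": [                 # (member name, C++ type, SQL column type)
        ("px", "double", "REAL"),
        ("py", "double", "REAL"),
        ("charge", "int", "INTEGER"),
    ],
}


def cpp_header(desc):
    """Produce the text of a C++ class declaration from the description."""
    lines = [f"class {desc['class']} {{", "public:"]
    lines += [f"    {ctype} {name};" for name, ctype, _ in desc["members"]]
    lines.append("};")
    return "\n".join(lines)


def table_ddl(desc):
    """Produce the matching CREATE TABLE statement, including the key column."""
    cols = [f"{desc['key']} INTEGER PRIMARY KEY"]
    cols += [f"{name} {sqltype}" for name, _, sqltype in desc["members"]]
    return f"CREATE TABLE {desc['table']} ({', '.join(cols)})"


header = cpp_header(PARTICLE)
ddl = table_ddl(PARTICLE)
```

The key and on_duplicate entries appear only on the database side, which is the kind of additional information a C++ header alone cannot express.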

To support use of, and interpretation of, such a PDDL directly in ROOT some further database-specific classes might be needed (e.g. extra Txxxx classes within ROOT). These would support behaviors specific to database-capable objects and containers, such as behaviors related to unique IDs (or keys), container behavior, queries to locate a particular object, and user name and password concepts.

In any case, wherever we intend to map data between C++ in-memory objects and database rows we will need a degree of specificity not present in a C++ header file. Work must therefore be done to generate either:

The specific code which performs the mapping for a particular class (the ROOT Streamer way of working), OR
Dictionary Data which describes the C++ class, the database structure and the rules for conversion, and which can be used by generic code which performs the mapping for all classes (the D0OM data-dictionary driven way of working).

This generated code, or generic code + generated data, must become part of whatever persistency mechanism is used.
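
The two options can be contrasted in a short sketch. It is not ROOT Streamer code or the D0OM dictionary. The Muon class, the DICTIONARY layout, the policy entry and both writer functions are assumptions made for illustration, and INSERT OR REPLACE is SQLite syntax used here only because sqlite3 ships with Python. The first function is what per-class generated code might look like. The second is one generic routine driven entirely by generated data, which here also records whether a write is a 'write-new' or an 'update existing'.

```python
import sqlite3


class Muon:
    def __init__(self, muon_id, pt, eta):
        self.muon_id, self.pt, self.eta = muon_id, pt, eta


# Approach 1: class-specific code (in the spirit of a generated Streamer).
def write_muon(conn, m):
    conn.execute("INSERT INTO muon (muon_id, pt, eta) VALUES (?, ?, ?)",
                 (m.muon_id, m.pt, m.eta))


# Approach 2: generated dictionary data plus one generic writer.
DICTIONARY = {
    Muon: {
        "table": "muon",
        "columns": {"muon_id": "muon_id", "pt": "pt", "eta": "eta"},  # member -> column
        "policy": "update",  # "new" fails on a duplicate key, "update" overwrites
    },
}


def write_generic(conn, obj):
    """Write any object whose class has an entry in DICTIONARY."""
    entry = DICTIONARY[type(obj)]
    cols = list(entry["columns"].values())
    vals = [getattr(obj, member) for member in entry["columns"]]
    verb = "INSERT OR REPLACE" if entry["policy"] == "update" else "INSERT"
    marks = ", ".join("?" for _ in cols)
    conn.execute(f"{verb} INTO {entry['table']} ({', '.join(cols)}) VALUES ({marks})",
                 vals)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE muon (muon_id INTEGER PRIMARY KEY, pt REAL, eta REAL)")
write_muon(conn, Muon(1, 20.0, 0.5))
write_generic(conn, Muon(1, 25.0, 0.5))   # same key: overwritten under "update"
rows = conn.execute("SELECT muon_id, pt, eta FROM muon").fetchall()
```

In the first approach each new class needs new code. In the second, each new class needs only a new dictionary entry, at the cost of a generic routine that must handle every mapping rule the dictionary can express.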

Integration of Database extensions to the data model with ROOT

The questions to be answered are:

Given that the ROOT data model (and indeed the D0OM data model) is inadequate, when taken alone, to describe how to map data to/from its database representation:

a) how can whatever mechanisms we implement in order to do this mapping be integrated into ROOT?

This could be done through some direct ROOT interface implementing a specific type of streaming, or by interfacing ROOT to a database server designed to carry out most of the hard work. The interface to the database server, although largely 'streaming', would then need to be better defined.

b) how exactly can/could a private data definition language (PDDL) parser be used along with ROOT?

Refer to the report on CINT for this issue.

c) what implications would having a PDDL and parser for it have for CINT?

Refer to the report on CINT for this issue.

d) which classes in ROOT would need to be modified or extended to support the Database functionality?

e) how would these mechanisms for database persistency co-exist with other existing ROOT i/o mechanisms provided for TFile, TSocket, etc. based on either Streamer methods or Data Dictionary driven TTree traversal?

There would clearly be some additional classes and functionality needed. The design of such classes needs to be discussed further, and various ideas for how to implement them brought to the table.

Recommendations to Address the Areas of Concern

1) Firstly we must decide if there is merit in providing a general database persistency mechanism for ROOT objects (and their necessarily enhanced description, as discussed above).

It is possible to consider that the ROOT data model is disconnected from any database persistency issues and provides only for the streaming of objects to/from a data sink/source. However, this only delegates all the database-related issues and ambiguities discussed above to a database server implementing the data source/sink. The work still has to be done in the database server (as well as the work to provide and parse a private DDL) in order to map data between database tables and a representation described by the ROOT data model. So really, this is ROOT work, whether or not it is considered ‘within’ the ROOT product. The alternative approach, hand-coding the behavior of each class to be read/written from/to the database and re-implementing the resolution of ambiguities and the specific behaviors and policies (such as for persistent IDs) for each class stored in the database, is not a viable one in my opinion. Since the Database Server and PDDL work probably has to be done anyway by both CDF and D0, for mapping to some sort of C++ objects, the integration issues with ROOT, and the re-use of the code between CDF Calibration work, D0OM’s d0streamOracle, and ROOT, should be further explored.

2) We must address how a query is to be performed in order to locate a specific object, or collection of objects, in a database. Even if no general database persistency mechanism is deemed worthwhile, some additional syntax/classes are required to provide this minimal query functionality. This is perhaps where a TSQL class would come in within ROOT. This needs further design work and specification.
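
A minimal sketch of such query functionality follows. It does not represent a TSQL class or any existing ROOT interface. The find function, its arguments and the Calibration class are hypothetical, and Python with sqlite3 is used only to keep the example short. It assumes one table per class with one column per data member, and returns fully built objects, not raw rows.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Calibration:
    run: int
    channel: int
    gain: float


def find(conn, cls, table, where="1", params=()):
    """Return objects of cls built from the rows of table that match where.

    where is a trusted SQL fragment; values are passed separately in params
    so that they are bound by the database driver and never pasted into SQL.
    """
    names = ", ".join(f.name for f in fields(cls))
    sql = f"SELECT {names} FROM {table} WHERE {where} ORDER BY rowid"
    return [cls(*row) for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calibration (run INTEGER, channel INTEGER, gain REAL)")
conn.executemany("INSERT INTO calibration VALUES (?, ?, ?)",
                 [(100, 1, 1.01), (100, 2, 0.98), (101, 1, 1.03)])

run100 = find(conn, Calibration, "calibration", "run = ?", (100,))
one = find(conn, Calibration, "calibration", "run = ? AND channel = ?", (101, 1))
```

Even this minimal form raises the design questions noted above: how the selection is expressed, and whether the result is a single object, a collection, or a container that fills itself.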

3) How unique keys or identifiers should be generated or identified (given data already in a database) needs to be designed. This is particularly important if data might end up being read from the same database tables into a ROOT object, a D0OM object, a CDF calibration object, or some other format (say Python objects), using different persistency packages.

4) If the ROOT team believe that a PDDL is not necessary in order to describe exactly how to map and handle database-persistent objects, we need to understand what alternative schemes and extensions to the ROOT data model they propose (such as use of particular comments to denote behaviors required) to address some of the ambiguities and behaviors listed in section 1.

Tasks to be Accomplished (with their relative benefits) to Achieve the Recommendations

More investigation is needed before specific tasks can be identified.

Cost of Accomplishing the Tasks - $ and FTEs

Since we are talking only about carrying out further investigation and design work, this could probably be accomplished with 2 people spending 2 or 3 weeks on it. Say 4 FTE weeks in total.

Other

One of the key issues in understanding how mechanisms which support database data models and behaviors can be incorporated into, and work within, ROOT is understanding how modular and flexible CINT is, and how to build Translators or Parsers which would augment and work alongside CINT. We defer to the CINT report on this issue; its conclusions are essential to our understanding of what can be done with databases.

If you have comments or suggestions, email me at white@fnal.gov