Report on preliminary investigations of how the ROOT data model maps
onto persistent storage in a relational database
Herb Greenlee has managed to map in-memory C++ objects onto rows in
database tables by deriving the structure of the database, through a
fixed mapping, from the structure of the C++ class. He also makes
other simplifying assumptions that remove ambiguities about how objects
are to be uniquely identified and how base classes and derived classes
are to be interpreted. Although such a mapping is therefore
possible, the approach has numerous drawbacks and, in practice,
the database ‘data model’ is normally driven not by the C++ class
description but by practical database design, access and
performance considerations.
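As an illustration of what a fixed, class-driven mapping looks like, here is a minimal sketch in Python using the standard sqlite3 and dataclasses modules. It is not Herb Greenlee's implementation. The class names, the type table SQL_TYPES, the added surrogate "id" column and the choice to flatten inherited members into the leaf class's table are all assumptions made for the example.

```python
import sqlite3
from dataclasses import dataclass, fields

# Hypothetical classes standing in for C++ classes.
@dataclass
class Particle:
    px: float
    py: float
    pz: float

@dataclass
class Electron(Particle):
    cluster_energy: float

# Assumed fixed mapping from member type to SQL column type.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}

def create_table_sql(cls):
    """Derive a CREATE TABLE statement from the data members of cls."""
    # Assumption: the mapping adds a surrogate key column named "id".
    columns = ["id INTEGER PRIMARY KEY"]
    # fields() also returns members inherited from base classes, so the
    # leaf class's table holds the whole flattened object.
    for f in fields(cls):
        columns.append(f"{f.name} {SQL_TYPES[f.type]}")
    return f"CREATE TABLE {cls.__name__} ({', '.join(columns)})"

def write(conn, obj):
    """Insert one object as one row and return the database-generated id."""
    names = [f.name for f in fields(obj)]
    placeholders = ", ".join("?" for _ in names)
    sql = (f"INSERT INTO {type(obj).__name__} "
           f"({', '.join(names)}) VALUES ({placeholders})")
    cur = conn.execute(sql, [getattr(obj, n) for n in names])
    return cur.lastrowid
```

With this sketch the table layout is entirely dictated by the class definition, which is exactly the property questioned above: nothing in it reflects database design, access or performance considerations.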
Other major differences between a complex/composite object in memory
and its representation in a database relate to:
a) Whether inheritance is understood in the database
representation, or whether only final ‘leaf’ classes are known and can
be read, written or queried. Example: base class Particle,
derived classes Electron and Muon. Two tables in the database for
all Particles? Could I instantiate a Particle instance?
Supporting inheritance fully in a database involves
much complexity and requires some clear rules. Fully supporting inheritance
is almost certainly not desirable.
b) Database objects must have unique persistent
identifiers, enabling relationships between them to exist persistently.
The generation of unique keys or IDs is normally the purview of the database,
and MUST be so in cases where the data members themselves cannot be combined
to form a unique key. A database-generated ID is not known until a database
write transaction has completed. Correctly
referencing unique persistent IDs when writing out complex trees of objects
therefore does not fit a simple streaming model of writing data.
c) The behavior desired when writing out a complex
tree-structured object of class A (A1) which references an instance
of class B (B1) must be specified. How is the
distinction made between a reference (or relationship) to an already existing
persistent object B1 and a reference to another instance of
class B that is contained in, or is a part of, the initial
object A1? What syntax in the ROOT data model allows that distinction
to be made? Without a distinction
between the relationships ‘refers to’ and ‘contains a’
in the syntax of your data model, you cannot decide how to write out data.
A C++ pointer or smart pointer is normally considered to mean ‘contains a’
when dealing with Event data. When dealing with non-event data,
or references to non-event data within an event (if this should be allowed),
these issues become important and need clarification.
d) Containers which contain references (or keys)
to persistent objects are different from containers which contain
the actual persistent objects. Containers which contain neither the data
members themselves nor pointers to them (in a C++ data model sense),
but which must be filled with data members from a database when instantiated,
are yet another concept. This latter type of container
is the most natural approach for data which is stored in a database.
No ROOT (or D0OM) data model syntax exists to describe
a container which 'builds itself' upon instantiation, although in
the case of D0OM we have discussed dealing with this case via specialized
types of smart pointers, or possibly specialized types of containers.
e) When you write an object to a database and find
that there is already an object with that unique ID (assuming you know
how to give it a unique ID), what do you want to happen: an error,
or an update that overwrites the pre-existing object? No
ROOT (or D0OM) data model syntax exists to map an object onto a unique
persistent ID, or to describe whether a write is a 'write-new' or an 'update
existing'.
f) When you write objects to a database in
a multi-user environment, locking of database
rows and/or tables must be handled to ensure the integrity of the database.
Additionally, there are user/password and database connection issues,
which govern access rights and which must be handled somewhere in the I/O
interface.
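Points b) and c) can be made concrete with a small sketch, again in Python with sqlite3 rather than in ROOT or D0OM. The classes A and B, the tables of the same names, the attribute persistent_id and the flag contains_b are all hypothetical. The flag stands in for the 'refers to' versus 'contains a' information that a bare C++ pointer does not carry.

```python
import sqlite3

# Hypothetical schema: each A row holds a foreign key to one B row.
SCHEMA = """
CREATE TABLE B (id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT,
                b_id INTEGER REFERENCES B(id));
"""

class B:
    def __init__(self, value):
        self.value = value
        self.persistent_id = None  # unknown until the object is written

class A:
    def __init__(self, name, b, contains_b):
        self.name = name
        self.b = b
        # Explicit statement of the relationship: True means 'contains a',
        # False means 'refers to' an already persistent B.
        self.contains_b = contains_b

def write_b(conn, b):
    cur = conn.execute("INSERT INTO B (value) VALUES (?)", (b.value,))
    b.persistent_id = cur.lastrowid  # the key exists only after the insert
    return b.persistent_id

def write_a(conn, a):
    if a.contains_b:
        # 'contains a': B is written as part of A, and must go first so
        # that its generated key can be stored in the A row.
        b_id = write_b(conn, a.b)
    elif a.b.persistent_id is None:
        raise ValueError("'refers to' requires an already persistent B")
    else:
        # 'refers to': reuse the existing row, write nothing to table B.
        b_id = a.b.persistent_id
    cur = conn.execute("INSERT INTO A (name, b_id) VALUES (?, ?)",
                       (a.name, b_id))
    return cur.lastrowid
```

The write order here is child first, then parent, because the parent row needs the child's generated key. That ordering constraint is what separates this from simply streaming the object tree in traversal order.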
To support use of, and interpretation of, such a PDDL directly in ROOT, some further database-specific classes might be needed (e.g., extra Txxxx classes within ROOT). These would support behaviors specific to database-capable objects and containers, such as behaviors related to unique IDs (or keys), container behavior, queries to locate a particular object, and user name and password concepts.
In any case, wherever we intend to map data between C++ in-memory objects and database rows we will need a degree of specificity not present in a C++ header file. Work must therefore be done to generate either
Given that the ROOT data model (and indeed the D0OM data model) is inadequate, when taken alone, to describe how to map data to/from its database representation:
a) how can whatever mechanisms we implement in order to do this mapping be integrated into ROOT?
e) how would these mechanisms for database persistency co-exist with the other existing ROOT I/O mechanisms provided for TFile, TSocket, etc., which are based on either Streamer methods or Data Dictionary driven TTree traversal?
1) First, we must decide whether there is merit in providing a general
database persistency mechanism for ROOT objects (and their necessarily
enhanced description, as discussed above).
It is possible to consider that the ROOT data model is disconnected from any database persistency issues and provides only for the streaming of objects to/from a data sink/source. However, this merely delegates all the database-related issues and ambiguities discussed above to a database server implementing the data source/sink. The work of mapping data between database tables and a representation described by the ROOT data model still has to be done in the database server, as does the work to provide and parse a private DDL. So really, this is ROOT work, whether or not it is considered ‘within’ the ROOT product.
The alternative approach is to hand-code the behavior of each class to be read/written from/to the database, re-implementing the resolution of ambiguities and the specific behaviors and policies (such as for persistent IDs) for each class stored in the database. In my opinion this is not viable.
Since the Database Server and PDDL work probably has to be done anyway by both CDF and D0, for mapping to some sort of C++ objects, the integration issues with ROOT, and the re-use of the code between CDF Calibration work, D0OM’s d0streamOracle, and ROOT, should be further explored.
2) We must address how a query is to be performed in order to locate a specific object, or collection of objects, in a database. Even if no general database persistency mechanism is deemed worthwhile, some additional syntax or classes are required to provide this minimal query functionality. This is perhaps where a TSQL class would come into ROOT. This needs further design work and specification.
3) How unique keys or identifiers should be generated, or identified for data already in a database, needs to be designed. This is particularly important if data from the same database tables might be read into a ROOT object, a D0OM object, a CDF calibration object, or some other format (say, Python objects), using different persistency packages.
4) If the ROOT team believe that a PDDL is not necessary in order to describe exactly how to map and handle database-persistent objects, we need to understand what alternative schemes and extensions to the ROOT data model they propose (such as use of particular comments to denote behaviors required) to address some of the ambiguities and behaviors listed in section 1.
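To make points 2) and 3) more tangible, the sketch below shows the minimum that a query and write interface would have to expose: a key built from data members, an explicit choice between 'write-new' and 'update existing', and a lookup by key. It is written in Python with sqlite3 purely for illustration and does not represent a TSQL class or any existing ROOT, D0OM or CDF interface. The table name calib, its columns and both function names are assumptions.

```python
import sqlite3

# Hypothetical calibration table whose unique key is formed from the
# data members (run, channel) rather than generated by the database.
SCHEMA = """CREATE TABLE calib (
    run INTEGER, channel INTEGER, gain REAL,
    PRIMARY KEY (run, channel))"""

def write(conn, run, channel, gain, policy):
    """Write one row; policy is 'write-new' or 'update-existing'."""
    if policy == "write-new":
        # Raises sqlite3.IntegrityError if the key is already present.
        conn.execute(
            "INSERT INTO calib (run, channel, gain) VALUES (?, ?, ?)",
            (run, channel, gain))
    elif policy == "update-existing":
        cur = conn.execute(
            "UPDATE calib SET gain = ? WHERE run = ? AND channel = ?",
            (gain, run, channel))
        if cur.rowcount == 0:
            # Nothing to update: the key was not in the table.
            raise KeyError((run, channel))
    else:
        raise ValueError(f"unknown policy: {policy}")

def find(conn, run):
    """Locate all rows for one run, ordered by channel."""
    return conn.execute(
        "SELECT channel, gain FROM calib WHERE run = ? ORDER BY channel",
        (run,)).fetchall()
```

Because the key here is part of the data, any persistency package reading the same table can identify the same object in the same way. A database-generated key would instead have to be agreed upon and exposed to every package.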
More investigation is needed before specific tasks can be identified.
If you have comments or suggestions, email me at white@fnal.gov