Run II Physics Analysis Software Functional Requirements
May 18, 1998
Eileen Berman, Iain Bertram, Pushpa Bhat, Frank Chlebana, Mark Edel, Sarah Eno,
Irwin Gaines, Herb Greenlee, Paul Lebrun, Qizhong Li, Lee Lueking, Kaori Maeshima, John
Marraffino, Pasha Murat, Larry Nodulman, Dane Skow, Kris Sliwa, Steve Vejcik,
Avi Yagil, Andrzej Zieminski
This document describes the functional requirements for Physics Analysis
Software for Run II. In general, these requirements should cover all software
needed to access, analyze, and present, in reports and publications, data at the
volumes that will need to be handled in Run II. The requirements are organized
into several categories representing the major functions of access, analysis,
and presentation, with a final category dealing with usability issues.
Requirements containing the word "must" are mandatory requirements, and failure
to meet these would disqualify a product from consideration. Requirements
containing the word "should" are desirable requirements; all other things being
equal, a product satisfying more of the desirable requirements would be
preferred over one satisfying fewer. Each section of requirements is preceded
by descriptive text giving some background justifying the specific needs.
DATA ACCESS
The data access functions must allow data in a variety of formats to be
retrieved for subsequent analysis in online, offline interactive, and offline
batch environments. The rate of access must support common online and offline
uses. Events must be able to be accessed both serially and randomly, and data
must be accessible with chunk sizes smaller than entire events.
It is unrealistic to expect all experiments to use a common data format (or even
for a single experiment to use the same format for all stages of analysis).
Data formats must support different optimizations based on different access
patterns. These considerations lead to the requirements on input of foreign
data formats and creation of specialized output formats.
Detailed Requirements:
- Access rates (online): The tool must be able to
be used in an online environment where data is being accessed in real time.
- Access rates (offline): The tool must be able
to access very large (at least several TB) data sets. It should be able to
combine results from accessing several different data streams.
- Serial vs random access: The analysis tool must
be able to read a serial stream of data efficiently, at no less than 90% of the
bandwidth provided by the storage media on which the data resides (i.e., the
analysis tool should not impose any significant additional overhead for serial
reads). The tool must also allow random access to individual events within a
larger event stream without undue overhead. The tool must support reading data
from, and writing data to, all of the various devices in a mass storage system
hierarchy.
- Granularity of access: The tool must provide
mechanisms for reading only a portion of an event without using up I/O bandwidth
for the unread portions of the event. This may require the data to be
reformatted into a specialized optimized format with some pre-knowledge of the
granularity that the physicists will request.
- Foreign Input and Output Formats: The analysis tool must
provide a hook for user-supplied conversion routines to read foreign data
formats (a sketch of such a hook follows this list). Similarly, there must be a
user hook allowing foreign output formats to be written.
- Specialized output formats: The tool must
allow data to be read in one format and written out in another. It is highly
desirable for the tool to provide certain specialized formats that optimize data
access bandwidth based on expected access patterns.
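
To make the foreign-format hook requirement above more concrete, the following
C++ sketch shows one possible shape for a user-supplied conversion interface.
All class and function names (ForeignFormatReader, FixedRecordReader, Event)
are hypothetical and are given only as an illustration, not as a prescription
for any particular tool.

// Hypothetical sketch of a user-supplied conversion hook for foreign input
// formats; none of these names correspond to an existing product.
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct Event {                        // opaque event representation used by the tool
    std::vector<char> payload;
};

// Abstract hook: the analysis tool calls this to pull events from a foreign source.
class ForeignFormatReader {
public:
    virtual ~ForeignFormatReader() = default;
    virtual bool open(const std::string& location) = 0;    // attach to a file or stream
    virtual std::unique_ptr<Event> nextEvent() = 0;         // returns nullptr at end of input
};

// Example user implementation for a foreign format with fixed-length records.
class FixedRecordReader : public ForeignFormatReader {
public:
    ~FixedRecordReader() override { if (file_) std::fclose(file_); }
    bool open(const std::string& location) override {
        file_ = std::fopen(location.c_str(), "rb");
        return file_ != nullptr;
    }
    std::unique_ptr<Event> nextEvent() override {
        auto ev = std::make_unique<Event>();
        ev->payload.resize(kRecordSize);
        if (std::fread(ev->payload.data(), 1, kRecordSize, file_) != kRecordSize)
            return nullptr;                                 // end of data (or short read)
        return ev;
    }
private:
    static constexpr std::size_t kRecordSize = 4096;        // assumed fixed record length
    std::FILE* file_ = nullptr;
};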
DATA ANALYSIS
Data analysis consists of the related processes of selecting samples of events,
performing analysis on these samples by calculating various mathematical
functions from the data in the selected events, allowing interactive variation
both in the selection criteria and the calculations performed, preserving
samples of events in specialized (optimized) formats for later re-analysis, and
preserving the functions and selection criteria.
One important tool for this analysis is a scripting language which allows the
physicist to specify both the selection criteria and mathematical operations to
be applied to the data, and to control the overall analysis, plotting and
presentation environment. Thus this scripting language must combine some of the
functionality of programming languages with that of command-line or menu-driven
control interfaces.
However, the basic requirement is that the analysis tool support a rich
interactive environment offering easy control of data access and analysis
description as well as interactive development of physics algorithms. There
must be some level of compatibility with offline code, so that algorithms
developed with the analysis tool are usable offline and offline code can be
incorporated in analysis. It is felt that the most effective paradigm to meet this requirement
is to require the analysis tool to support linking with external high level
language (HLL) routines. The scripting language then does not need to be
identical to any particular high level language (or subset of a language) as
long as it allows basic data access, commands, simple evaluations, flow control
and looping, and, most importantly, the invocation of precompiled or dynamically
linked high level language procedures. It is also important for the scripting
language to support the offline object model for data. There is no requirement,
however, for COMIS-like interactive functionality as long as the scripting
language supports links to HLL routines.
It might be argued that portability and ease of use (and learning)
considerations would suggest that the scripting language be identical to some
existing HLL. However, it is felt that dynamic linking is a better way to
support portability and offline compatibility. Even if the scripting language
shares its syntax with some HLL, it will need to have many new commands to
support data plotting and presentation that are not native to the HLL anyway.
Moreover, the interactive scripting language will never be totally identical to
the HLL on which it might be based, causing problems with new bugs and limited
portability. It was therefore concluded that there is no requirement for the
scripting language to be derived from some HLL, although it is recognized that
when used carefully such a scripting language can have certain advantages.
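
As an illustration of the dynamic-linking paradigm described above, the
following sketch shows how a precompiled user routine could be loaded and
invoked at run time on a UNIX system using dlopen/dlsym. The routine and
library names (user_invariant_mass, libuser.so) are hypothetical; any real tool
would wrap this mechanism in its own interface.

// Illustrative only: a user routine compiled into a shared library with C
// linkage, and the framework side that loads and calls it via dlopen/dlsym.
// The routine and library names are hypothetical.

// --- user_code.cc, compiled into libuser.so --------------------------------
#include <cmath>

extern "C" double user_invariant_mass(double e1, double px1, double py1, double pz1,
                                      double e2, double px2, double py2, double pz2) {
    double e  = e1 + e2;
    double px = px1 + px2, py = py1 + py2, pz = pz1 + pz2;
    double m2 = e * e - (px * px + py * py + pz * pz);
    return m2 > 0.0 ? std::sqrt(m2) : 0.0;
}

// --- framework side: locate and invoke the routine at run time -------------
#include <dlfcn.h>
#include <cstdio>

typedef double (*mass_fn)(double, double, double, double,
                          double, double, double, double);

int main() {
    void* handle = dlopen("./libuser.so", RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }
    mass_fn mass = reinterpret_cast<mass_fn>(dlsym(handle, "user_invariant_mass"));
    if (!mass)   { std::fprintf(stderr, "%s\n", dlerror()); return 1; }
    std::printf("invariant mass = %f\n",
                mass(45.0, 0.0, 0.0, 44.9, 45.0, 0.0, 0.0, -44.9));
    dlclose(handle);
    return 0;
}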
Detailed Requirements:
Scripting Language:
- The analysis tool must include a full featured scripting language,
as commonly understood, and as outlined below.
- The scripting language must have some understanding of events as objects,
as opposed to some simpler structure, such as arrays of numbers. The
analysis tool's object model should be compatible with standard
object oriented programming languages, such as C++. Note that PAW's
columnwise ntuple event model does not really meet this requirement.
- The scripting language must be able to extract data (as built-in data
  types or sub-objects) from event objects for histogramming, printing,
  or other processing (a sketch of such an event object model follows
  this list).
- The scripting language must be able to express complex mathematical
expressions using event data.
- The scripting language should have debugging facilities.
- It must be possible to interface the scripting language to dynamically
  linked compiled high level languages, such as C, C++, or Fortran.
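
The following sketch, using purely hypothetical class names, indicates the kind
of event object model and data extraction the scripting-language requirements
above envision: an Event owning sub-objects from which built-in quantities are
extracted, selected on, and histogrammed.

// Minimal sketch of an event object model (hypothetical names): an Event owns
// Track sub-objects, and built-in quantities are extracted, cut on, and
// histogrammed, as a scripting-language loop over event.tracks would do.
#include <cmath>
#include <vector>

struct Track {
    double px, py, pz;
    double pt() const { return std::sqrt(px * px + py * py); }
};

struct Event {
    int run, number;
    std::vector<Track> tracks;
};

// Hypothetical 1D histogram with a simple fill interface.
class Histogram1D {
public:
    Histogram1D(int nbins, double lo, double hi) : bins_(nbins, 0.0), lo_(lo), hi_(hi) {}
    void fill(double x) {
        int i = static_cast<int>((x - lo_) / (hi_ - lo_) * bins_.size());
        if (i >= 0 && i < static_cast<int>(bins_.size()))
            bins_[i] += 1.0;                                // ignore out-of-range entries
    }
private:
    std::vector<double> bins_;
    double lo_, hi_;
};

// The compiled equivalent of a script such as
//   for track in event.tracks: if track.pt() > 5: hist.fill(track.pt())
void fillPt(const Event& event, Histogram1D& hist) {
    for (const Track& t : event.tracks)
        if (t.pt() > 5.0)                                   // selection cut on extracted data
            hist.fill(t.pt());
}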
User Control:
- The scripting language must support all control functions necessary to
specify data access and selection, sequence of operations, screen layout and
plotting, fitting, etc.
- Mathematical operations must be able to be interleaved with user
stipulated sequences of control messages to the analysis package.
- Results of preliminary, intermediate and final stages of analysis must
be available to users at relevant times and in appropriate storage formats.
- The scripting language must support command line recall and interactive
  command line editing.
Data Selection:
- It must be possible to make decisions and to program selection criteria
based on event data using data extracted by any of the above methods,
so that only selected events are histogrammed, output, or subjected
to some kind of further processing.
- The analysis tool should be able to display selection criteria as
text (on histograms or for printed output, etc.).
Input/Output:
- The analysis tool should support its own object I/O format.
- The analysis tool should include libraries that allow its own object
  file format to be read or written from compiled programs.
- The analysis tool must be able to read or write object files in
foreign formats using (user supplied) external modules.
- The scripting language must be able to write selected event objects
to one or more output streams based on arbitrary selection criteria.
- The analysis tool should provide an object definition language
and/or be able to define new object formats programmatically.
- It follows from the previous criteria that it must be possible, within
  the scripting language, to read events in one format and write them out
  in a different format.
- The analysis tool should support "virtual streaming," which means
that it can tag a set of selected events, and read them back, without
physically writing them to a separate output stream.
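
The following sketch illustrates one possible meaning of "virtual streaming,"
under the assumption that events can be addressed by index within a larger
stream; the VirtualStream class and its interface are hypothetical.

// Sketch of "virtual streaming" (hypothetical interface): selected events are
// tagged by their index within the input stream rather than copied to a new
// physical output stream, and the tag is replayed on a later pass.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class VirtualStream {
public:
    void tag(const std::string& name, std::size_t eventIndex) {
        tags_[name].push_back(eventIndex);                  // remember the event, do not copy it
    }
    const std::vector<std::size_t>& selection(const std::string& name) const {
        static const std::vector<std::size_t> empty;
        auto it = tags_.find(name);
        return it == tags_.end() ? empty : it->second;
    }
private:
    std::map<std::string, std::vector<std::size_t>> tags_;
};

// Usage: while scanning the input, tag events passing a cut, then on a second
// pass read back only the tagged events:
//   if (passesCut(event)) stream.tag("highPt", index);
//   for (std::size_t i : stream.selection("highPt")) process(readEvent(i));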
Numeric and Mathematical Functionality:
- The analysis package must include accurate and precise numerical
functionality, including double precision.
- Analysis capabilities must be able to be applied to data presented to
the front end interface as well as to subsequent renditions of the data
(such as binned histograms).
- Functions operating on multiple data sets (such as K-S tests of
multiple histograms) must be included.
- Mathematical operations must include the ability to fit data,
parameterize data, and calculate statistical quantities from data using
accessible and supported libraries or repositories of functions or
programs.
- Fitting procedures must allow user control of fitting algorithms (see the
  sketch below).
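
The following sketch illustrates what user control of a fitting procedure might
look like: the user supplies the model function and chooses the minimization
strategy (here a simple chi-square scan over one parameter). The interface is
hypothetical and is not intended to describe any particular product's fitting
API.

// Sketch of user control over fitting (hypothetical interface): the user
// supplies the model function and the minimization strategy.
#include <cstddef>
#include <vector>

typedef double (*ModelFn)(double x, double p);              // user-supplied model y = f(x; p)

double chiSquare(const std::vector<double>& x, const std::vector<double>& y,
                 const std::vector<double>& sigma, ModelFn model, double p) {
    double chi2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double r = (y[i] - model(x[i], p)) / sigma[i];
        chi2 += r * r;
    }
    return chi2;
}

// User-chosen minimizer: scan the parameter over [lo, hi] in user-defined steps
// and return the value giving the smallest chi-square.
double scanFit(const std::vector<double>& x, const std::vector<double>& y,
               const std::vector<double>& sigma, ModelFn model,
               double lo, double hi, double step) {
    double bestP = lo, bestChi2 = chiSquare(x, y, sigma, model, lo);
    for (double p = lo + step; p <= hi; p += step) {
        double c = chiSquare(x, y, sigma, model, p);
        if (c < bestChi2) { bestChi2 = c; bestP = p; }
    }
    return bestP;
}

// Example model a user might register: a straight line through the origin.
double line(double x, double slope) { return slope * x; }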
Offline Compatibility:
- The package must allow users to tailor the sequence of mathematical
operations which will define an analysis on a set of data. Mathematical
operations include both functional operations on data as well as fits to
data. The source code used for the mathematical operations should be
available to users.
- Users must be able to include external software in their analysis.
Such software must be accommodated whether written in C++ or Fortran (or
other approved high level languages) as
either source code or as part of object libraries.
- A broad range of the functionality of the analysis package must be able to
be linked into user defined C++ or Fortran (or other approved HLL) code.
Prototyping:
- Control and mathematical routines must be able to be developed in ways that
allow prototyping of simple versions which can later be expanded upon.
- Prototyped sequences should contain as much of the full interface of
an arbitrarily complex version as possible. Elements of the interface
less important to user operation should be hidden.
DATA PRESENTATION
The results of data analysis must be able to be viewed interactively and saved
in standard formats for presentation to colleagues and for inclusion in
informal and formal publications. The analysis software needs to provide
interactive tools to modify the various features of graphical presentations
(colors, labels, etc.), and once the user is satisfied with the presentation on
a computer terminal, the software needs to preserve essentially this exact image.
Detailed Requirements:
- Interactive visualization: The analysis tool must provide a rich
interactive environment for creating, controlling and displaying histograms,
scatter plots, lego plots, and other graphical representations of the data.
Functionality should be at least that traditionally used in products like PAW
and Histoscope, including such things as interactive control of the look of the
display (colors, labels, etc.), bin size, scales, the ability to overlay fits or
other distributions, arrangement on the screen, etc. The configuration of
graphical objects must be able to be stored to be applied later to the same or
other data samples. Graphical objects must be able to be combined, compared,
and otherwise processed (e.g., adding or subtracting two histograms).
- Presentation quality graphical output:
Any of the graphical objects prepared interactively must be able to be
preserved in some set of standard representations (PostScript, PDF, GIF, or
JPEG) suitable for printing offline, inclusion in web pages, or e-mailing to
collaborators. The user must not have to know in advance that a
particular graph will be so preserved, but must be able to decide after
having viewed (and modified as desired) the graph.
- Formal publication of graphical output: Any of the
graphical objects produced interactively must be able to be formatted for
inclusion in formal publications. It should be easy to adjust certain
parameters of the display (for example, font size of labels) to meet
journal publication requirements.
USABILITY
Besides the specific functions described above, the software needs to obey
certain rules to ensure it can be widely and effectively used. These include
areas such as portability, performance, modularity, robustness, use of
standards, etc.
Detailed Requirements:
- Batch vs. interactive processing: Analysis tools must be capable of running both interactively and in batch mode.
Scripts derived from an interactive session must be able to be passed to a
batch job to reproduce the interactive analysis on a larger sample of
events.
- Sharing data structures: At user option, data (and command) structures
of various types must be capable
of being made available to others, with some granularity on how widely the
permission is granted (for example world-wide access, experiment-wide access, or
physics group-wide access). This access must be granted to files of special
types of data preserved in an analysis job, to selected samples of standard
format data, to analysis macros and selection criteria, and to definitions of
graphical output produced by an analysis job.
- Shared access by several clients: For online use, data structures (such as histograms) used for display purposes
must be capable of being dynamically updated by other running processes. The
data structures should be able to be shared among several jobs all having
simultaneous read access to the data structure, thus allowing the plots to be
viewed by several different users.
- Parallel processing (using distinct data streams): The analysis system must be capable of processing large numbers of events
efficiently. If a single processor is not capable of providing the required
throughput, the system should support simple parallel processing where different
servers analyze separate event streams, with the results being automatically
combined before presentation (a sketch of such result combination appears at
the end of this document).
- Debugging and profiling: Good, robust and reliable debuggers are
required in code development. Thus, the scripting language should have a debugger.
This requirement is not satisfied simply by the scripting language being
interactive and executed one line at a time; the debugger must support such
functionality as conditional breakpoints, etc. Likewise, profiling is
particularly relevant when building large software systems. Seamless
integration of the debugger/profiler is highly desirable.
- Modularity (user code): The analysis system (or framework) must be able to accommodate user-written modules,
so that these modules can be interactively called. These modules are written
in the preferred compiled language (C, C++, or Fortran), or in the scripting
language, and can be executed within the "framework".
This capability can be based on dynamic linking, pipes, RPC calls and
shared memory access on UNIX systems, or similar access methods. The data structures
created in the user code with the compiled
language must be accessible while running the interactive scripts, from within the
"framework" of the analysis tool. It is also desirable that all user-written methods or functions
be accessible in an interactive session.
- Modularity (system code): The routines making up the analysis package
itself must be capable of being linked into offline batch processes without
requiring the entire analysis framework to be included.
- Access to source code: Access must be provided to any shareware or
freeware software components. Some mechanism for source code access
should be provided for commercial components.
- Robustness: Lack of robustness falls into two categories: the first being things for which the user is responsible
(pilot error), and the second being system resources which the user had the
right to expect were present and functioning but which were, in fact, missing
or broken. The first class is connected with the user's interaction
with the system and suggests that the user interface must pay
attention to validating the user's input before acting on it and
potentially doing serious damage. Errors of
this sort can and should almost always be identified, reported and perhaps
even logged from within the interface. Thus the user interface should also be
regarded as a sort of gate keeper, denying access to the internals of the
system unless the action request is properly formed and completely valid
within the current context. The second class of exceptions
tends to be related to the system's management of its resources. Simply hanging or crashing when,
for instance, the event data server is unavailable is not acceptable. The
analysis system should exhibit as low a level as possible of such failures.
- Web based documentation: The analysis system documentation (including
tutorials and examples) must be available on the world wide web.
- Use of standards: Where there is an industry standard available, it
should be adopted as part of the analysis package even
if other HEP labs have not done so. Where there is no acceptable industry standard but
some sister lab or major experiment has developed a tool that survives critical
inspection, it could be adopted.
- Portability: The selected analysis
software must be able to run both on desktop systems and
on centrally available servers. It is desirable to
move the analysis task to the computers hosting the
appropriate data sets. Current platforms of interest are SGI, Linux,
Windows NT, Digital Unix, AIX and Solaris, with the versions of the operating
systems as specified in the standard Computing Division support standards. The ideal
package would support all of these platforms; at a minimum, at least
one of Linux and Windows NT, and at least two of SGI IRIX,
Digital Unix, IBM AIX and Sun Solaris, must be supported.
Demonstrated ability to port the analysis code to new OS
versions and platforms is a benefit.
- Scalability: The analysis software must be able to
gracefully scale from analysis of a handful of input data
files (<10GB) to analyses run over several hundred (if
not thousands) of input data files. Any optimizations
based on data sets being resident in (real or
virtual) memory must be able to be disabled and must not
severely degrade tool function for datasets
exceeding memory capacity. The software must be
configurable to enable many tens (~100) of simultaneous
users on large central server machines. Machine resources
(memory, CPU, network bandwidth) required by the analysis
processes should be well managed and well suited to the
likely configurations of central servers and desktops. It
is highly desirable that there be simple facilities for
running analysis jobs in parallel and then combining the
individual results in a statistically correct manner.
- Performance: The analysis software must be able to do
simple presentations (e.g., 2D histograms of files with an
event mask) at disk speed (at least 3 MB/s input and faster on higher
performance systems). Plot
manipulations and presentation changes of defined
histograms must be rapid and introduce no noticeable
delays to the user. Performance penalties for user
supplied code (e.g., routines from reconstruction code)
must not be more than a factor of 2 over native
(unoptimized) compiled code run standalone.
- User Friendliness: Learning to utilize the software to
the level of reading in a file of number pairs and
plotting the result should not take a competent physicist
more than 4 hours. Evaluators should be able to
become proficient to the level of defining an input
stream, performing a moderate selection/analysis
procedure including user supplied code and producing a
result suitable for presentation within 2 weeks. Manuals
should be lucid, complete, affordable and available.
Software application presentation and interface should be
common to all supported platforms, and data and
macro-like recipes must be easily exchangeable between
all platforms. Support for detailed questions
about internal operations of the software on data,
numerical methods, API formats and requirements, and
output formatting (both data and plots) must be available,
preferably directly to the users, but at least to a
moderate (~10) number of "experts" from each
experiment. The software must be configurable to remember
users' preferences and customizations and should allow for
multiple levels of customization (e.g., user, working
group, collaboration, Lab) for local definitions (e.g.,
printers) and enhancements.
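
As referenced in the parallel-processing and scalability requirements above,
the following sketch shows one statistically correct way to combine partial
histograms produced by parallel analysis jobs, assuming each job records
per-bin sums of weights and of squared weights; the structure names are
hypothetical.

// Sketch of statistically correct combination of partial results from parallel
// analysis jobs (hypothetical structures): contents add linearly and, because
// the per-bin sums of squared weights also add, bin errors combine in quadrature.
#include <cstddef>
#include <vector>

struct PartialHistogram {
    std::vector<double> sumW;      // per-bin sum of weights (bin contents)
    std::vector<double> sumW2;     // per-bin sum of squared weights (for errors)
};

PartialHistogram combine(const std::vector<PartialHistogram>& parts) {
    PartialHistogram total;
    if (parts.empty()) return total;
    total.sumW.assign(parts[0].sumW.size(), 0.0);
    total.sumW2.assign(parts[0].sumW2.size(), 0.0);
    for (std::size_t j = 0; j < parts.size(); ++j) {
        for (std::size_t b = 0; b < total.sumW.size(); ++b) {
            total.sumW[b]  += parts[j].sumW[b];
            total.sumW2[b] += parts[j].sumW2[b];
        }
    }
    return total;
}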