Computing by the Truckload
by Mike Perricone
Just for a moment, put aside the technical questions of how the recent
$800,000 purchase of more than 400 PCs will be used to advance scientific
research at Fermilab’s DZero experiment during the quest for new physics
at Collider Run II of the Tevatron.
Focus on the logistical issues involved in this largest single purchase of
computers (by number of units) in lab history:
What do you do when 400 computers show up on a truck? How and where
do you make room for them, even without monitors and keyboards? What
do you do with more than 400 boxes, and with all the styrofoam and the little
plastic bags inside? Where do you keep all these packing materials in case
you have to send stuff back during the 30-day trial period?
And after you have them unpacked, where do you plug in all these
computers? Do you even have 400 electrical outlets? Do you start them
all up at once? Will that trip all your circuit breakers? And those
400 little fans blowing out heat from the backs of the CPU cases—will
they turn your air-conditioned data center into a sauna?
When some of the 400 inevitably fuss, who fixes them? Who backs up
the files?
“Building a data center today is a formidable challenge,” said Gerry Bellendir
of Fermilab’s Computing Division, who has helped coordinate the process of
getting the computers into the lab and getting
them installed. The final count was 434 computers:
400 at the Feynman Computing Center and
another 34 for the online system at DZero.
“People thought data centers would go away when
desktops were introduced,” said Bellendir, whose
lab service dates back to 1969. “But people want to
use computers like a radio or telephone. They don’t
want to install new systems, update software, back
up files. And the enormous amounts of data from
the experiments must be maintained in an air-conditioned environment, with fire protection systems.”
First comes the investment of effort involved in
the ordering, receiving, checking, and moving the
shipment to the Feynman Center.
Bellendir emphasized the pivotal contributions of
Fermilab’s shipping, receiving, warehousing and
property departments. All the receiving, unpacking,
checking, tagging and repacking was done at
Site 38, the lab’s shipping and receiving center.
In addition, all the empty boxes (and styrofoam,
and cardboard inserts and little plastic bags)
were stored for the 30-day trial or “burn-in” period.
Combustible materials are not permitted in the
computing rooms at Feynman.
The computers—Atipa Technologies machines with dual 1.67 GHz Athlon CPUs—will process the data and prepare experimental results for analysis by DZero collaborators. The collection of machines will be used to run many parallel jobs. The 400 dual-CPU units at the Feynman Center deliver close to the effect of 800 computers in 400 housings. At Feynman,
they are being stacked in 25 racks, each holding
16 units in six square feet of floor space, with
240 on the second floor and 160 on the first floor.
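As a quick sanity check on those figures, here is a back-of-the-envelope sketch in Python; the inputs are only the numbers quoted above, and the code itself is just illustrative arithmetic:

```python
# Back-of-the-envelope check of the capacity figures quoted in the
# article; all the input numbers come straight from the text.
units = 400          # dual-CPU machines at the Feynman Computing Center
cpus_per_unit = 2
racks = 25
units_per_rack = 16
sqft_per_rack = 6    # floor space per rack, in square feet

assert racks * units_per_rack == units   # 25 racks x 16 units = 400 machines
print(units * cpus_per_unit, "effective CPUs")        # -> 800
print(racks * sqft_per_rack, "square feet of floor")  # -> 150
```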
More comparably large-scale purchases are coming: 240 for the CDF collaboration, and 72 for the Tier I computing center of U.S./CMS, located at
Fermilab. The Compact Muon Solenoid detector
(CMS) will operate with the Large Hadron Collider
at CERN in Geneva, Switzerland.
Cables and cooling ducts were installed before
the computers arrived. Bellendir said that while
miniaturization puts more computing power
into a smaller footprint, there is a cost. These
installations, for example, will boost power and
cooling requirements by 50 percent at the
Computing Center.
“It’s one of the main problems facing data centers
today,” Bellendir said.
The increased need for computing power reflects
the geometric expansion of data in high-energy
physics experiments, with DZero entering the
physics analysis phase full force for Run II—the
impetus for this purchase. Wyatt Merritt, head of
DZero Computing and Analysis in the Computing
Division, has seen computing become an
increasing share of experiment hardware, with
continual additions and upgrades beginning in
the earliest stages of commissioning and testing
the detector and its components.
“Then, when we reach the moment of truth where
commissioning is complete and everything is
running at or above design rates and sizes,”
Merritt said, “we put in place the last bit of
equipment needed—and then immediately start
to replace the first bits we bought, because at
least some parts of the computing plant have
a useful lifetime of less than five years.”
Now, while at their peak potential, the majority of the computers (the 260 computers on the second floor of Feynman) will be used largely to reconstruct the data from particle collisions at the same pace they are recorded by experimenters.
The raw data consists of independent events that
are collected and written to large files, with each
file then sent to a PC for reconstruction. These reconstructions are used to identify candidates for electrons, photons, jets and muons; to determine their location within the detector; and to measure their energy and/or momentum. They are also used to
find “missing energy,” which indicates the presence
of neutrinos in the detector. The 140 computers on
the first floor are for user analysis: making the output of the reconstruction available for experimenters (users) to examine and select the data samples they need for their areas of physics analysis. The
34 units at DZero will provide additional computing
power for online event selection as the Tevatron
luminosity increases during the run.
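The article does not describe DZero’s actual reconstruction software, but the pattern it outlines—independent events grouped into large files, each whole file handed to one PC—maps naturally onto an embarrassingly parallel job farm. The following is a minimal, purely illustrative sketch of that pattern; every name in it is invented for the example:

```python
# Schematic sketch of the file-per-PC reconstruction pattern described
# above. None of this is DZero software; the reader, the reconstruction
# step, and the file names are all invented stand-ins.
from multiprocessing import Pool

def read_raw_events(path):
    # Stand-in for reading independent collision events from one raw file.
    for i in range(3):
        yield {"file": path, "event": i}

def reconstruct_event(event):
    # Stand-in for the real work: identify electron/photon/jet/muon
    # candidates, locate them in the detector, measure energy and
    # momentum, and flag missing energy that signals neutrinos.
    return {"event": event["event"], "candidates": []}

def reconstruct_file(path):
    # Each file is processed whole on a single worker; events are
    # independent, so files can run in parallel with no cross-talk.
    return [reconstruct_event(ev) for ev in read_raw_events(path)]

if __name__ == "__main__":
    raw_files = [f"raw_{i:04d}.dat" for i in range(16)]
    with Pool(processes=4) as farm:   # the PC farm, in miniature
        results = farm.map(reconstruct_file, raw_files)
```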
From buying to burn-in
The first step in obtaining 434 PCs is ordering them—after deciding
what’s needed, that is. Lisa Giacchetti of Computing Division’s Operating
System Support Department/Scientific Computing Support coordinated
the process of compiling a Request for Bid document specifying
requirements of the systems (type and speed, memory, quantity and
capacity of disk, etc.), rack requirements, network and serial connection
wiring requirements and more. The request went out to PC vendors
who have passed Fermilab qualifications. Giacchetti also coordinated
components needed for installing and running the machines: power,
networking, floor space, receiving, tagging, safety. Atipa Technologies
won the bid. Once the order was shipped, receiving the large number
of PCs meant enlisting the help of receiving, warehouse and property
staffers. The entire receiving operation was conducted not at Feynman
Computing Center but at Site 38. Each PC was removed from its box,
inspected for damage, tagged, loaded 16 to a skid, shrink-wrapped, and
held at Site 38 until the Computing Center had room to bring them over.
Computing’s Equipment Logistics Services group also spent a great deal
of time helping at Site 38. The PCs had an acceptance period of 30 days,
and all the boxes and packing materials were held at Site 38. Not only was there no room at Feynman, but the computing center no longer allows boxes, skids, or other combustibles in the computer room. Once the
PCs—not household types, but still valuable commodities—were moved to
Feynman, they were secured overnight in locked cages. Atipa representatives
installed the units in the racks. Computing administrators booted up
the systems, one unit at a time, and began the 30-day “burn-in” with a suite
of software tools designed to stress the various hardware components
(CPU, memory, disk, network). The computers must meet specifications
or units can be returned.
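The article names the components the burn-in suite stresses—CPU, memory, disk, network—but not the tools themselves. Purely as an illustration of the idea, a trivial harness in that spirit might look like the sketch below; every name in it is invented:

```python
# Illustrative single burn-in pass; not Fermilab's actual tool suite.
# A real harness would repeat passes like this over the full 30-day
# acceptance window, exercise the network too, and log failures per
# machine so failing units can be returned to the vendor.
import os
import time

def stress_cpu(seconds=10):
    # Busy arithmetic loop to load the CPU.
    end = time.time() + seconds
    x = 0
    while time.time() < end:
        x = (x * 31 + 7) % 1_000_003

def stress_memory(mib=256):
    # Allocate a block and touch every page.
    block = bytearray(mib * 1024 * 1024)
    for i in range(0, len(block), 4096):
        block[i] = 0xAA

def stress_disk(path="burnin.tmp", mib=64):
    # Sequential write load, then clean up.
    chunk = os.urandom(1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(chunk)
    os.remove(path)

if __name__ == "__main__":
    stress_cpu(); stress_memory(); stress_disk()
    print("burn-in pass complete")
```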
“We delayed buying machines for as long as
possible to get the maximum computing power
within our budget,” said Amber Boehnlein,
co-leader of DZero Software and Computing.
“Experimenters need to look at the data quickly,
identify and fix any problems in the detectors or
the software as quickly as possible, and make
sure that our physics goals are met by finishing
analyses in a timely manner. The amount of data
we write to tape is increasing as the luminosity
improves.”
To keep the data flowing, the power must keep
flowing. The Computing Center recently added
a new array of Uninterruptible Power Supplies,
and installed a generator capable of supplying the
entire building and all its needs in the event of a
power outage at the laboratory. The infrastructure
improvements were made under the lab’s Utilities
Incentives Plan, a federal program that allows
investments of funds and expertise by utility
companies, with savings from the improvements
used to pay back the utilities’ initial investments.
How to keep the data flowing in the future,
throughout Run II, is a question under study.
The Computing Division believes it has the
resources to ensure smooth running through
FY’05, but has commissioned a study to
examine alternatives. Beyond that time, Fermilab’s
Associate Director for Operations Support, Jed
Brown, has commissioned a working group drawn from the Computing Division and the lab’s Facilities Engineering Services Section to formulate a 10-year plan for the infrastructure supporting computing.
“We also participate in a consortium called the
Uptime Institute, which is dealing with these types
of issues,” Bellendir said. “We’re trying to stay
in tune with the industry and where it’s going.
But we don’t know what the technology will be
two or three years from now. How do we estimate
what’s 10 years away? That’s the question we’re
all facing.”