NEWS RELEASE, 12/17/98
UC Berkeley students break sorting record, showing
the advantage of off-the-shelf hardware and clusters of inexpensive computers
By Robert Sanders, Public Affairs
BERKELEY -- A group of students at the University of California, Berkeley, now holds the world record for data sorting, having zoomed ahead of the competition, including Compaq Computer Corporation.
One of the key benchmarks in the world of commercial computing is how fast you can sort data - alphabetically, numerically or in other ways.
UC Berkeley undergraduates Joshua Coates and Spencer Low, along with graduate student Philip Buonadonna, recently broke the world record. Working with UC Berkeley's Millennium Project, they completed a standard benchmark sort, the Datamation sort, in 1.18 seconds - less than half the time of the previous record holder, another group at UC Berkeley.
Compaq is rumored to have recently broken the previous record, too, but the new UC Berkeley time is even faster.
A trophy is given to the champion sorter each spring at the top meeting for database geeks, the SIGMOD meeting, or Special Interest Group on Management of Data of the Association for Computing Machinery. That trophy will go to UC Berkeley if no one tops its record before next year's conference on May 31 in Philadelphia.
Sorting is one of the most common operations done by computers. Banks, credit card companies and government agencies all rely on a fast sorting mechanism to keep data flowing and information available to their systems. In fact, it is estimated that one-third of all processing or CPU (central processing unit) time in the world is spent on sorting data.
"In commercial computing, that's where the money is at," said David Culler, professor of computer science at UC Berkeley and head of the Millennium Project, which sponsored the students' work.
The feat was achieved using a cluster of 16 off-the-shelf PCs, each with two Pentium processors and two hard disks, running the Windows NT 4.0 operating system. The computers were networked together using an experimental protocol called VIA (Virtual Interface Architecture) developed as a commercial standard by Intel, Microsoft and Compaq.
"It's really something that you can do this with commodity PCs - they give you more bang for the buck," Coates said.
Most computer sorting today uses high-end machines, though clusters of computers are becoming more popular. Now most database companies produce software for clusters, said Culler.
"This shows that clusters are continuing to play an attractive role in sorting," he said. "The attractiveness of clusters is going to change the industry."
The Datamation sort has been used for more than a decade to determine the fastest sorter. It involves reading one million, 100-byte records off a disk, sorting them and then writing the data back to disk. This amount of data is about 100 Megabytes, or an estimated 20 million English words.
The previous record, 2.41 seconds, was set by the NOW (Network of Workstations) group at UC Berkeley in 1996. Recently, Compaq allegedly surpassed that Datamation sort record at 2.2 seconds.
Last week the UC Berkeley students cut that time nearly in half. When first proposed as a test of sorting systems in the mid 1980s, the sort took about 980 seconds.
The students originally set out to test the new VIA architecture with a real-world application, for a class project for Advanced Operating Systems, a course taught by Eric Brewer, assistant professor of electrical engineering and computer science. Brewer is known for founding Inktomi Corporation, which markets an internet search engine that uses clustered computers. That search engine started as a project in the same class several years ago.
"Distributed sorting is an excellent way to test VIA and the cluster as a whole," Coates said.
When they found out that the cluster of computers was performing extremely well, they tested it with the Datamation sort.
Though the Datamation sort is the oldest, other benchmark sorts exist, too. A second, the "minute sort," tests how many records can be sorted in a minute. That record is held by UC Berkeley's NOW group. A predecessor of the Millennium Project, NOW used a cluster of 100 high-end Unix workstations to set that record several years ago.
Compaq initially tried to break both the minute sort and Datamation sort records, using a similar cluster of Pentiums running Windows NT and using the VIA architecture. When they failed they created a third test, the "terabyte sort," which involves reading, sorting and rewriting to disk a million megabytes, or a terabyte, of data. A terabyte of data is much larger than almost all computer users deal with today, but such huge data sets are becoming more common, Culler said.
In November Compaq and Sandia National Laboratories announced they had broken the record for a terabyte sort, performing it in 50 minutes.
The Millennium Project is not able to attempt the terasort or minute sort yet, though Culler says they may try later.
The main goal of the Millennium Project is to demonstrate that you can take a large number of mass-market, commodity PCs and networks, harness them together with some special software, and get a powerful supercomputer for less than the current prices of supercomputers.
"The question is, how do you harness PCs together and get them to handle like a supercomputer?" Buonadonna said.
The "glue" the team used to hold its cluster together is called DCOM, which is Microsoft's "Distributed Component Object Model" - a binary level protocol that allows distributed objects in a cluster to talk to each other. Coates said that, though overly complicated, DCOM performed exceptionally well in the Millennium sort for remote execution.
The Millennium Project, http://now.cs.berkeley.edu/Millennium/, is funded
by Intel, maker of Pentium processors.
Send comments to: firstname.lastname@example.org