Speedily Searching the Web

How the Legendary Spider Inktomi Outdoes Yahoo, Infoseek and Lycos at Finding Things in Cyberspace

by Robert Sanders

Two computer scientists have introduced parallel computing to the Internet to create the fastest and most comprehensive "engine" now available to search the World Wide Web.

Called Inktomi, it searches a database of more than 1.3 million documents on the World Wide Web, a network that reaches around the world to provide ready access to words, pictures, sound and video. Inktomi is the largest index of web documents.

Inktomi addresses one of the main problems of the web today: as the number of documents on it skyrockets, it becomes a challenge to index every one and time-consuming to search the index. While Internet surfers happily skip from web site to web site in search of "cool" links, the Internet's true potential will be felt only when users can quickly search for and find desired sites.

"It's getting increasingly difficult to find things on the Internet," says Eric Brewer, an assistant professor of computer science who developed Inktomi with graduate student Paul Gauthier. "The problem is, it's very hard to have a large database and get good performance. With parallel computing you can have larger databases and high performance. Because we use commodity workstations, we have a much cheaper solution than anyone."

Parallel computing involves stringing many computers or microprocessors together to work on a problem simultaneously, a potentially faster and more powerful method than tackling the problem with a single large computer.

Inktomi (pronounced "ink to me") is the name of a mythological trickster spider of the Plains Indians. The search engine can be found at the web address http://inktomi.berkeley.edu.

The scientists are quick to distinguish Inktomi, which is a comprehensive index of documents on the web, from directories such as the popular Yahoo, which is a select list of web documents more akin to a table of contents.

Yahoo, started a year and a half ago by two Stanford graduate students, maintains addresses for perhaps 50,000 of the most useful documents on the web.

"With Inktomi you can find a lot more things than with Yahoo, but both are useful," Brewer says. "We're providing a more comprehensive search engine for the web without sacrificing speed."

Comparable search engines include Infoseek, which is as fast as Inktomi but can accommodate only one-fifth as many documents, and Lycos, which indexes slightly more than a million documents but is significantly slower than Inktomi.

The new search engine is one of the first fruits of a collaborative project at Berkeley to tie common desktop computers or workstations--just your average PC--into a powerful "network of workstations." Dubbed NOW, the project hopes to harness the power of inexpensive PCs into a parallel computer with the capabilities of a supercomputer--at a fraction of the cost.

Brewer emphasizes that parallel computing brings a unique power to search engines of any kind, whether they are searching a database of WWW addresses or a library catalogue. The major advantage is "scalability": as the database grows, he merely adds more inexpensive computers to maintain the system's quick response.
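
For readers who want to see the scalability idea in miniature, here is a rough sketch in Python. It is not Inktomi's code, and the article does not describe how Inktomi divides its index. The sketch simply assumes the index is split into slices, one per machine, that every query is sent to all slices at once, and that the partial answers are merged. The names Shard and ShardedIndex are invented for illustration, and threads stand in for separate workstations.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One hypothetical workstation holding a slice of the index."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        # .get() avoids creating empty entries for unknown words
        return set(self.postings.get(word.lower(), set()))


class ShardedIndex:
    """Index split across several shards; adding shards spreads the load."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Assumption: documents are assigned to shards by id, round-robin style
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, word):
        # Ask every shard at the same time, then merge the partial results
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = list(pool.map(lambda shard: shard.search(word), self.shards))
        merged = set()
        for partial in partials:
            merged |= partial
        return sorted(merged)
```

In this sketch each shard searches only its own slice, so a larger collection can be handled by creating the index with more shards rather than by buying a bigger single machine.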

Gauthier and Brewer built Inktomi using four outdated Sun workstations, and have designed it so that if three break down, Inktomi continues to have access to the entire index although at a reduced rate--reliability unmatched by search engines operating out of a single computer.
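
One way such graceful degradation could work is sketched below in Python. The article does not say how Inktomi achieves it, so this is only an illustration under a stated assumption: every machine is able to read every slice of the index, so the surviving machines can divide all the slices among themselves. The function name assign_partitions and the node names are hypothetical.

```python
def assign_partitions(num_partitions, live_nodes):
    """Spread all index partitions across whichever nodes are still running.

    Assumption: any node can serve any partition, so losing nodes only
    increases the work per survivor; no part of the index becomes unreachable.
    """
    if not live_nodes:
        raise RuntimeError("no live nodes left to serve the index")
    assignment = {node: [] for node in live_nodes}
    for partition in range(num_partitions):
        node = live_nodes[partition % len(live_nodes)]
        assignment[node].append(partition)
    return assignment
```

With four machines up, each serves one partition in this sketch; with a single survivor, that machine serves all four, which keeps the whole index reachable but makes each query slower.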

To find and catalogue all the addresses, Brewer and Gauthier developed a web crawler that periodically looks for new addresses. Here too parallel computing is important. Taking advantage of 32 networked computers within Berkeley's computer science building, Soda Hall, they assign the task of discovering new web sites to several machines at a time, often while those computers are being used by others.
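
The general shape of a parallel crawler can be sketched in a few lines of Python. This is a simplified illustration, not the crawler Brewer and Gauthier wrote: it assumes a breadth-first strategy in which each round of newly discovered addresses is fetched by several workers at once. To keep it runnable without a network, a small made-up link table (FAKE_WEB) and a stand-in fetch function (fake_fetch) take the place of real web pages.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(seeds, fetch_links, max_workers=4):
    """Breadth-first crawl; each round fetches the current frontier in parallel."""
    seen = set(seeds)
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier = []
            # Workers fetch pages concurrently; results are merged in one thread
            for links in pool.map(fetch_links, frontier):
                for url in links:
                    if url not in seen:
                        seen.add(url)
                        next_frontier.append(url)
            frontier = next_frontier
    return seen


# Hypothetical stand-in for the web: address -> addresses it links to
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fake_fetch(url):
    """Return the links found on a page (here, looked up in FAKE_WEB)."""
    return FAKE_WEB.get(url, [])
```

Calling crawl(["a"], fake_fetch) visits every address reachable from "a". In a real crawler fetch_links would download and parse a page, and the workers would be separate machines rather than threads.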

Access to Inktomi is supported by the NOW project in the Division of Computer Science of the College of Engineering.


Copyright 1995, The Regents of the University of California.
Produced and maintained by the Office of Public Affairs at UC Berkeley.
Comments? E-mail berkeleyan@pa.urel.berkeley.edu.