Election Web site shows power of new Internet technology for mining the "deep Web"
By Robert Sanders
01 NOVEMBER 00 | Want to find out which Hollywood stars donated to Vice President Al Gore's Presidential campaign? How about the home prices of the donors to Texas Governor George Bush's campaign? Or the crime rates in the neighborhoods of donors to either candidate?
As this year's presidential campaign climaxes, a Berkeley professor has created a Web site that makes such searches easy, and demonstrates the power of new Internet technology he has developed to mine the "deep Web."
"This is more powerful than search engines on the Web," said Joseph Hellerstein, associate professor of computer sciences in the College of Engineering, who created the site with Computer Sciences Associate Professor Michael Franklin, five graduate students and an undergraduate. "With this you can do real data analysis, not just find a neat new Web page."
The software that Hellerstein and Franklin developed is called Telegraph, after the street near campus. "Like the Berkeley main street after which it is named, Telegraph is the natural thoroughfare for a volatile, eclectic mix coming from all over the world," Hellerstein said.
The "deep Web" refers to Internet information not available by simply following hyperlinks, and thus not accessible through search engines like Google or Inktomi. Some people estimate the deep Web contains 500 times as much information as the rest of the Web, most of it in free databases that require a person to fill out a form in order to submit a query.
The database searches and cross-referencing that Hellerstein, Franklin and their students make available can be done by anyone willing to delve into publicly available databases compiled and run by the Federal Election Commission, the APBNews.com Crime Statistics site, the Yahoo Real Estate database, the Yahoo Actor and Actress List, the U.S. Census, and others. But such painstaking digging is laborious and time-consuming.
Hellerstein's Web site makes it easy by automating the form searches, so that information in one database can be brought up for comparison and correlation with information in other databases. The computer does the tedious data searching - "screen scraping" in computer jargon - while the new Berkeley technology choreographs the search in the most efficient way.
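The cross-referencing step described above can be pictured as a join on a shared key such as a Zip code. The following Python sketch is only an illustration of that idea, not Telegraph's actual code: the two tables stand in for the results of separate form queries, and every name and value in them (DONORS, CRIME_BY_ZIP, crime_profile, the candidates, the Zip codes) is an invented placeholder.

```python
from collections import Counter

# Invented placeholder rows standing in for the results of two separate
# form queries. None of these values come from a real database.
DONORS = [
    {"donor": "donor-1", "candidate": "Candidate A", "zip": "00001"},
    {"donor": "donor-2", "candidate": "Candidate A", "zip": "00002"},
    {"donor": "donor-3", "candidate": "Candidate B", "zip": "00002"},
    {"donor": "donor-4", "candidate": "Candidate B", "zip": "00003"},
]
CRIME_BY_ZIP = {"00001": "low", "00002": "medium"}


def crime_profile(donors, crime_by_zip):
    """Join donors to crime levels on Zip code and count levels per candidate."""
    profile = {}
    for row in donors:
        # A Zip code missing from the crime table is counted as "unknown".
        level = crime_by_zip.get(row["zip"], "unknown")
        profile.setdefault(row["candidate"], Counter())[level] += 1
    return profile


if __name__ == "__main__":
    for candidate, counts in crime_profile(DONORS, CRIME_BY_ZIP).items():
        print(candidate, dict(counts))
```

In a real system the hard part is the part this sketch skips: fetching each table by submitting Web forms and deciding in what order to query the sources.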
"This is about the facts and figures on the Web. Web crawlers can't get to this information," he said.
For example, journalists can trace money spent by political action committees by poring over lists of donors and tracking the money back through numerous other PACs to its corporate or private source. The Berkeley team has set up a way to connect the money easily by automatically "crawling" the donations back to the source.
"You can track the six degrees of separation of PACs - which PACs give to other PACs," Hellerstein said. Philip Morris, doesn't give directly to Bush, but through its PAC gives to other PACs, like the Fund For a Responsible Future, that in turn give to Bush. Similarly, the AFL-CIO doesn't give directly to Gore, but does give to other PACs, like the Evergreen Fund, which give to Gore.
Alternatively, a comparison of crime rates in the neighborhoods of donors to the two candidates shows that Bush's donors live predominantly in low-crime areas, while Gore's donors are spread out among low- and medium-crime areas.
And what do Gwyneth Paltrow, Jack Nicholson, Candice Bergen and Jerry Seinfeld have in common? They've each given to the Gore campaign, along with a slew of other actors and actresses. It's hard to identify any Hollywood donors to the Bush campaign.
"With software like this, the publicly available databases are more powerful than people thought," Hellerstein said.
And, he added, potentially more scary. With all the databases on the Web, it also is possible to correlate names with addresses for the entire U.S. population, and cross-reference those with individual home prices, neighborhood crime statistics and even AIDS infection rates. Marketing firms with their own private data could mine veritable gold by combining their data with these deep Web databases.
"This software enables new things, and those things have consequences," Hellerstein said. "Obviously it's good to have some of these databases publicly available, but they can lead to serious invasions of privacy. This technology may cause people to rethink the balance between freedom of information and privacy."
For the moment, Hellerstein and his students have scripted specific database searches and made them available on their Web site, such as a correlation of crime rate with a campaign donor's Zip code. More general comparisons could be allowed, but a user would have to learn how to write a proper query. Many users would not go to that trouble, Hellerstein said, so making that process easier is an area of future research.
Copyright 2000, The Regents of the University of California.
Produced and maintained by the Office of Public Affairs at UC Berkeley.