University of California at Berkeley

In Chinese, Data Searches Require a Different Approach

by Fernando Quintero

Anyone who has ever experienced the technological marvel of typing a few key words into an online search and triggering a listing of sources on the most arcane topic will appreciate the kind of research being done by Fred Gey.

Gey and his colleagues at the UC Data Archive and School of Information Management and Systems have been at the forefront of research into effective retrieval of information from large document collections, research that makes "key word" computer searches possible.

Since 1991, Gey, assistant director of the UC Data Archive, along with Information Management and Systems Professor William Cooper and a number of graduate students, has participated in an annual international showcase and conference at which retrieval methods are evaluated.

These conferences bring together top document retrieval researchers from around the world to demonstrate their techniques.

The Berkeley research group, whose algorithms and methods have always performed well in English, now awaits the results for its new Chinese retrieval system at the next Text Retrieval Conference, to be held in Washington Nov. 20 to 22.

"Computerized document retrieval for Oriental languages presents a special challenge because they have no blanks or white space to denote word boundaries," said Gey.

"It is as if all words in English were run together."

To run retrieval against a collection of 70,000 news stories from Beijing's People's Daily newspaper and the Xinhua News Agency's Chinese wire service, the Berkeley group had to modify their computers to read and display Chinese characters.

They also had to extend their text retrieval software to store and retrieve Chinese documents.
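Because Chinese text carries no spaces between words, one widely used way to index it is to break every document into overlapping two-character pieces, or bigrams, and to match queries against those pieces. The sketch below illustrates that general idea only; the article does not say which method the Berkeley group used, and the function names and sample documents are invented for illustration.

```python
def char_bigrams(text):
    """Return the overlapping two-character pieces of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every bigram of the query."""
    grams = char_bigrams(query)
    if not grams:
        return set()
    result = None
    for gram in grams:
        postings = index.get(gram, set())
        result = set(postings) if result is None else result & postings
    return result


# Hypothetical sample documents: English run together, as in Gey's
# analogy, plus one short Chinese string.
sample_docs = {
    1: "itisasifallwords",
    2: "wereruntogether",
    3: "中国艾滋病病例",
}
sample_index = build_index(sample_docs)
```

With this sample data, `search(sample_index, "words")` finds document 1 even though the document contains no spaces. The approach is approximate: a document can contain all of a query's bigrams without containing the query as one continuous string, so practical systems usually add ranking or a dictionary-based word segmenter on top.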

Gey said the team's three Information Management and Systems graduate students, who hail from the People's Republic of China, were invaluable.

For example, a query on "Cases of AIDS in China" would have missed documents that use what the students pointed out is another common term for AIDS in Hong Kong and Taiwan, one that loosely translates as "love disease."
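The problem the students caught can be handled by expanding a query with known variant terms before searching. The following is a minimal sketch under stated assumptions: the synonym table, the function names and the sample documents are hypothetical, and the two Chinese terms shown (a mainland form and a form common in Taiwan and Hong Kong) are supplied here as an illustration rather than taken from the article.

```python
# Hypothetical variant table: one term mapped to regional alternatives.
SYNONYMS = {
    "艾滋病": ["愛滋病"],
}


def expand_query(terms, synonyms):
    """Return the query terms plus any known variants, without duplicates."""
    expanded = []
    for term in terms:
        for variant in [term] + synonyms.get(term, []):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


def matching_docs(docs, terms):
    """Return sorted ids of documents containing at least one of the terms."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if any(term in text for term in terms)
    )


# Hypothetical sample documents using the two different terms.
sample_docs = {
    "a": "中国艾滋病病例报告",
    "b": "香港愛滋病個案",
}
```

Searching the sample documents for the original term alone returns only document "a"; searching with `expand_query(["艾滋病"], SYNONYMS)` returns both.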

The Chinese text retrieval project is a part of a three-year National Science Foundation award Gey received in September for his research into probabilistic document retrieval.

Information Management and Systems Professor Emeritus M.E. Maron was elected a fellow of the American Association for the Advancement of Science for his development of theories of probabilistic document retrieval, which ranks documents in order of the probability that they would be useful to someone looking up a particular subject.
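The idea of ordering documents by their probability of usefulness can be sketched in a few lines. The example below is not the model used by Maron or the Berkeley group; it assumes, purely for illustration, that the probability comes from a logistic function of the number of terms a query and a document share, with made-up coefficients (`intercept`, `weight`) and made-up sample documents.

```python
import math


def relevance_probability(query_terms, doc_terms, intercept=-2.0, weight=1.5):
    """Estimate the probability that a document is useful for a query.

    The coefficients are hypothetical; a real system would fit them to
    documents already judged relevant or not relevant.
    """
    overlap = len(set(query_terms) & set(doc_terms))
    log_odds = intercept + weight * overlap
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(query_terms, docs):
    """Return (doc_id, probability) pairs, most probably useful first."""
    scored = [
        (doc_id, relevance_probability(query_terms, terms))
        for doc_id, terms in docs.items()
    ]
    # Sort by descending probability, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical sample documents, each already split into terms.
sample_docs = {
    "d1": ["aids", "china", "cases"],
    "d2": ["china", "trade"],
    "d3": ["weather"],
}
```

For the query `["aids", "china"]`, `rank` places "d1" first (two shared terms), then "d2" (one), then "d3" (none), so the reader sees the most probably useful documents at the top of the list.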

Student researchers are Aitao Chen, Jianzhang He and Lijiang Xu. Technical support is provided by Jason Meggs.


Copyright 1996, The Regents of the University of California.
Produced and maintained by the Office of Public Affairs at UC Berkeley.