UC Berkeley News
Press Release

NEH, Google boost Internet coding project

New research awards just announced will allow the Script Encoding Initiative (SEI) at the University of California, Berkeley, to continue its pioneering work for the next two years: enabling users of the native scripts of all writing systems, from ancient hieroglyphs to Hungarian Runic to Aramaic and more, to use the Internet in those scripts.

With nearly $300,000 worth of new funding from the National Endowment for the Humanities and Google, the four-year-old initiative will concentrate on encoding eight scripts in 2007 and a like number in 2008.

Modern scripts slated for Internet encoding include Javanese and additions for Native American languages such as Naskapi, Blackfoot and Cree. Historical scripts earmarked for work include Mandaic, used for the liturgical language of the Mandean religion, and Tangut, the script of an extinct Sino-Tibetan language formerly spoken in northwestern China.

There are more than 80 writing systems not yet in Unicode, the international character-encoding standard used on the Internet. Half of these systems are used in languages spoken by linguistic minorities around the world. Users of these languages must rely on nonstandard fonts that make searching the Web, accessing electronic documents and handling e-mail difficult, if not impossible, said Deborah Anderson, a researcher in UC Berkeley's Linguistics Department and head of the SEI.

"For scholars working with obsolete computing technologies, valuable data is destined for the electronic dustbin unless they are updated to modern computing standards," said Anderson.

"However, with these scripts in Unicode, accurate searching across the Web will be possible, and materials will be saved in a standardized format that will remain accessible for many years to come," she said.

The project will help create fonts for modern scripts so they can be put into use more quickly. It also will identify the local conventions of the languages, such as standard date, time and currency formats, thus enabling the creation of software specific to a given language and the locations where it is spoken.
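As a rough illustration of what "local conventions" means to software, the sketch below stores date and currency formats per locale and applies them. It is only a minimal sketch: the `LocaleConventions` class, the `CONVENTIONS` table and the helper functions are hypothetical names, and the two entries are well-known conventions for U.S. English and German chosen as stand-ins, not data gathered by the SEI.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LocaleConventions:
    """Hypothetical container for a few locale-specific formats."""
    date_format: str       # strftime pattern for short dates
    decimal_sep: str       # separator before the fractional part
    group_sep: str         # separator between groups of thousands
    currency_pattern: str  # uses {symbol} and {amount} placeholders
    currency_symbol: str


# Illustrative entries only (assumption: stand-ins for real locale data).
CONVENTIONS = {
    "en_US": LocaleConventions("%m/%d/%Y", ".", ",", "{symbol}{amount}", "$"),
    "de_DE": LocaleConventions("%d.%m.%Y", ",", ".", "{amount} {symbol}", "€"),
}


def format_date(d: date, locale_id: str) -> str:
    """Format a date using the locale's short-date pattern."""
    return d.strftime(CONVENTIONS[locale_id].date_format)


def format_currency(value: float, locale_id: str) -> str:
    """Format an amount with the locale's separators and symbol placement."""
    conv = CONVENTIONS[locale_id]
    # Start from a fixed, known shape such as "1,234.50", then swap separators.
    whole, frac = f"{value:,.2f}".split(".")
    whole = whole.replace(",", conv.group_sep)
    amount = f"{whole}{conv.decimal_sep}{frac}"
    return conv.currency_pattern.format(symbol=conv.currency_symbol, amount=amount)


if __name__ == "__main__":
    for locale_id in CONVENTIONS:
        print(locale_id, format_date(date(2007, 3, 5), locale_id),
              format_currency(1234.5, locale_id))
```

Software for a newly supported language would need an entry of this kind before it could display dates and prices the way local readers expect.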

So far, SEI has encoded more than 20 writing systems, including Balinese and N'ko, which is used for the Mande languages in West Africa.
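Once a script is in Unicode, each of its characters has a standard code point and name that any software can look up. The sketch below uses Python's standard `unicodedata` module to inspect sample N'Ko and Balinese characters. The helper names `describe` and `looks_like_script` are hypothetical, and matching on character-name prefixes is a simplification used here for illustration, not a standard script-detection method.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def looks_like_script(text: str, name_prefix: str) -> bool:
    """Rough check: do all letters have names starting with the given prefix?

    Assumption: name prefixes such as "NKO " or "BALINESE " are used as a
    stand-in for a real script property lookup.
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith(name_prefix) for ch in letters
    )


# Sample characters from the N'Ko (U+07C0..U+07FF) and
# Balinese (U+1B00..U+1B7F) blocks.
nko_sample = "\u07CA\u07CB"
balinese_sample = "\u1B13"

if __name__ == "__main__":
    for sample in (nko_sample, balinese_sample):
        for code_point, name in describe(sample):
            print(code_point, name)
    print(looks_like_script(nko_sample, "NKO "))
```

Text typed with a nonstandard font has no such identity: its characters are stored as unrelated code points, which is why searching for it is unreliable.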

SEI also works to encode historical scripts such as Egyptian hieroglyphs for use by researchers and historians.

Proposing minority and historical scripts for inclusion in Unicode often requires significant research, but Anderson said that communities using these scripts have little economic or political means of participating, so script proposals have been sporadic.

Scripts used by smaller communities have not yet been included in Unicode because they typically don't represent a large market share for software companies. Also, it is harder to find information on the scripts missing from Unicode, Anderson said.

At the current slow pace of encoding, authorities estimate that about 40 scripts will still not be encoded in 10 years. This could lock many linguistic minorities and scholarly communities out of the Information Age, preventing them from searching for text and sending e-mail in their own writing systems and impairing Internet availability of health information and literacy materials in minority languages, Anderson said.

The total project over 10 years is estimated to require upwards of $4 million. The project to date has raised less than 20 percent of this amount.

The Script Encoding Initiative Web page is located at: http://linguistics.berkeley.edu/sei.