Multilinguality Issues in Digital Libraries



Hachim Haddouti

FORWISS (Bavarian Research Center for Knowledge-Based Systems)

Orleanstr. 34

80689 Munich, Germany





In the near future, many digital libraries will be set up containing large collections of information in a great number of languages. However, it is impractical to submit a query in every language in order to retrieve these multilingual documents. A multilingual retrieval environment is therefore essential for benefiting from worldwide information resources.

In many applications, documents such as geographic maps, bibliographic records, and directory services written in non-Roman scripts such as Japanese, Arabic, Chinese, or Hebrew are transliterated into Roman characters. Transliteration matches characters syntactically; it does not translate meaning. This technique leads to a considerable loss of information. For instance, Spanish contains many Arabic words which have been more or less transliterated. Some of these words are unfamiliar to native Arabic speakers because they sound very different from the original Arabic phonetics, e.g. Ojalá (hopefully) or Gibraltar (the British territory at the southern tip of Spain).
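The loss can be made concrete with a small sketch. The mapping below is a toy, illustrative table, not any standard romanization scheme: several distinct Arabic letters collapse onto the same Roman letter, so the original spelling cannot be recovered from the transliterated form.

```python
# Toy Arabic-to-Roman transliteration table (illustrative only, not a
# standard scheme). Distinct letters collapse onto the same Roman letter.
TRANSLIT = {
    "\u062a": "t",  # ta
    "\u0637": "t",  # emphatic ta: also "t", the distinction is lost
    "\u0633": "s",  # sin
    "\u0635": "s",  # emphatic sad: also "s", the distinction is lost
}

def transliterate(word: str) -> str:
    # Characters without an entry pass through unchanged.
    return "".join(TRANSLIT.get(ch, ch) for ch in word)

# Two different Arabic letters become indistinguishable after transliteration,
# so the mapping is many-to-one and cannot be inverted without loss.
collision = transliterate("\u062a") == transliterate("\u0637")
```

Because the mapping is many-to-one, a search over the transliterated text cannot distinguish words that differed in the original script.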

Other losses involve accents, umlauts, and other language-specific marks: variant forms of a word will not match properly. Search engines search for what a user types in, sometimes applying fuzzy logic for stemming and query expansion. But how can you search for documents written in Arabic if your terminal does not support Arabic input? Consider the naming of organizations, buildings, and so on: in most cases names are given in the local language. Many OPACs are accessible via the Internet, but records encoded in non-ASCII character sets are difficult to read from a foreign terminal.
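The accent problem can be illustrated with a minimal sketch of accent folding, a technique many search engines apply at indexing time: decompose each character, then drop the combining marks. This makes "résumé" match "resume", but it is itself lossy, since words that legitimately differ only in their diacritics also collapse together.

```python
import unicodedata

def fold_accents(text: str) -> str:
    # Decompose characters (NFD), then drop combining marks so that
    # accented and unaccented spellings compare equal.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Folding makes variant spellings match...
match = fold_accents("résumé") == "resume"
# ...but in German it also conflates "schön" (beautiful) with
# "schon" (already): a false match introduced by the folding itself.
false_match = fold_accents("schön") == fold_accents("schon")
```

The design trade-off is exactly the one described above: folding improves recall for users who cannot type the marks, at the price of precision.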

The approach of cross-language information retrieval allows users to formulate queries in one language and retrieve documents in others. With the dictionary-based technique, queries are translated into a language in which a document may be found. However, this technique sometimes yields unexpected results, because many words have more than one translation and the alternative translations can have very different meanings. Additionally, the scope of a dictionary is limited; for instance, it lacks the technical and topical terminology essential for interpreting a query correctly. The corpus-based technique seems promising: it analyzes large collections of existing texts and automatically extracts the information on which the translation will be based. However, it tends to require the integration of linguistic constraints, because purely statistical extraction can introduce errors and thus degrade retrieval performance.
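The ambiguity problem of dictionary-based query translation can be sketched in a few lines. The bilingual lexicon below is hypothetical; it shows how a naive translator, lacking any disambiguation, must keep every candidate translation, including ones with entirely different meanings.

```python
# Hypothetical English-to-German lexicon for a dictionary-based
# query translator. Ambiguous entries carry several candidates.
LEXICON = {
    "bank": ["Bank", "Ufer"],         # financial institution vs. river bank
    "interest": ["Zins", "Interesse"] # loan interest vs. personal interest
}

def translate_query(terms):
    # Naive expansion: without sense disambiguation, every candidate
    # translation of every term is added to the target-language query.
    translated = []
    for term in terms:
        translated.extend(LEXICON.get(term.lower(), [term]))
    return translated

expanded = translate_query(["bank", "interest"])
# A two-word query becomes four terms, two of which are wrong for
# any single intended sense, diluting retrieval precision.
```

This is why the text above notes that alternative translations with different meanings lead to unexpected results; corpus statistics or linguistic constraints are needed to prune the wrong senses.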

The development of multilingual retrieval systems is very limited because of high cost and complexity. Most such applications are based on thesauri, which are very expensive to build. Stemming, word-boundary identification, and stopword lists must be defined for each language. Appropriate input and presentation methods must be supported, e.g. right-to-left, left-to-right, or top-to-bottom. Term indexing also differs from language to language: some languages are written without spaces between words, in which case character-based indexing is more suitable than word-based indexing. Thus, profound investigation is needed in the following areas: machine translation systems, natural language processing, advanced linguistic processing tools, morphological analysis, lexical-semantic information extraction, terminology extraction, and algorithms for the alignment of translated texts.
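For scripts written without spaces, character-based indexing typically means overlapping character n-grams. A minimal sketch, using bigrams as the index unit:

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Index units for scripts written without spaces between words
    (e.g. Chinese, Japanese): overlapping character n-grams replace
    whitespace-delimited tokens."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# The Japanese word for "library" yields two overlapping bigrams,
# both of which become postings in the index:
grams = char_ngrams("図書館")
```

A query is segmented the same way, so matching reduces to n-gram overlap and no word-boundary detector is required, at the cost of a larger index and some spurious matches across true word boundaries.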

Several information retrieval issues have already been studied for languages other than English, among them Chinese, Japanese, and Spanish. Both Spanish and Chinese retrieval were evaluated at TREC 96 (Text REtrieval Conference, NIST, 1996).


On the WWW, there is often a problem presenting home pages written in "foreign", non-western languages such as Chinese, Arabic, or Thai. The WWW originally paid little or no attention to issues such as character encoding, multilingual documents, or the specific requirements of particular languages and scripts. We cannot expect every user to install fonts for all character sets in order to display documents written in Arabic, Chinese, Greek, etc. The well-known web browsers mainly support ISO-8859-1 and the specifics of several western languages. Some local solutions have been implemented by browser providers to meet national and local needs. However, these browsers are usually limited to a few languages; an Arabic web browser, for example, cannot display documents written in Greek or Chinese. Such local solutions lead to isolation.

Unicode seems to be the Santa Claus for character set and data exchange problems. However, migration of legacy data should be loss-free and at minimum cost. Unicode is a single 16-bit character set which allows the encoding of more than 65,000 characters; this means that most known languages in the world are covered by one code. The operating systems of most major vendors (Microsoft, IBM, DEC, Sun, and Apple) use Unicode. HTML 3.0 proposes to establish Unicode as the reference character set for future Web pages. Alis Technologies produced Tango, which supports all major business languages for display and input purposes. Accent's Multilingual Mosaic is based on Unicode, the Microsoft FrontPage editor supports Unicode HTML as well, and Java was designed around Unicode. The database systems Oracle, Sybase, Informix, and Adabas provide Unicode support.
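The point that one code covers what no single legacy code page can is easy to demonstrate: a single Unicode string can mix Latin, Greek, Arabic, and Chinese characters, each identified by one code point.

```python
# One Unicode string mixing four scripts that no single legacy
# 8-bit code page (e.g. ISO-8859-1) could represent together:
# Latin A, Greek alpha, Arabic ain, Chinese zhong.
mixed = "A\u03b1\u0639\u4e2d"

codepoints = [ord(c) for c in mixed]

# All four code points fit in 16 bits, i.e. within the 65,536-code
# space of the original 16-bit Unicode design described above.
all_16bit = all(cp < 0x10000 for cp in codepoints)
```

With a Unicode-based browser or database, the burden shifts from installing a code page per language to supplying fonts, which is exactly the problem the multilingual browsers cited below address.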

The development of digital libraries is becoming increasingly relevant. However, most research and development activities are focused on only one language. This is not the objective of digital libraries or of the Internet philosophy: both technologies aim at establishing a global digital library containing information resources from different areas, different countries, and different languages. Access to those materials should be assured for the worldwide community and never restricted by language barriers. Thus, the EU has funded several projects addressing multilingual issues. For instance, in the ESPRIT project EMIR (European Multilingual Information Retrieval), a commercial information retrieval system, SPIRIT, was developed which supports French, English, German, Dutch, and Russian. UNESCO has launched projects to democratize and globalize access to the world's cultural patrimony, such as Memory of the World. Recently, UNESCO started the MEDLIB project, which aims at creating a virtual library for the Mediterranean region. The Mediterranean basin presents diverse cultural, linguistic, and historical patrimonies, and the language diversity of this region will play a key role in launching projects addressing multilingual issues. The sooner the Mediterranean community becomes involved in these discussions and projects, the sooner it will benefit from high-tech development worldwide.

Interesting work in this area has been presented by Bruce Croft [1], Mark Davis [2], Christian Fluhr [3], Douglas Oard [4], Carol Peters [5], Shigeo Sugimoto [6], and others. A working group on Multilingual Information Access for Digital Libraries has been established. The aim of this group, which is sponsored by the U.S. NSF (National Science Foundation) and the EU (European Union), is to bring American and European scientists together to plan a common research agenda and to discuss research issues and results in this area (more about this group is available at:


[1] W. B. Croft. What Do People Want from Information Retrieval? (The Top 10 Research Issues for Companies that Use and Sell IR Systems). D-Lib Magazine, November 1995

[2] M. W. Davis and T. E. Dunning. A TREC Evaluation of Query Translation Methods for Multi-lingual Text Retrieval. In D. K. Harman (ed.), TREC-4, NIST, November 1995

[3] C. Fluhr, D. Schmit, F. Elkateb, P. Ortet, K. Gurtner. Multilingual Database and Crosslingual Interrogation in a Real Internet Application. In Working Notes of the AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, Stanford, CA, 1997

[4] D. W. Oard and B. J. Dorr. A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies, 1996

[5] C. Peters and E. Picchi. Across Languages, Across Cultures: Issues in Multilinguality and Digital Libraries. D-Lib Magazine, May 1997

[6] A. Maeda, T. Fujita, L. S. Choo, T. Sakaguchi, S. Sugimoto, K. Tabata. A Multilingual Browser for WWW without Preloaded Fonts. In Proc. of ISDL'95, August 1995