Compounds and other oddities in Machine Translation
---------------------------------------------------

To translate to and from Germanic languages like Norwegian or English there must be a way of dealing with compounds, preferably an automated one. The problem is that one language's compound is another language's phrase. For instance, the noun-noun compound "iste" in Norwegian corresponds to "iced tea" or, more recently, "ice tea" in English. At other times the meaning of the parts does not reflect the meaning of the compound itself, as is the case with all bahuvrihi compounds like "highbrow" or "numbskull".

As part of the LOGON machine translation project [1], which aims to translate from Norwegian to English, I am looking at automatic translation of compounds and other phenomena that fall between words and clauses. One cannot translate to and from Norwegian without a way of dealing with compounds, as compounding is "extremely productive in Norwegian" [2]. Luckily, there already exists a compound analyzer for Norwegian [2], made as part of the Oslo-Bergen tagger. That still leaves the problem of actually translating the compounds.

The orthographic definition of a compound varies from language to language. In Norwegian all compounds are written without spaces between the stems, though the stems are sometimes separated by a hyphen or an epenthetic {e} or {s}. In English, however, newly made compounds are generally written with spaces between the stems, while older, lexicalized compounds are written without spaces or with hyphens. While some compounds can simply be translated directly, stem by stem, the remaining problem in automatically translating a Norwegian compound into something equivalent in English is that English often uses adjective-noun compositions, "noun of noun" structures and fixed expressions or phrases where Norwegian would use noun-noun composition.
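The Norwegian orthographic conventions described above can be illustrated with a small sketch. It is not the compound analyzer of [2]; it is a naive splitter written for illustration only, assuming a tiny hypothetical stem lexicon (LEXICON) and treating a hyphen or an epenthetic "e" or "s" as optional linking material between stems. Apart from "iste", the example words are my own and do not come from the dictionary discussed in this abstract.

```python
from typing import List, Set

# Hypothetical toy lexicon of Norwegian stems (assumption for illustration);
# a real analyzer would consult a full-size lexicon.
LEXICON: Set[str] = {"is", "te", "barn", "hage", "arbeid", "giver", "fisk", "suppe"}

# Linking elements allowed between stems: nothing, a hyphen,
# or an epenthetic "e" or "s".
LINKERS = ("", "-", "e", "s")


def split_compound(word: str, lexicon: Set[str] = LEXICON) -> List[List[str]]:
    """Return every segmentation of `word` into stems found in `lexicon`."""
    word = word.lower()
    results: List[List[str]] = []

    def walk(start: int, stems: List[str]) -> None:
        for end in range(start + 1, len(word) + 1):
            stem = word[start:end]
            if stem not in lexicon:
                continue
            if end == len(word):
                # The stem reaches the end of the word: a complete analysis.
                results.append(stems + [stem])
                continue
            for link in LINKERS:
                # A linker must be followed by at least one more character,
                # so a word can never end in a dangling linking element.
                if word.startswith(link, end) and end + len(link) < len(word):
                    walk(end + len(link), stems + [stem])

    walk(0, [])
    return results


if __name__ == "__main__":
    for w in ("iste", "barnehage", "arbeidsgiver", "is-te"):
        print(w, split_compound(w))
```

The function returns all analyses rather than a single one because a larger lexicon will generally make some compounds ambiguous; choosing among the candidates is a separate problem that this sketch does not address.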
Fortunately, large Norwegian-English dictionaries contain plenty of compounds, and the translation patterns of these can be used to build up frequencies and strategies for automatically translating never-before-seen compounds. I have access to one such dictionary stored as XML. XML is a way of encoding trees, so I am treating the dictionary as a treebank of aligned words and expressions, using existing tools developed for the Penn Treebank for search and maintenance.

LOGON is based on transfer of Minimal Recursion Semantics structures [3] (MRSes), and to properly generate the target string it is necessary to have a parsed version of the English translation of the compound. One way of achieving this is to collate a lexicon of preparsed pairs of translations of compounds, and then try to use these directly when encountering new compounds, e.g. in the fashion of Example-Based Machine Translation [4].

Bilingual dictionaries are also a source of fixed expressions, and these can, to a degree, be treated like compounds by being stored in a preparsed form in a bilingual lexicon for later lookup. Unfortunately, the idiosyncratic nature of fixed expressions like idioms means that previously unseen expressions cannot easily be mapped onto known ones. However, using bitexts there are several ways of discovering matching expressions, which can then be added to the existing repository.

The challenge then is to build the "augmented" treebank of known compounds and larger expressions by reducing the dictionary to contain only the relevant entries, generating the appropriate MRSes for these, and building an adaptation mechanism, with lookup and adaptation by orthographic form or MRS. The results will be presented at the conference.

References:

[1] Lønning, J. T., Oepen, S., Beermann, D., Hellan, L., Carroll, J., Dyvik, H., Flickinger, D., Johannessen, J. B., Meurer, P., Nordgård, T., Rosén, V. and Velldal, E. LOGON. A Norwegian MT effort. Proceedings of the Workshop in Recent Advances in Scandinavian Machine Translation, page 6. Uppsala, Sweden, 2004.

[2] Johannessen, J. B. and Hauglin, H. An Automatic analysis of compounds. In T. Haukioja, editor, Papers from the 16th Scandinavian Conference of Linguistics, pages 209--220. Turku/Åbo, Finland, 1996.

[3] Copestake, A., Flickinger, D., Malouf, R., Riehemann, S. and Sag, I. Translation using Minimal Recursion Semantics. Proceedings of the 6th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-95). Leuven, Belgium, 1995.

[4] Carl, M. and Way, A., editors. Recent Advances in Example-Based Machine Translation. Kluwer, 2003.