Compounds and other oddities in Machine Translation
---------------------------------------------------

To translate to and from Germanic languages like
Norwegian or English there must be a way of
dealing with compounds, preferably automated.
The problem is that one language's compound is
another language's phrase. For instance, the
noun-noun compound "iste" in Norwegian corresponds
to "iced tea" or, more recently, "ice tea" in
English. Other times the meaning of the parts does
not reflect the meaning of the compound itself, as
is the case with all bahuvrihi compounds like
"highbrow" or "numbskull".

As part of the LOGON machine translation project
[1], which aims to translate from Norwegian to
English, I am looking at automatic translation of
compounds and other phenomena in-between words and
clauses. One cannot translate to and from
Norwegian without a way of dealing with compounds,
since compounding is "extremely productive in
Norwegian" [2]. Luckily, there already exists a
compound analyzer for Norwegian [2] made as part
of the Oslo-Bergen tagger. That still leaves the
problem of actually translating the compounds.

The orthographic definition of a compound varies
from language to language. In Norwegian, all
compounds are written without spaces between the
stems, though the stems are sometimes separated by
a hyphen or an epenthetic {e} or {s}. In English,
however, newly made compounds are generally
written with spaces between the stems, while
older, lexicalized compounds are written either
without spaces or with hyphens. While some
compounds can simply be translated directly, stem
by stem, the remaining problem in automatically
translating a Norwegian compound into something
equivalent in English is that English often uses
adjective-noun compositions, "noun of noun"
structures and fixed expressions or phrases where
Norwegian would use noun-noun composition.
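
As an illustration of the Norwegian orthography
described above, the following is a minimal sketch
of splitting a compound into stems while allowing
a hyphen or an epenthetic {e} or {s} between them.
It is not the Oslo-Bergen analyzer [2]; the stem
list and the function name split_compound are
assumptions made for the example.

```python
# Toy stem lexicon (an assumption for this sketch); a real
# analyzer would consult a full morphological lexicon.
STEMS = {"is", "te", "barn", "hage", "arbeid", "liv"}

# Elements allowed between two stems: nothing, epenthetic s or e,
# or a hyphen. This is a simplification of the real orthography.
LINKERS = ("", "s", "e", "-")


def split_compound(word, stems=STEMS):
    """Return every way of splitting `word` into known stems."""
    results = [[word]] if word in stems else []
    for i in range(1, len(word)):
        head = word[:i]
        if head not in stems:
            continue
        rest = word[i:]
        for linker in LINKERS:
            # The linker must be followed by at least one more character.
            if rest.startswith(linker) and len(rest) > len(linker):
                for tail in split_compound(rest[len(linker):], stems):
                    results.append([head] + tail)
    return results
```

With this toy lexicon, split_compound("iste") yields
the single analysis ["is", "te"], and "arbeidsliv"
is split into "arbeid" and "liv" across the
epenthetic {s}. A larger lexicon would produce
several competing analyses that need ranking.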

Fortunately, large Norwegian-English dictionaries
contain plenty of compounds, and the translation
patterns of these can be used to build up
frequencies and strategies for automatically
translating previously unseen compounds. I have
access to one such dictionary stored as XML. XML
is a way of encoding trees, so I am treating the
dictionary as a treebank of aligned words and
expressions, using existing tools developed for
the Penn Treebank for search and maintenance.
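
To make the idea of building up frequencies of
translation patterns concrete, here is a minimal
sketch that counts the surface shape of the
English side of each entry. The XML layout (entry,
no and en elements), the sample entries and the
pattern labels are all assumptions for the
example; the actual dictionary format is not
described here, and a real classification would
use part-of-speech information rather than
surface shape alone.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical dictionary fragment; the element names are invented.
SAMPLE = """
<dictionary>
  <entry><no>iste</no><en>iced tea</en></entry>
  <entry><no>arbeidsliv</no><en>working life</en></entry>
  <entry><no>livskvalitet</no><en>quality of life</en></entry>
  <entry><no>solnedgang</no><en>sunset</en></entry>
</dictionary>
"""


def surface_pattern(english):
    """Classify an English translation by its surface shape only."""
    tokens = english.split()
    if len(tokens) == 1:
        return "hyphenated" if "-" in english else "solid"
    if len(tokens) == 3 and tokens[1] == "of":
        return "X of Y"
    if len(tokens) == 2:
        return "two words"
    return "other"


def pattern_frequencies(xml_text):
    """Count surface patterns over all entries in the dictionary."""
    root = ET.fromstring(xml_text)
    return Counter(
        surface_pattern(entry.findtext("en", "").strip())
        for entry in root.iter("entry")
    )
```

Calling pattern_frequencies(SAMPLE) counts two
"two words" entries, one "X of Y" entry and one
"solid" entry. Such counts could serve as a prior
when choosing a pattern for an unseen compound.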

LOGON is based on transfer of Minimal Recursion
Semantics structures (MRSes) [3], and to properly
generate the target string it is necessary to have
a parsed version of the English translation of the
compound. One way of achieving this is to collate
a lexicon of preparsed pairs of translations of
compounds, and then try to use these directly when
encountering new compounds, e.g. in the fashion of
Example-Based Machine Translation [4].
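
The lookup-and-reuse idea can be sketched as
follows. This is an illustration only, not the
LOGON implementation: plain strings stand in for
the preparsed MRS pairs, and the stem glosses, the
stored examples and the function name
translate_by_example are assumptions made for the
example.

```python
from collections import Counter

# Hypothetical stem-by-stem glosses (illustrative only).
STEM_GLOSS = {"is": "ice", "te": "tea", "urt": "herb",
              "liv": "life", "kvalitet": "quality"}

# Known compounds, keyed by (modifier stem, head stem), with the
# English pattern that their dictionary translation follows.
EXAMPLES = {
    ("is", "te"): "{mod} {head}",            # iste -> ice tea
    ("liv", "kvalitet"): "{head} of {mod}",  # livskvalitet -> quality of life
}


def translate_by_example(mod, head):
    """Reuse the most common pattern among known compounds that
    share the head stem; return None when no example applies."""
    if mod not in STEM_GLOSS or head not in STEM_GLOSS:
        return None
    patterns = [p for (_, h), p in EXAMPLES.items() if h == head]
    if not patterns:
        return None
    best = Counter(patterns).most_common(1)[0][0]
    return best.format(mod=STEM_GLOSS[mod], head=STEM_GLOSS[head])
```

Here translate_by_example("urt", "te") reuses the
pattern of "iste" and returns "herb tea", while a
compound whose head has no stored example returns
None and would need another strategy.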

Bilingual dictionaries are also a source for fixed
expressions, and these can, to a degree, be
treated like compounds by being stored in a
preparsed form in a bilingual lexicon for later
lookup. Unfortunately, the idiosyncratic nature of
fixed expressions like idioms means that
previously unseen expressions cannot easily be
mapped onto known ones. However, there are several
ways of discovering matching expressions in
bitexts, and these can then be added to the
existing repository.

The challenge, then, is to build the "augmented"
treebank of known compounds and larger expressions
by reducing the dictionary to only the relevant
entries, generating the appropriate MRSes for
these, and building an adaptation mechanism with
lookup and adaptation by orthographic form or MRS.
The results will be presented at the conference.

References:

[1] Lønning, J. T., Oepen, S., Beermann, D.,
Hellan, L., Carroll, J., Dyvik, H., Flickinger,
D., Johannessen, J. B., Meurer, P., Nordgård, T.,
Rosén, V. and Velldal, E. LOGON. A Norwegian MT
effort. Proceedings of the Workshop in Recent
Advances in Scandinavian Machine Translation, page
6. Uppsala, Sweden, 2004.

[2] Johannessen, J. B. and Hauglin, H. An
automatic analysis of compounds. In T. Haukioja,
editor, Papers from the 16th Scandinavian
Conference of Linguistics, pages 209--220.
Turku/Åbo, Finland, 1996.

[3] Copestake, A., Flickinger, D., Malouf, R.,
Riehemann, S. and Sag, I. Translation using
Minimal Recursion Semantics. Proceedings of the
6th International Conference on Theoretical and
Methodological Issues in Machine Translation
(TMI-95). Leuven, Belgium, 1995.

[4] Carl, M. and Way, A., editors. Recent
Advances in Example-Based Machine Translation.
Kluwer, 2003.