Automatic Construction of an Auxiliary Dictionary
for Robust Morphological
Analysis
Hristo Dimitrov Krushkov
University of Plovdiv, Computer Science Dept.
e-mail
hdk@ulcc.pu.acad.bg
Abstract
The purpose of the morphological analysis is to
assign an arbitrary word-form to a morphological class. That
includes identifying of the base-form of the word, its
grammatical features and answering the question what part of
speech it is. The general automatic morphological analyser looks
up in a machine morphological dictionary a prototype (base form
or template) of the word-form with the respective morphological
information. The robust morphological analyser makes
probabilistic morphological classification of iunknowni
word-forms, using an auxiliary dictionary of word-endings. The
paper deals with an approach to automatic construction of such a
dictionary.
Introduction
The robust analysis algorithm allows to classify words as well as their inflectional forms which are not presented in the dictionary. The algorithm is based on the links between the word-form endings and corresponding grammatical information. There are 2 ways for hypothetical morphological classification of unknown words:
In both cases an analysed word-form obtains the
grammatical features of the word stored in the dictionary with
maximum number of matching letters belonging to the endings of
the compared words. In case a) a pattern (inverted) dictionary is
needed, in case b) procedures for automatic word separation are
expected.
An interval approach
If a number of words having the same endings as well as the same grammatical features are situated continuously in an inverted dictionary then we can modify the comparison of endings (case a) into a belonging to an interval. The lower bound of the interval is the first word of the mentioned above sequence (e.g. ab....ls1s2...sn), the upper bound - the last one (e.g. cd....qs1s2...sn). So the interval can be defined by a triple (s=s1s2...sn ,l,q). An auxiliary (interval) dictionary A can be defined as a list of {p=(s,l,q,G)}, where G is the grammatical information for the corresponding interval. Another dictionary called dictionary of exclusions E is needed, the word-forms generated by which could not participate in A. E comprises citation forms which generate word-forms with special endings, which grammatical features differ from the grammatical features of the words with the same endings. So if l=q (the interval consists of only one word-form) we strike the word-form off the interval dictionary and store its citation form in the E.
In this way for the realisation of the robust
morphological analysis is necessary to define procedures and
dictionary A of words (patterns) p1,p2,..,pn
allowing to perform hypothesis for the grammatical features of
the words with likelihood 99%. This is a typical task of the
image recognition theory for determining resolving rules and
patterns in the set of all word-forms. It is intuitively clear
that after appending new lexical units the number of the patterns
does not increase largely (see the diagram).
Through a morphological synthesis all possible word-forms of the nouns, adjectives and verbs are generated. The set of these word-forms is divided into disjoint classes (intervals). All word-forms in a class have the same grammatical features and a pattern, which matches all of them. The citation forms of the word-forms left out of the intervals (exclusive words) are stored in a dictionary, containing also non inflected parts of speech and pronouns. After the striking off the exclusive words a new compressing of the interval dictionary is possible. The neighbouring intervals with the same G joint. In this way the number of patterns is reduced.
The presented approach allows to decrease the
volume of the dictionary necessary for the analysis and to
perform analysis for unknown (without citation form in the
dictionary) or misspelled words.
Two types of auxiliary dictionaries
f F = {NM, NF, NN, NP, Adj, V} where
N is a set comprising all inflectional types of the nouns and N = NM U NF U NN U NP;
NM is a subset of N and consists of all inflectional types of the nouns of masc. gender;
NF is a subset of N and consists of all inflectional types of the nouns of fem. gender;
NN is a subset of N and consists of all inflectional types of the nouns of neut. gender;
NP is a subset of N and consists of all inflectional types of the nouns only in plural;
Adj is a set comprising all inflectional types of the adjectives;
V is a set comprising all inflectional types of the verbs.
g G = {1..Kf} is the word-form number in the paradigm of
the corresponding part of speech.
If f N then Kf=6;
if f Adj then Kf=9;
if f V then Kf=49;
In this way the robust analyser builds up a hypothesis what part of speech is the analysed word (what are it grammatical features, but not to which inflectional type it belongs to.
f T = N U Adj U V,
where T is a union of the described bellow inflectional types. The difference with the previous case is as follows:
While |F|=6 , but |T|=|N|+|Adj|+|V|,
because these three sets are disjoint
g G as a previous case.
The construction of the auxiliary dictionary is based on the following algorithm:
1. All word-forms from the morphological
dictionary are generated. Every word-form has two formal features
f and g. A word-form dictionary is obtained
| d1 | f1 | g1 |
| d2 | f2 | g2 |
| ........... | ....... | ........ |
| dn | fn | gn |
2. Words in the word-form dictionary are inverted. An inverted
word-form dictionary is obtained.
| di1 | f1 | g1 |
| di2 | f2 | g2 |
| ........... | ....... | ........ |
| din | fn | gn |
where di1=inv(d1 ), di2=inv(d2 ), ..... din=inv(dn ) - e.g.( edirectori) = erotceridi
3. An inverted dictionary is sorted in alphabetic order by the first field (inverted word-forms).
4. Intervals obtained in which (gi=gi+1=gi+2=........=gj-1=gj ) as well as (fi=fi+1=fi+2=......=fj-1=fj). That means all the word-forms
_ are with the same grammatical features (in both cases)
_ belong to
the same inflectional types (in the second case).
| ........... | ....... | ........ |
| dii | fi | gi |
| dii+1 | fi+1 | gi+1 |
| dii+2 | fi+2 | gi+2 |
| ........... | ....... | ........ |
| dij-1 | fj-1 | gj-1 |
| dij | fj | gj |
| ........... | ....... | ........ |
5. Only the first and the last elements of every interval are
stored in the auxiliary dictionary, and all intermediate elements
are dropped from it.
| ........... | ....... | ........ |
| dii | fi | gi |
| dij | fj | gj |
| ........... | ....... | ........ |
6. Exclusive words (like dij+1 where fj < fj+1 < fj+2 ) are also dropped from the dictionary and stored in the dictionary of exclusions.
| ........... | ....... | ........ | ........... | ....... | ........ | ||
| dii | fi | gi | dii | fi | gi | ||
| dij | fj | gj | dij | fj | gj | ||
| dij+1 | fj+1 | gj+1 | ------- | >> | |||
| dij+2 | fj+2 | gj+2 | dij+2 | fj+2 | gj+2 | ||
| dik | fk | gk | dik | fk | gk |
7. If fj = fj+2 and gj = gj+2 the two intervals join in a new common interval as
follows:
| ........... | ....... | ........ |
| dii | fi | gi |
| dik | fk | gk |
| ........... | ....... | ........ |
Robust analysis
The robust analysis goes in two steps. On the first step general morphological analysis of the word-form is performed using the exclusive dictionary E. If the word-form is not classified, the second step is robust analysis by using the auxiliary (interval) dictionary A. If the word-forms belongs to an interval it obtains the grammatical information G of this interval. If the word-form is situated between intervals 3 variants are possible:
Practical results
There are two pairs of dictionaries (E, A)
automatically built using the presented approach. A base
morphological dictionary with 65000 entries has been used for
generation of 1,391,050 word-forms. In the first case the
auxiliary dictionary consists of 41,146 intervals, in the second
case - 78,080. In both cases morphological features of the
word-forms have been determined with estimation of 99.4%.
Determining the inflectional type number and the base-form of the
word in the former case is 86.7%, but 99.4% in the latter.
Because of the small size of the base morphological dictionary
(240 Kb) in both cases it has been used for the first step of
analysis instead the exclusive dictionary E.