Automatic Construction of an Auxiliary Dictionary

for Robust Morphological Analysis

Hristo Dimitrov Krushkov

University of Plovdiv, Computer Science Dept.

e-mail hdk@ulcc.pu.acad.bg

Abstract

The purpose of the morphological analysis is to assign an arbitrary word-form to a morphological class. That includes identifying of the base-form of the word, its grammatical features and answering the question what part of speech it is. The general automatic morphological analyser looks up in a machine morphological dictionary a prototype (base form or template) of the word-form with the respective morphological information. The robust morphological analyser makes probabilistic morphological classification of iunknowni word-forms, using an auxiliary dictionary of word-endings. The paper deals with an approach to automatic construction of such a dictionary.

Introduction

The robust analysis algorithm allows to classify words as well as their inflectional forms which are not presented in the dictionary. The algorithm is based on the links between the word-form endings and corresponding grammatical information. There are 2 ways for hypothetical morphological classification of unknown words:

  1. Comparison between the ending of the word-form under consideration and of a pattern word-form (stored in a dictionary with related grammatical information);
  2. Recursive step by step separation of all possible prefixes from the word-form and analysis of the right part of the separation.

In both cases an analysed word-form obtains the grammatical features of the word stored in the dictionary with maximum number of matching letters belonging to the endings of the compared words. In case a) a pattern (inverted) dictionary is needed, in case b) procedures for automatic word separation are expected.

An interval approach

If a number of words having the same endings as well as the same grammatical features are situated continuously in an inverted dictionary then we can modify the comparison of endings (case a) into a belonging to an interval. The lower bound of the interval is the first word of the mentioned above sequence (e.g. ab....ls1s2...sn), the upper bound - the last one (e.g. cd....qs1s2...sn). So the interval can be defined by a triple (s=s1s2...sn ,l,q). An auxiliary (interval) dictionary A can be defined as a list of {p=(s,l,q,G)}, where G is the grammatical information for the corresponding interval. Another dictionary called dictionary of exclusions E is needed, the word-forms generated by which could not participate in A. E comprises citation forms which generate word-forms with special endings, which grammatical features differ from the grammatical features of the words with the same endings. So if l=q (the interval consists of only one word-form) we strike the word-form off the interval dictionary and store its citation form in the E.

In this way for the realisation of the robust morphological analysis is necessary to define procedures and dictionary A of words (patterns) p1,p2,..,pn allowing to perform hypothesis for the grammatical features of the words with likelihood 99%. This is a typical task of the image recognition theory for determining resolving rules and patterns in the set of all word-forms. It is intuitively clear that after appending new lexical units the number of the patterns does not increase largely (see the diagram).

Through a morphological synthesis all possible word-forms of the nouns, adjectives and verbs are generated. The set of these word-forms is divided into disjoint classes (intervals). All word-forms in a class have the same grammatical features and a pattern, which matches all of them. The citation forms of the word-forms left out of the intervals (exclusive words) are stored in a dictionary, containing also non inflected parts of speech and pronouns. After the striking off the exclusive words a new compressing of the interval dictionary is possible. The neighbouring intervals with the same G joint. In this way the number of patterns is reduced.

The presented approach allows to decrease the volume of the dictionary necessary for the analysis and to perform analysis for unknown (without citation form in the dictionary) or misspelled words.

Two types of auxiliary dictionaries

Auxiliary dictionary only with grammatical features ( for a draft robust analysis)

f F = {NM, NF, NN, NP, Adj, V} where

N is a set comprising all inflectional types of the nouns and N = NM U NF U NN U NP;

NM is a subset of N and consists of all inflectional types of the nouns of masc. gender;

NF is a subset of N and consists of all inflectional types of the nouns of fem. gender;

NN is a subset of N and consists of all inflectional types of the nouns of neut. gender;

NP is a subset of N and consists of all inflectional types of the nouns only in plural;

Adj is a set comprising all inflectional types of the adjectives;

V is a set comprising all inflectional types of the verbs.

g G = {1..Kf} is the word-form number in the paradigm of the corresponding part of speech.

If f N then Kf=6;

if f Adj then Kf=9;

if f V then Kf=49;

In this way the robust analyser builds up a hypothesis what part of speech is the analysed word (what are it grammatical features, but not to which inflectional type it belongs to.

Auxiliary dictionary with inflectional type numbers ( for a precise robust analysis).

f T = N U Adj U V,

where T is a union of the described bellow inflectional types. The difference with the previous case is as follows:

While |F|=6 , but |T|=|N|+|Adj|+|V|,

because these three sets are disjoint

g G as a previous case.

The construction of the auxiliary dictionary is based on the following algorithm:

1. All word-forms from the morphological dictionary are generated. Every word-form has two formal features f and g. A word-form dictionary is obtained

d1 f1 g1
d2 f2 g2
........... ....... ........
dn fn gn

2. Words in the word-form dictionary are inverted. An inverted word-form dictionary is obtained.

di1 f1 g1
di2 f2 g2
........... ....... ........
din fn gn

where di1=inv(d1 ), di2=inv(d2 ), ..... din=inv(dn ) - e.g.( edirectori) = erotceridi

3. An inverted dictionary is sorted in alphabetic order by the first field (inverted word-forms).

4. Intervals obtained in which (gi=gi+1=gi+2=........=gj-1=gj ) as well as (fi=fi+1=fi+2=......=fj-1=fj). That means all the word-forms

_ are with the same grammatical features (in both cases)

_ belong to the same inflectional types (in the second case).

........... ....... ........
dii fi gi
dii+1 fi+1 gi+1
dii+2 fi+2 gi+2
........... ....... ........
dij-1 fj-1 gj-1
dij fj gj
........... ....... ........

5. Only the first and the last elements of every interval are stored in the auxiliary dictionary, and all intermediate elements are dropped from it.

........... ....... ........
dii fi gi
dij fj gj
........... ....... ........

6. Exclusive words (like dij+1 where fj < fj+1 < fj+2 ) are also dropped from the dictionary and stored in the dictionary of exclusions.

........... ....... ........     ........... ....... ........
dii fi gi     dii fi gi
dij fj gj     dij fj gj
dij+1 fj+1 gj+1 ------- >>      
dij+2 fj+2 gj+2     dij+2 fj+2 gj+2
dik fk gk     dik fk gk

7. If fj = fj+2 and gj = gj+2 the two intervals join in a new common interval as follows:

........... ....... ........
dii fi gi
dik fk gk
........... ....... ........

Robust analysis

The robust analysis goes in two steps. On the first step general morphological analysis of the word-form is performed using the exclusive dictionary E. If the word-form is not classified, the second step is robust analysis by using the auxiliary (interval) dictionary A. If the word-forms belongs to an interval it obtains the grammatical information G of this interval. If the word-form is situated between intervals 3 variants are possible:

  1. the word is misspelled;
  2. the word-form may have the G of the closest to it (by metrics e.g. of Lewenstein) interval;
  3. the word-form may have the G of the other interval

Practical results

There are two pairs of dictionaries (E, A) automatically built using the presented approach. A base morphological dictionary with 65000 entries has been used for generation of 1,391,050 word-forms. In the first case the auxiliary dictionary consists of 41,146 intervals, in the second case - 78,080. In both cases morphological features of the word-forms have been determined with estimation of 99.4%. Determining the inflectional type number and the base-form of the word in the former case is 86.7%, but 99.4% in the latter. Because of the small size of the base morphological dictionary (240 Kb) in both cases it has been used for the first step of analysis instead the exclusive dictionary E.


HOME