Automatic morphological synthesis and analysis for Bulgarian language

Hristo Krushkov

University of Plovdiv

24 Tzar Assen Str, Plovdiv, Bulgaria

hdk@ulcc.pu.acad.bg


Abstract

This paper describes a methodology for automatic morphological synthesis and analysis that makes it possible to build a morphological processor for an inflectional language. The words of the language are divided into classes (inflectional types). Every class has a unique machine number for identification and a list of rules for generating a paradigm. For every word a pattern is constructed that matches all wordforms belonging to the paradigm of this word. Together, the pattern and the inflectional type number encode the whole paradigm of a particular word. A machine dictionary consists of (word-pattern, inflectional type number) entries. When an arbitrary wordform has to be classified, the analyser looks up a matching word-pattern in the dictionary. If such a pattern is found, a paradigm is generated from it using the rules of the corresponding inflectional type. If the analysed word coincides with a wordform from the generated paradigm, it obtains the grammatical features of that wordform. In this way the word is completely determined morphologically. On the basis of this methodology a morphological processor for the Bulgarian language has been built.

Introduction

Many morphological processors are based on Koskenniemi's two-level model [3,5]. As a general computational model it has many advantages when the language obeys a set of rules. Unfortunately, there are many exceptions to these rules in Bulgarian, as in other Slavonic languages. Sometimes a special module (a lexicon input module [2]) is needed to adapt the two-level model to such a language.

The presented methodology makes it easy to handle languages with rich morphology such as Bulgarian. Moreover, unlike in [4], the machine dictionary stores only one citation form (pattern) for words with internal inflection belonging to the same paradigm (e.g. the pattern 't*s*n' for the wordforms tesen, tqsna, etc.), and the analysis is faster than in [7]. The generation rules are not written in a description language [1,4]; instead they are extracted automatically from the paradigm.


Inflectional types

Bulgarian inflection is described as a number of grammatical rules:

Rule                   Example
a/i                    jena - jeni
ry/yr                  vryx - vyrxove
reduction of e         silen - silni
appending an ending    grad - gradove

A classification of Bulgarian inflection in view of the mentioned rules and the grammatical features of the words is presented in [6]. That classification has 187 different inflectional types, divided by part of speech as follows: 75 for nouns, 14 for adjectives, 41 for pronouns, 11 for numerals and 46 for verbs. Every inflecting Bulgarian word can be classified as a member of one of these types.

From a mathematical point of view the Bulgarian words are divided into disjoint equivalence classes. Every class has a unique machine number for identification and a list of rules for generating a paradigm. A part of speech is a set of classes. Every set can be divided into subsets according to criteria specific to that part of speech.

For example the set of nouns includes the classes with machine numbers 1-75. There are 4 subsets depending on the gender as follows: with machine numbers 1-40: masc., 41-53: fem., 54-73: neut., 74-75: only plural.

Two words are in the same class if their paradigms are generated in the same way. The paradigm is described as a list of wordforms with concrete grammatical features for each of them. Every wordform also has a number. Two wordforms with equal numbers have the same grammatical features.

For example in the paradigm of the adjectives, wordform num. 1 has grammatical features (masc., sing.); wordform num. 2 has grammatical features (pl.); etc. (see table below). For all parts of speech wordform num. 1 is the base (citation) form.

wordform number    grammatical features
1.                 masc., sing.
2.                 pl. / extended form
3.                 fem., sing.
4.                 neut., sing.
5.                 masc., sing., full def. art.
6.                 masc., sing., short def. art.
7.                 pl., def. art.
8.                 fem., def. art.
9.                 neut., def. art.

An approach for automatic morphological generation and analysis is investigated, based on the presented classification.


Morphological synthesis

For every word a pattern is built up. The pattern and the inflectional type number determine the paradigm of that word. The pattern shows which letters are constant in all wordforms in the paradigm of the word and which are changing. The changing letters are marked with '*' in the pattern.

Automatic pattern extraction from the paradigm.

The pattern can be automatically extracted from the word paradigm following this simple algorithm:

1. Pattern:=''. Extract the first letter of the base-form.

2. If the letter is constant for the whole paradigm

then pattern:=pattern+letter

else pattern:=pattern+ '*'.

3. If there are no more letters in the base form

then end

else extract next letter and go to 2.
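
The steps above can be sketched in Python. Step 2 does not say how "constant" is tested when a letter disappears in some wordforms (the second 'e' of veren is dropped in verni), so this sketch makes an assumption: a '*' may stand for one letter or for the empty letter, and the pattern chosen is the one with the fewest '*' that still matches the beginning of every wordform. The function names are illustrative and the exhaustive search is only meant for short words.

```python
import re
from itertools import combinations


def pattern_regex(pattern):
    # '*' stands for one letter or the empty letter; all other letters are literal.
    body = "".join(".?" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body)


def extract_pattern(paradigm):
    """Return a pattern for a paradigm whose first entry is the base form.

    Assumption of this sketch: try masks with 0, 1, 2, ... changing letters
    and return the first one that matches the start of every wordform.
    """
    base = paradigm[0]
    for n_stars in range(len(base) + 1):
        for positions in combinations(range(len(base)), n_stars):
            candidate = "".join(
                "*" if i in positions else ch for i, ch in enumerate(base)
            )
            rx = pattern_regex(candidate)
            if all(rx.match(form) for form in paradigm):
                return candidate
    return "*" * len(base)  # not reached: the all-'*' mask always matches


adjective = ["veren", "verni", "vqrna", "vqrno", "verniq",
             "verniqt", "vernite", "vqrnata", "vqrnoto"]
print(extract_pattern(adjective))
```

A purely positional comparison would also mark the final 'n' as changing, because verni and vqrna have other letters in fifth position; allowing '*' to be empty is what keeps that 'n' constant.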

For example, the paradigm of the word 'veren' (truthful) is:

number/wordform    grammatical features
1. veren           masc., sing.
2. verni           pl. / extended form
3. vqrna           fem., sing.
4. vqrno           neut., sing.
5. verniq          masc., sing., full def. art.
6. verniqt         masc., sing., short def. art.
7. vernite         pl., def. art.
8. vqrnata         fem., def. art.
9. vqrnoto         neut., def. art.

The pattern of the word 'veren' is 'v*r*n'. All other words of the same inflectional type, such as 'tesen', 'besen' and 'desen', with the patterns 't*s*n', 'b*s*n' and 'd*s*n', likewise have 2 changing letters (the last two vowels) in the pattern.

The pattern has two properties that are important for morphological analysis:

1. The length of the pattern is less than or equal to the length of every wordform of the paradigm generated from that pattern.

2. The pattern matches the beginning of every wordform with a full coincidence of the constant letters.

Automatic extraction of synthesis rules

The rules for the wordform generation are of 2 types:

1. Replacing the '*' with a letter (including the empty one '').

2. Appending endings.

For example the rules for the above-mentioned inflectional type are the following:

number/wordform    rules
1. veren           */e; */e
2. verni           */e; */'' +i
3. vqrna           */q; */'' +a
4. vqrno           */q; */'' +o
5. verniq          */e; */'' +i+q
6. verniqt         */e; */'' +i+qt
7. vernite         */e; */'' +i+te
8. vqrnata         */q; */'' +a+ta
9. vqrnoto         */q; */'' +o+to
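
One way to recover such rules from a pattern and a wordform is sketched below in Python. It is an illustration rather than the processor's actual procedure: `extract_rule` is a hypothetical helper, each '*' is assumed to stand for one letter or the empty letter, and the ending is returned as a single string (for example 'iq') instead of being split into morphemes (+i+q) as in the table.

```python
import re


def extract_rule(pattern, wordform):
    """Return (replacement letters for each '*', appended ending)."""
    # Each '*' becomes a capture group for one letter or the empty letter;
    # whatever follows the pattern is captured as the ending.
    body = "".join("(.?)" if ch == "*" else re.escape(ch) for ch in pattern)
    match = re.fullmatch(body + "(.*)", wordform)
    if match is None:
        raise ValueError(f"{wordform!r} does not match pattern {pattern!r}")
    *replacements, ending = match.groups()
    return replacements, ending


for form in ["veren", "verni", "vqrna", "verniqt", "vqrnata"]:
    print(form, extract_rule("v*r*n", form))
```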

Together, the pattern and the inflectional type number encode the whole paradigm of a particular word. For every wordform, the inflectional type stores a list of letters (including the empty one, '') that replace the '*' symbols of the pattern, and the morphemes to be appended to the pattern.

The morphological generation of the paradigm is based on the following simple mechanism:

Every wordform can be constructed from the pattern by applying the rules of replacing (*/letter) and appending (+morpheme) listed after the wordform number. Once extracted, the rules for one member of an inflectional type are the same for all other members of this type. The table below lists some words belonging to different inflectional types:

word      some rules                 pattern
teat=r    reduction of '='           teat*r
vqra      q/e, a/i                   v*r*
analog    g/z                        analo*
vqt=r     q/e, reduction of '='      v*t*r
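
The replace-and-append mechanism, and the reuse of one rule list for every member of a type, can be sketched in Python as follows. The rule list is transcribed from the rules given for 'veren', with each ending written as one string; the names `ADJECTIVE_RULES`, `generate` and `paradigm` are illustrative, not taken from the processor.

```python
# One (replacement letters, ending) pair per wordform number 1..9.
ADJECTIVE_RULES = [
    (["e", "e"], ""),
    (["e", ""], "i"),
    (["q", ""], "a"),
    (["q", ""], "o"),
    (["e", ""], "iq"),
    (["e", ""], "iqt"),
    (["e", ""], "ite"),
    (["q", ""], "ata"),
    (["q", ""], "oto"),
]


def generate(pattern, replacements, ending):
    """Replace each '*' in turn with the next letter, then append the ending."""
    letters = []
    next_replacement = 0
    for ch in pattern:
        if ch == "*":
            letters.append(replacements[next_replacement])
            next_replacement += 1
        else:
            letters.append(ch)
    return "".join(letters) + ending


def paradigm(pattern, rules):
    return [generate(pattern, repl, ending) for repl, ending in rules]


print(paradigm("v*r*n", ADJECTIVE_RULES))
print(paradigm("t*s*n", ADJECTIVE_RULES))
```

Only the pattern changes between the two calls, which is why the dictionary needs to store nothing more than the pattern and the type number for each word.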

Data structures for morphological synthesis

Array of letters and morphemes for replacing and appending.

Every rule has a number that is equal to its index in an array of strings. The replacing rules have numbers 30-48, where number 30 stands for replacing with the empty letter. The appending rules have numbers 1-29. The appending rules for verbs are kept separately because of their specific endings.

 1 a        8 jo      15 sa     22 ;ove    29 jte    36 j    43 c
 2 e        9 na      16 ta     23 _       30 ''     37 k    44 h
 3 eve     10 ne      17 te     24 q       31 a      38 n    45 w
 4 i       11 o       18 to     25 qt      32 g      39 o    46 =
 5 ili]a   12 ove     19 =t     26 ko      33 e      40 s    47 [
 6 i]a     13 ovce    20 ;      27 ete     34 z      41 t    48 q
 7 iq      14 vci     21 ;o     28 j       35 i      42 x

Tables with paradigm generation rules.

Tables have been created for every part of speech. The rows of a table correspond to the inflectional types. Every row consists of a list of generation rule numbers. For example, the noun 'vqt=r', with inflectional type number 4, has the pattern 'v*t*r'. Row 4 of the noun generation table is:

Wordform number         1.           2.                 3.                4.          5.
Grammatical features    base form    short def. art.    full def. art.    plural      plural, def. art.
List of rules           48,46        48,46,1            48,46,19          33,30,12    33,30,12,17

In fact, some of the rules (e.g., def. art.) are placed out of the table for compactness because they are the same for a subset of inflectional types.
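
A minimal Python sketch of how such a row might be interpreted is given below. It assumes that numbers 30-48 are consumed left to right, one per '*', and that numbers 1-29 are appended in the order listed; the text does not state this explicitly, but it reproduces the expected wordforms. Only the entries of the string array needed for this row are included, and the names are illustrative.

```python
# Subset of the array of letters and morphemes (index -> string).
RULE_STRINGS = {
    1: "a", 12: "ove", 17: "te", 19: "=t",   # appending rules (1-29)
    30: "", 33: "e", 46: "=", 48: "q",       # replacing rules (30-48)
}

# Row 4 of the noun generation table: one list of rule numbers per wordform.
NOUN_TYPE_4 = [
    [48, 46],
    [48, 46, 1],
    [48, 46, 19],
    [33, 30, 12],
    [33, 30, 12, 17],
]


def apply_rule_numbers(pattern, numbers):
    replacing = [RULE_STRINGS[n] for n in numbers if n >= 30]
    appending = [RULE_STRINGS[n] for n in numbers if n < 30]
    letters = []
    next_replacement = 0
    for ch in pattern:
        if ch == "*":
            letters.append(replacing[next_replacement])
            next_replacement += 1
        else:
            letters.append(ch)
    return "".join(letters) + "".join(appending)


for numbers in NOUN_TYPE_4:
    print(numbers, apply_rule_numbers("v*t*r", numbers))
```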

Morphological analysis

The goal of automatic morphological analysis is to classify an arbitrary wordform morphologically. This includes identifying the base form of the word, its grammatical features and the inflectional type (part of speech) to which it belongs. In case of homonymy (when the wordform belongs to more than one inflectional type and has different grammatical features) all possible types must be found.

A machine dictionary consists of (word-pattern, inflectional type number) entries. When an arbitrary wordform has to be classified, the analyser looks up a matching word-pattern in the dictionary. If such a pattern is found, the second part of the entry (the inflectional type number) is used to fetch the rules from the generation table, and a paradigm is generated from the pattern on the basis of these rules. If the analysed word coincides with a wordform from the generated paradigm, it obtains the grammatical features of that wordform. In this way the word is completely determined morphologically. It obtains 2 formal features:

1. An inflectional type number

2. A wordform number in the paradigm of that type.

Moreover, the first wordform of the generated paradigm is the base form of the analysed word. The inflectional type number determines the part of speech to which the analysed word belongs.
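
The analysis procedure can be sketched in Python with a two-entry dictionary. This is an illustration under several assumptions: the dictionary is scanned linearly (the text does not say how the lookup is organised), the rules are stored as (replacement letters, ending) pairs rather than rule numbers, and the type number 76 given to the adjective 'veren' is a placeholder, since its real type number is not stated. Noun type 4 for 'vqt=r' is taken from the generation table example.

```python
import re

# type number -> one (replacement letters, ending) pair per wordform number
RULES = {
    4: [(["q", "="], ""), (["q", "="], "a"), (["q", "="], "=t"),
        (["e", ""], "ove"), (["e", ""], "ovete")],
    76: [(["e", "e"], ""), (["e", ""], "i"), (["q", ""], "a"),
         (["q", ""], "o"), (["e", ""], "iq"), (["e", ""], "iqt"),
         (["e", ""], "ite"), (["q", ""], "ata"), (["q", ""], "oto")],
}

# (word-pattern, inflectional type number); 76 is a placeholder number.
DICTIONARY = [("v*t*r", 4), ("v*r*n", 76)]


def matches(pattern, word):
    # The pattern must match the beginning of the word; '*' is one letter or none.
    body = "".join(".?" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(body, word) is not None


def generate(pattern, replacements, ending):
    pending = list(replacements)
    stem = "".join(pending.pop(0) if ch == "*" else ch for ch in pattern)
    return stem + ending


def analyse(word):
    """Return every (base form, type number, wordform number) for the word."""
    results = []
    for pattern, type_number in DICTIONARY:
        if not matches(pattern, word):
            continue
        rules = RULES[type_number]
        base_form = generate(pattern, *rules[0])
        for wordform_number, (repl, ending) in enumerate(rules, start=1):
            if generate(pattern, repl, ending) == word:
                results.append((base_form, type_number, wordform_number))
    return results


print(analyse("vetrove"))
print(analyse("vqrnata"))
```

Collecting all hits in a list, instead of stopping at the first one, is what allows homonymous wordforms to receive every possible classification.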

For example, if that number is from 1 to 75 the word is a noun, etc. (see table below).

Type numbers    Part of speech
1-75            Noun
76-89           Adjective
90-130          Pronoun
131-141         Numeral
142-187         Verb
188             Adverb
189             Particle
190             Preposition
191             Conjunction
192             Interjection

When the inflectional type belongs to a subset of some part of speech, the type carries grammatical features that are not listed in the paradigm. For example, the gender of nouns: if the inflectional type number is in the range 1-40 the gender is masc., etc. (see table below).

Type numbers    Subset
1-40            masc.
41-53           fem.
54-73           neut.
74-75           only plural
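
Both tables amount to range lookups on the inflectional type number. A short Python sketch, with illustrative names, is:

```python
POS_RANGES = [
    (1, 75, "Noun"), (76, 89, "Adjective"), (90, 130, "Pronoun"),
    (131, 141, "Numeral"), (142, 187, "Verb"), (188, 188, "Adverb"),
    (189, 189, "Particle"), (190, 190, "Preposition"),
    (191, 191, "Conjunction"), (192, 192, "Interjection"),
]

NOUN_SUBSETS = [
    (1, 40, "masc."), (41, 53, "fem."), (54, 73, "neut."),
    (74, 75, "only plural"),
]


def lookup(ranges, type_number):
    # Return the label of the first range containing the number, else None.
    for low, high, label in ranges:
        if low <= type_number <= high:
            return label
    return None


def classify(type_number):
    """Return (part of speech, noun subset or None)."""
    return lookup(POS_RANGES, type_number), lookup(NOUN_SUBSETS, type_number)


print(classify(4))
print(classify(60))
print(classify(150))
```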

Results and Analysis

The morphological dictionary comprises 67500 patterns, which produce over 1.5 million wordforms. It is compressed to a size of 240 Kb. Before analysis the dictionary is loaded into memory, which allows the analysis to run at a maximum speed of a thousand words per second on a PC-386/40 MHz (relative speed 5260%, with the original PC taken as 100%). The software used is Borland Pascal 7.0, running on the MS DOS platform.

References

[1] Anick P., Artemieff S. A High-level Morphological Description Language Exploiting Inflectional Paradigms. COLING'92 - 15th Int. Conf. on Computational Linguistics, Nantes, 23-28 Aug. 1992, pp. 67-73.

[2] Erjavec T., Tancig P. An Integrated System for Morphological Analysis of the Slovene Language. COLING'90 - 13th Int. Conf. on Computational Linguistics, Helsinki, 1990, vol. 1, pp. 86-88.

[3] Karttunen L., Kaplan R., Zaenen A. Two Level Morphology with Composition. COLING'92 - 15th Int. Conf. on Computational Linguistics, Nantes, 23-28 Aug. 1992, pp. 141-148.

[4] Kichovich M. A Declarative Description of the Inflectional Morphology and Morphological Analysis and Synthesis. Conf. on AI, Varna, 1987.

[5] Koskenniemi K. A General Computational Model for Word-Form Recognition and Production. COLING'84 - 10th Int. Conf. on Computational Linguistics, Stanford, 2-6 July 1984, pp. 178-181.

[6] Krustev B. The Bulgarian Morphology in 187 type tables. NI, Sofia, 1984.

[7] Simov K., Angelova G., Paskaleva E. MORPHO-ASSISTANT: The Proper Treatment of Morphological Knowledge. COLING'90 - 13th Int. Conf. on Computational Linguistics, Helsinki, 1990, vol. 3, pp. 455-457.

