Automatic morphological synthesis and analysis for Bulgarian language
Hristo Krushkov
University of Plovdiv
24 Tzar Assen Str, Plovdiv, Bulgaria
hdk@ulcc.pu.acad.bg
Abstract
This paper describes a methodology for automatic morphological synthesis and analysis that makes it possible to build a morphological processor for an inflectional language. The words of the language are divided into classes (inflectional types). Every class has a unique machine number for identification and a list of rules for generating a paradigm. For every word a pattern is constructed that matches all wordforms in the paradigm of that word. Together, the pattern and the inflectional type number carry the information for the whole paradigm of a particular word. A machine dictionary consists of (word-pattern, inflectional type number) entries. When an arbitrary wordform has to be classified, the analyser looks up a matching word-pattern in the dictionary. If such a pattern is found, a paradigm is generated from it using the rules of the corresponding inflectional type. If the analysed word coincides with a wordform from the generated paradigm, it obtains the grammatical features of that wordform. In this way the word is completely determined morphologically. A morphological processor for the Bulgarian language has been built on the basis of this methodology.
Introduction
Many morphological processors are based on Koskenniemi's two-level model [3,5]. As a general computational model it has many advantages when the language obeys a set of rules. Unfortunately, the Bulgarian language has many exceptions to such rules, as do other Slavonic languages. Adapting the two-level model to such a language sometimes requires a special module (a lexicon input module [2]).
The presented methodology makes it easy to handle languages with rich morphology, such as Bulgarian. Moreover, unlike in [4], the machine dictionary stores only one citation form (pattern) for words with internal inflection belonging to the same paradigm (e.g. the pattern 't*s*n' for the wordforms tesen, tqsna, etc.), and analysis is faster than in [7]. The generation rules are not written in a description language [1,4]; instead they are extracted automatically from the paradigm.
Inflectional types
Bulgarian inflection is described as a number of
grammatical rules:
| Rule | Example |
| a/i | jena - jeni |
| ry/yr | vryx - vyrxove |
| reduction of e | silen - silni |
| appending an ending | grad - gradove |
A classification of Bulgarian inflection in terms of these rules and the grammatical features of the words is presented in [6]. That classification contains 187 inflectional types, distributed over the parts of speech as follows: 75 for the nouns, 14 for the adjectives, 41 for the pronouns, 11 for the numerals and 46 for the verbs. Every inflecting Bulgarian word can be classified as a member of one of these types.
From a mathematical point of view, the Bulgarian words are divided into disjoint equivalence classes. Every class has a unique machine number for identification and a list of rules for generating a paradigm. A part of speech is a set of classes. Every set can be divided into subsets according to criteria specific to that part of speech.
For example, the set of nouns comprises the classes with machine numbers 1-75. It has 4 subsets according to gender: machine numbers 1-40: masc., 41-53: fem., 54-73: neut., 74-75: only plural.
Two words are in the same class if their paradigms are generated in the same way. The paradigm is described as a list of wordforms with concrete grammatical features for each of them. Every wordform also has a number. Two wordforms with equal numbers have the same grammatical features.
For example in the paradigm of the adjectives, wordform
num. 1 has grammatical features (masc., sing.); wordform num. 2
has grammatical features (pl.); etc. (see table below). For all
parts of speech wordform num. 1 is the base (citation) form.
| wordform number | grammatical features |
| 1. | masc., sing. |
| 2. | pl. / extended form |
| 3. | fem., sing. |
| 4. | neut., sing. |
| 5. | masc., sing., full def. art. |
| 6. | masc., sing., short def.art. |
| 7. | pl., def.art. |
| 8. | fem., def.art. |
| 9. | neut., def.art. |
An approach to automatic morphological generation and analysis based on this classification is investigated below.
Morphological synthesis
For every word a pattern is built up. The pattern and the
inflectional type number determine the paradigm of that word. The
pattern shows which letters are constant in all wordforms in the
paradigm of the word and which are changing. The changing letters
are marked with '*' in the pattern.
Automatic pattern extraction from the paradigm
The pattern can be automatically extracted from the word paradigm following this simple algorithm:
1. Pattern:=''. Extract the first letter of the base-form.
2. If the letter is constant for the whole paradigm
then pattern:=pattern+letter
else pattern:=pattern+ '*'.
3. If there are no more letters in the base form
then end
else extract next letter and go to 2.
For example, the paradigm of the word 'veren' (truthful) is:
| number/wordform | grammatical features |
| 1. veren | masc., sing. |
| 2. verni | pl. / extended form |
| 3. vqrna | fem., sing. |
| 4. vqrno | neut., sing. |
| 5. verniq | masc., sing., full def.art. |
| 6. verniqt | masc., sing., short def.art. |
| 7. vernite | pl., def.art. |
| 8. vqrnata | fem., def.art. |
| 9. vqrnoto | neut., def.art. |
The pattern of the word 'veren' is 'v*r*n'. All other words of the same inflectional type, such as 'tesen', 'besen' and 'desen' with the patterns 't*s*n', 'b*s*n' and 'd*s*n', likewise have 2 changing letters (the last two vowels) in the pattern.
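The extraction algorithm can be sketched in Python as follows. This is a minimal sketch, not the author's implementation. The paper does not say how "constant for the whole paradigm" is tested when a letter is dropped in some wordforms (as in 'veren' / 'verni'), so the sketch assumes a simple alignment heuristic: a changing letter counts as empty in a wordform when that wordform's current letter already equals the next letter of the base form. The function name `extract_pattern` is hypothetical.

```python
def extract_pattern(paradigm):
    """Return the pattern of a paradigm whose first entry is the base form."""
    base = paradigm[0]
    cursors = [0] * len(paradigm)  # current read position in every wordform
    pattern = []
    for i, letter in enumerate(base):
        next_letter = base[i + 1] if i + 1 < len(base) else None
        constant = all(
            c < len(w) and w[c] == letter for w, c in zip(paradigm, cursors)
        )
        if constant:
            pattern.append(letter)
            cursors = [c + 1 for c in cursors]
            continue
        pattern.append("*")
        for k, w in enumerate(paradigm):
            c = cursors[k]
            if c >= len(w):
                continue  # wordform exhausted: '*' stands for the empty letter
            # Assumption (not stated in the paper): '*' is empty in this
            # wordform if its current letter is already the next base letter.
            dropped = k != 0 and w[c] != letter and w[c] == next_letter
            if not dropped:
                cursors[k] += 1
    return "".join(pattern)


VEREN = ["veren", "verni", "vqrna", "vqrno", "verniq",
         "verniqt", "vernite", "vqrnata", "vqrnoto"]
print(extract_pattern(VEREN))  # v*r*n
```

The heuristic is enough for the paradigms shown in this paper; paradigms with more complex alternations may need a proper alignment step.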
The pattern has two properties that are important for morphological analysis:
1. The length of the pattern is less than or equal to the length of every wordform of the paradigm generated from that pattern.
2. The pattern matches the beginning of every wordform
with a full coincidence of the constant letters.
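These two properties suggest a cheap test for whether a dictionary pattern is a candidate for a given wordform. The sketch below is one possible reading, not the paper's implementation: it assumes that each '*' stands for at most one letter (possibly none), which is expressed with the regular expression `.?`. The names `pattern_to_regex` and `matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as 'v*r*n' into a compiled prefix regex."""
    body = "".join(".?" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body)


def matches(pattern, wordform):
    """Check property 1 (length) and property 2 (prefix match on constants)."""
    if len(pattern) > len(wordform):
        return False
    # re.match anchors at the start of the wordform only, so any ending
    # that follows the matched prefix is ignored.
    return pattern_to_regex(pattern).match(wordform) is not None


print(matches("v*r*n", "vqrnata"))  # True
print(matches("v*r*n", "tesen"))    # False
```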
Automatic extraction of synthesis rules
The rules for the wordform generation are of 2 types:
1. Replacing the '*' with a letter (including the empty one '').
2. Appending endings.
For example the rules for the above-mentioned inflectional
type are the following:
| number/wordform | rules |
| 1. veren | */e; */e |
| 2. verni | */e; */'' +i |
| 3. vqrna | */q; */'' +a |
| 4. vqrno | */q; */'' +o |
| 5. verniq | */e; */'' +i+q |
| 6. verniqt | */e; */'' +i+qt |
| 7. vernite | */e; */'' +i+te |
| 8. vqrnata | */q; */'' +a+ta |
| 9. vqrnoto | */q; */'' +o+to |
The pattern and the inflectional type number carry the information for the whole paradigm of a particular word. For every wordform, the inflectional type holds a list of letters (including the empty one, '') that replace the '*' symbols of the pattern, and the morphemes that are appended to the pattern.
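As an illustration of how such rules could be read off a paradigm, the sketch below recovers, for one wordform, the letters that replace each '*' and the appended ending. It is an assumption-laden sketch rather than the paper's procedure: it relies on a backtracking regular expression for the alignment, and it returns the ending as a single string instead of splitting it into morphemes (the paper writes '+i+q', the sketch returns 'iq'). The name `derive_rules` is hypothetical.

```python
import re


def derive_rules(pattern, wordform):
    """Return (replacement letters for each '*', appended ending)."""
    parts = "".join("(.?)" if ch == "*" else re.escape(ch) for ch in pattern)
    m = re.fullmatch(parts + "(.*)", wordform)
    if m is None:
        raise ValueError(f"{wordform!r} does not fit pattern {pattern!r}")
    *replacements, ending = m.groups()
    return tuple(replacements), ending


print(derive_rules("v*r*n", "veren"))    # (('e', 'e'), '')
print(derive_rules("v*r*n", "vqrnata"))  # (('q', ''), 'ata')
```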
The morphological generation of the paradigm is based on the following simple mechanism:
Every wordform can be constructed from the pattern by applying the replacing rules (*/letter) and the appending rules (+morpheme) listed after the wordform number. Once extracted, the rules for one member of an inflectional type are the same for all other members of that type. The table below shows some words belonging to different inflectional types:
| word | some rules | pattern |
| teat=r | reduction of '=' | teat*r |
| vqra | q/e, a/i | v*r* |
| analog | g/z | analo* |
| vqt=r | q/e, reduction of '=' | v*t*r |
Data structures for morphological synthesis
Array of letters and morphemes for replacing and appending.
Every rule has a number equal to its index in an array of strings. The replacing rules have numbers 30-48, where rule 30 replaces '*' with the empty letter. The appending rules have numbers 1-29. The appending rules for verbs are kept separately because of their specific endings.
| 1 a | 8 jo | 15 sa | 22 ;ove | 29 jte | 36 j | 43 c |
| 2 e | 9 na | 16 ta | 23 _ | 30 '' | 37 k | 44 h |
| 3 eve | 10 ne | 17 te | 24 q | 31 a | 38 n | 45 w |
| 4 i | 11 o | 18 to | 25 qt | 32 g | 39 o | 46 = |
| 5 ili]a | 12 ove | 19 =t | 26 ko | 33 e | 40 s | 47 [ |
| 6 i]a | 13 ovce | 20 ; | 27 ete | 34 z | 41 t | 48 q |
| 7 iq | 14 vci | 21 ;o | 28 j | 35 i | 42 x |
Tables with paradigm generation rules
Tables have been created for every part of speech. The rows of a table correspond to the inflectional types. Every row consists of lists of generation rule numbers, one list per wordform. For example, the noun 'vqt=r' with inflectional type number 4 has the pattern 'v*t*r'. The 4th row of the noun generation table is:
| Wordform number | 1. | 2. | 3. | 4. | 5. |
| Grammatical features | base form | short def. art. | full def. art. | plural | plural, def. art. |
| List of rules | 48,46 | 48,46,1 | 48,46,19 | 33,30,12 | 33,30,12,17 |
In fact, some of the rules (e.g. those for the definite article) are kept outside the table for compactness, because they are the same for a whole subset of inflectional types.
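The row above can be executed directly. The following sketch assumes that replacing rules (numbers 30-48) fill the '*' symbols of the pattern from left to right and that appending rules (numbers 1-29) are concatenated in the order given. `RULES` contains only the entries of the rule array needed for this row; the function names are hypothetical.

```python
# Subset of the rule array: rule number -> letter or morpheme.
RULES = {1: "a", 12: "ove", 17: "te", 19: "=t",
         30: "", 33: "e", 46: "=", 48: "q"}
FIRST_REPLACING_RULE = 30  # rules 30-48 replace '*', rules 1-29 append

# Row 4 of the noun generation table: one rule list per wordform.
NOUN_TYPE_4 = [[48, 46], [48, 46, 1], [48, 46, 19],
               [33, 30, 12], [33, 30, 12, 17]]


def apply_rules(pattern, rule_numbers):
    """Build one wordform from a pattern and a list of rule numbers."""
    stem, endings = pattern, []
    for number in rule_numbers:
        if number >= FIRST_REPLACING_RULE:
            stem = stem.replace("*", RULES[number], 1)  # leftmost '*' only
        else:
            endings.append(RULES[number])
    return stem + "".join(endings)


def generate_paradigm(pattern, rows):
    return [apply_rules(pattern, row) for row in rows]


print(generate_paradigm("v*t*r", NOUN_TYPE_4))
# ['vqt=r', 'vqt=ra', 'vqt=r=t', 'vetrove', 'vetrovete']
```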
Morphological analysis
The goal of automatic morphological analysis is to classify an arbitrary wordform morphologically. This includes identifying the base form of the word, its grammatical features and the inflectional type (part of speech) to which it belongs. In case of homonymy (when the wordform belongs to more than one inflectional type and has different grammatical features) all possible types must be found.
A machine dictionary consists of (word-pattern, inflectional type number) entries. When an arbitrary wordform has to be classified, the analyser looks up a matching word-pattern in the dictionary. If such a pattern is found, the second part of the entry (the inflectional type number) is used to fetch the rules from the generation table, and a paradigm is generated from the pattern on the basis of these rules. If the analysed word coincides with a wordform from the generated paradigm, it obtains the grammatical features of that wordform. In this way the word is completely determined morphologically. It obtains 2 formal features:
1. An inflectional type number
2. A wordform number in the paradigm of that type.
Moreover, the first wordform of the generated paradigm is the base form of the analysed word. The inflectional type number determines the part of speech the analysed word belongs to.
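The analysis loop described above can be sketched as follows, using the noun 'vqt=r' (type 4) as the only dictionary entry. This is a simplified illustration under stated assumptions: the dictionary is scanned linearly (the paper does not describe its lookup structure), each '*' is assumed to stand for at most one letter, and only the rules needed for this entry are included. All names are hypothetical.

```python
import re

RULES = {1: "a", 12: "ove", 17: "te", 19: "=t",
         30: "", 33: "e", 46: "=", 48: "q"}
GENERATION_TABLE = {  # inflectional type number -> rule lists per wordform
    4: [[48, 46], [48, 46, 1], [48, 46, 19], [33, 30, 12], [33, 30, 12, 17]],
}
DICTIONARY = [("v*t*r", 4)]  # (word-pattern, inflectional type number)


def apply_rules(pattern, rule_numbers):
    stem, endings = pattern, []
    for number in rule_numbers:
        if number >= 30:  # replacing rule: fill the leftmost '*'
            stem = stem.replace("*", RULES[number], 1)
        else:             # appending rule
            endings.append(RULES[number])
    return stem + "".join(endings)


def is_candidate(pattern, wordform):
    body = "".join(".?" if ch == "*" else re.escape(ch) for ch in pattern)
    return len(pattern) <= len(wordform) and re.match(body, wordform) is not None


def analyse(wordform):
    """Return every (base form, type number, wordform number) reading."""
    readings = []
    for pattern, type_number in DICTIONARY:
        if not is_candidate(pattern, wordform):
            continue
        rows = GENERATION_TABLE[type_number]
        paradigm = [apply_rules(pattern, row) for row in rows]
        for number, form in enumerate(paradigm, start=1):
            if form == wordform:
                readings.append((paradigm[0], type_number, number))
    return readings


print(analyse("vetrovete"))  # [('vqt=r', 4, 5)]
```

Returning a list rather than a single reading keeps all results in the case of homonymy, where one wordform belongs to several inflectional types.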
For example, if that number is between 1 and 75, the word is a noun, etc. (see table below).
| Type numbers | Part of speech |
| 1-75 | Noun |
| 76-89 | Adjective |
| 90-130 | Pronoun |
| 131-141 | Numeral |
| 142-187 | Verb |
| 188 | Adverb |
| 189 | Particle |
| 190 | Preposition |
| 191 | Conjunction |
| 192 | Interjection |
When the inflectional type belongs to a subset of some part of speech, the type carries grammatical features that are missing from the paradigm. For example, the gender of nouns: if the inflectional type number is in the range 1-40, the gender is masc., etc. (see table below).
| Type numbers | Subset |
| 1-40 | masc. |
| 41-53 | fem. |
| 54-73 | neut. |
| 74-75 | only plural |
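Both range tables lend themselves to a simple lookup by upper bound. The sketch below is one way to encode them, using binary search over the upper ends of the ranges; the function names are hypothetical, and the ranges are copied from the two tables above.

```python
from bisect import bisect_left

# Upper end of each type-number range, in ascending order.
POS_UPPER_BOUNDS = [75, 89, 130, 141, 187, 188, 189, 190, 191, 192]
POS_NAMES = ["noun", "adjective", "pronoun", "numeral", "verb", "adverb",
             "particle", "preposition", "conjunction", "interjection"]

GENDER_UPPER_BOUNDS = [40, 53, 73, 75]
GENDER_NAMES = ["masc.", "fem.", "neut.", "only plural"]


def part_of_speech(type_number):
    if not 1 <= type_number <= POS_UPPER_BOUNDS[-1]:
        raise ValueError(f"unknown inflectional type {type_number}")
    return POS_NAMES[bisect_left(POS_UPPER_BOUNDS, type_number)]


def noun_gender(type_number):
    """Return the gender subset for noun types, or None for other types."""
    if not 1 <= type_number <= GENDER_UPPER_BOUNDS[-1]:
        return None
    return GENDER_NAMES[bisect_left(GENDER_UPPER_BOUNDS, type_number)]


print(part_of_speech(4), noun_gender(4))  # noun masc.
```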
Results and Analysis
The morphological dictionary comprises 67500 patterns, which produce over 1.5 million wordforms. It is stored in compressed form and occupies 240 Kb. Before analysis the dictionary is loaded into memory, which allows analysis at a maximum speed of a thousand words per second on a PC-386/40 MHz (relative speed 5260%, with the original PC = 100%). The software is written in Borland Pascal 7.0 and runs under MS DOS.
References
[1] Anick P., Artemieff S. A High-level Morphological Description Language Exploiting Inflectional Paradigms. COLING'92 - 15th Int. Conf. on Computational Linguistics, Nantes, 23-28 Aug. 1992, pp. 67-73.
[2] Erjavec T., Tancig P. An integrated system for morphological analysis of the Slovene language. COLING'90 - 13th Int. Conf. on Computational Linguistics, Helsinki, 1990, vol. 1, pp. 86-88.
[3] Karttunen L., Kaplan R., Zaenen A. Two Level Morphology with Composition. COLING'92 - 15th Int. Conf. on Computational Linguistics, Nantes, 23-28 Aug. 1992, pp. 141-148.
[4] Kichovich M. A Declarative Description of the Inflectional Morphology and Morphological Analysis and Synthesis. Conf. on AI, Varna, 1987.
[5] Koskenniemi K. A General Computational Model for Word-Form Recognition and Production. COLING'84 - 10th Int. Conf. on Computational Linguistics, Stanford, 2-6 July 1984, pp. 178-181.
[6] Krustev B. The Bulgarian Morphology in 187 type tables. NI, Sofia, 1984.
[7] Simov K., Angelova G., Paskaleva E. MORPHO-ASSISTANT: The Proper Treatment of Morphological Knowledge. COLING'90 - 13th Int. Conf. on Computational Linguistics, Helsinki, 1990, vol. 3, pp. 455-457.