LANGUAGE PROCESSING TOOLS
FOR BULGARIAN

 

Christo T. Tanev, Hristo Krushkov

Department of Computer Science

University of Plovdiv “Paisii Hilendarski”

24 Tzar Assen St, 4000 Plovdiv, Bulgaria

chritan@ulcc.uni-plovdiv.bg, hdk@ulcc.uni-plovdiv.bg

Abstract

This paper describes a set of language processing tools for Bulgarian: (i) a morphological processor, (ii) an integrated system for language processing and (iii) an evaluator. These programs provide the basic tools for automatic processing of Bulgarian.

The morphological processor BULMORPH was built on the basis of the common properties of inflective morphology. It performs automatic morphological analysis and synthesis and implements procedures for unknown-word guessing. The morphological dictionary contains 67,500 base forms, which cover over 1.5 million word forms.

Rules for syntactic agreement are also described in BULMORPH.

The integrated system for language processing, LINGUA, incorporates a part-of-speech tagger, a sentence splitter, a noun phrase parser and other language-processing modules.

The evaluator can evaluate the output of LINGUA and BULMORPH by comparing it with manually annotated corpora.

Keywords

agreement error correction, evaluation, lexicography, morphological analysis, part of speech tagging, unknown word guessing

 

1 Introduction

There is a gap between the need for linguistic software for Bulgarian and the linguistic software actually in use. Spell-checking dictionaries of Bulgarian and bilingual computer dictionaries (Bulgarian-English, Bulgarian-German and others) are in use. Few programs for language processing of Bulgarian have been created, though formal theories of Bulgarian grammar exist [Пенчев 1993].

We tried to make up for this deficiency in linguistic software.

We developed the morphological analyzer BULMORPH and LINGUA, a program for language processing of Bulgarian texts which automatically performs POS tagging, NP parsing, and sentence, clause and paragraph splitting.

An evaluation tool was developed to provide linguists with a means of evaluating the software output against annotated corpora.

All the linguistic programs are designed to work together; their input and output formats support joint use of the programs.

The morphological analyzer BULMORPH can work as an independent program. It is also a built-in module of the integrated program for language processing LINGUA.

LINGUA incorporates a part-of-speech tagger, NP parser, sentence splitter, paragraph splitter, clause chunker and section heading identifier.

The output of LINGUA has a format which can be evaluated by the evaluator ChriEVAL.

2 BULMORPH – Bulgarian Morphological Processor

A morphological processor for Bulgarian was built on the basis of the common properties of inflective morphology [Totkov, Krushkov, Krushkova’88]. It is a tool performing automatic morphological analysis and synthesis. Bulgarian belongs to the group of inflective languages, and its inflection is described as a number of grammatical rules. Bulgarian inflection is classified in view of these rules and the grammatical features of the words. The classification contains 187 inflectional types, divided into parts of speech (POS). From a mathematical point of view, Bulgarian words are divided into disjoint equivalence classes. Every class has a unique machine number for identification and a list of rules for generating the paradigm. A part of speech is a set of classes. Every set can be divided into subsets depending on criteria pertaining to that particular part of speech. Two words are in the same class if their paradigms are generated in the same way. The paradigm is described as a list of wordforms, each with specific grammatical features. Every wordform also has a number. Two wordforms with equal numbers have the same grammatical features.

For example, in the paradigm of the adjectives, wordform number 1 has the grammatical features (masc., sing.), wordform number 2 has the grammatical features (pl.), etc. For all parts of speech, wordform number 1 is the base (citation) form. Every wordform thus obtains two formal features: the inflectional type number and the wordform number.

The inflectional type number determines the part of speech the analyzed word belongs to.

2.1 Morphological synthesis

For every word a pattern is built. The pattern and the inflectional type number determine the paradigm of that word. The pattern shows which letters are constant in all wordforms of the paradigm and which change. The changing letters are marked with ‘*’ in the pattern. The pattern can be extracted from the paradigm automatically [Krushkov, Tanev, Krushkova’96]. The pattern has two properties that are important for morphological analysis:

  • The length of the pattern is less than or equal to the length of every wordform of the paradigm generated from that pattern.
  • The pattern matches the beginning of every wordform with a full coincidence of the constant letters.

The rules for the generation of wordforms are of two types:

  • Replacing the ‘*’ with a letter (including the empty one ‘’).
  • Appending endings.

The pattern and the inflectional type number together encode the whole paradigm of a particular word. For every wordform, the inflectional type specifies a list of letters (including the empty character) that replace the ‘*’ symbols of the pattern, and morphemes to append to the pattern.

The morphological generation of the paradigm is based on the following simple mechanism:

Every wordform can be constructed from the pattern by applying the rules of replacing (*/letter) and appending (+ morpheme) listed under the wordform number. Once extracted, the rules for one member of an inflectional type are the same for all other members of this type.
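The generation mechanism can be illustrated with a minimal Python sketch. The rule layout (a list of replacement letters, one per ‘*’, plus an ending), the pattern “мал**” and the wordform number 3 are assumptions made for illustration; they are not taken from the BULMORPH tables.

```python
from typing import Dict, List, Tuple

# Assumed rule layout: (letters replacing each '*' in order, ending to append).
Rule = Tuple[List[str], str]

def generate_wordform(pattern: str, rule: Rule) -> str:
    """Replace every '*' of the pattern, then append the ending."""
    replacements, ending = rule
    if pattern.count("*") != len(replacements):
        raise ValueError("one replacement is needed per '*'")
    letters = iter(replacements)
    stem = "".join(next(letters) if ch == "*" else ch for ch in pattern)
    return stem + ending

def generate_paradigm(pattern: str, rules: Dict[int, Rule]) -> Dict[int, str]:
    """Apply the rules of one inflectional type to a pattern."""
    return {number: generate_wordform(pattern, rule)
            for number, rule in rules.items()}

# Invented toy inflectional type for the adjective "малък" (small).
TOY_TYPE: Dict[int, Rule] = {
    1: (["ъ", "к"], ""),    # masc. sing. (base form): малък
    2: (["к", "и"], ""),    # pl.: малки
    3: (["к", "и"], "я"),   # hypothetical number for a definite form: малкия
}

paradigm = generate_paradigm("мал**", TOY_TYPE)
print(paradigm)
```

Because the rules are stored per inflectional type, the same `TOY_TYPE` table could be reused with the pattern of any other word of that type.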

2.2 Morphological analysis

The goal of automatic morphological analysis is to classify an arbitrary wordform morphologically. This includes identifying the base form of the word, its grammatical features and the inflectional type (part of speech) to which it belongs. In the case of homonyms (when the wordform belongs to more than one inflectional type and has different grammatical features), all possible types must be found.

A machine dictionary consists of <word-pattern, inflectional type number> pairs. When an arbitrary wordform has to be classified, the analyzer looks up a matching word-pattern in the dictionary. If such a pattern is found, the second part of the entry pair (the inflectional type number) is used to extract the rules from the generation table. On the basis of these rules a paradigm is generated from the pattern. If the analyzed word coincides with a wordform of the generated paradigm, it obtains the grammatical features of that wordform. In this way the word is completely determined morphologically.
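The lookup-and-generate procedure might look as follows in Python. This is a sketch under stated assumptions: the type number 901, the generation table, the feature strings and the linear scan over the dictionary are all invented for illustration and say nothing about how BULMORPH stores or searches its data.

```python
from typing import Dict, List, Tuple

Rule = Tuple[List[str], str]   # assumed layout: ('*' replacements, ending)

# Invented toy data; the type number 901 is hypothetical.
GENERATION_TABLE: Dict[int, Dict[int, Rule]] = {
    901: {1: (["ъ", "к"], ""), 2: (["к", "и"], "")},
}
FEATURES: Dict[int, Dict[int, str]] = {
    901: {1: "adjective, masc., sing.", 2: "adjective, pl."},
}
DICTIONARY: List[Tuple[str, int]] = [("мал**", 901)]

def build(pattern: str, rule: Rule) -> str:
    """Generate one wordform from a pattern and a rule."""
    letters = iter(rule[0])
    return "".join(next(letters) if ch == "*" else ch for ch in pattern) + rule[1]

def matches(pattern: str, word: str) -> bool:
    """True if the constant letters coincide with the beginning of the word."""
    return len(pattern) <= len(word) and all(
        p in ("*", w) for p, w in zip(pattern, word))

def analyse(word: str) -> List[Tuple[str, int, int, str]]:
    """Return all (base form, type number, wordform number, features) readings."""
    readings = []
    for pattern, type_no in DICTIONARY:
        if not matches(pattern, word):
            continue
        rules = GENERATION_TABLE[type_no]
        for form_no, rule in rules.items():
            if build(pattern, rule) == word:
                readings.append((build(pattern, rules[1]), type_no,
                                 form_no, FEATURES[type_no][form_no]))
    return readings

print(analyse("малки"))
```

Returning a list rather than a single reading keeps all hypotheses for homonyms, as the analysis requires.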

2.3 Robust Morphological Analysis

The robust morphological analyzer [Totkov, Krushkov’89] makes a probabilistic morphological classification of “unknown” word-forms, using an auxiliary dictionary of word endings. The robust analysis algorithm makes it possible to classify words, as well as their inflectional forms, which are not present in the dictionary. The algorithm is based on the links between word-form endings and the corresponding grammatical information. There are two ways of hypothetical morphological classification of unknown words:

  • (a) Comparison between the ending of the word-form under consideration and that of a pattern word-form (stored in a dictionary with related grammatical information);
  • (b) Recursive step-by-step separation of all possible prefixes from the word-form and analysis of the right part of the separation.

In both cases the analyzed word-form obtains the grammatical features of the dictionary word with the maximum number of matching letters in the endings of the compared words. Case (a) needs a model (inverted) dictionary; case (b) needs procedures for automatic word separation.
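The ending comparison of the first approach can be sketched in Python as below. The model word-forms and their feature strings are invented, and the sketch simply takes the model word with the longest shared ending; it does not attempt the probabilistic scoring or the interval dictionary of the actual analyzer.

```python
from typing import Dict, Optional, Tuple

# Invented model word-forms with their grammatical information.
MODEL_WORDS: Dict[str, str] = {
    "малки": "adjective, pl.",
    "работа": "noun, fem., sing.",
}

def common_ending_length(a: str, b: str) -> int:
    """Number of matching letters counted from the ends of both words."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def guess(word: str) -> Optional[Tuple[str, int]]:
    """Features of the model word sharing the longest ending, or None."""
    best = max(MODEL_WORDS, key=lambda model: common_ending_length(word, model))
    score = common_ending_length(word, best)
    return (MODEL_WORDS[best], score) if score > 0 else None

# "жалки" is absent from the toy dictionary but shares the ending "алки".
print(guess("жалки"))
```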

The model dictionary has been created as follows. Through morphological synthesis, all possible word-forms of the nouns, adjectives and verbs are generated. The set of these word-forms is divided into disjoint classes (intervals). All word-forms in a class have the same grammatical features and a pattern which matches all of them. The citation forms of the word-forms left out of the intervals (exceptional words) are stored in a dictionary that also contains the non-inflected parts of speech and the pronouns. After the exceptional words are struck off, the interval dictionary can be compressed further: neighbouring intervals with the same grammatical information are joined. In this way the number of patterns is reduced.

The presented approach reduces the volume of the dictionary needed for the analysis and makes it possible to analyse unknown words (words whose citation form is not in the dictionary) and misspelled words.

2.4 Automatic Checking of the Syntactic Agreement

The main purpose of the theoretical investigation is to formalize the grammatical rules of syntactic agreement and to find appropriate data structures and effective algorithms for checking the agreement. This is a very difficult task for Bulgarian because the grammatical rules have many exceptions, which can lead to wrong results. Furthermore, at this stage of our investigation semantic information is lacking, so word agreement can be checked only on the basis of the grammatical features of the words. The words are treated as parts of speech.

For every pair of neighbouring words [Krushkov, Krushkova 94-B] in the sentence, it can be checked whether the syntactic agreement between them is right or wrong. A table of agreement is needed for this purpose. For every cell of the table, a list of rules has been defined for right syntactic agreement between the parts of speech related to the cell. In filling in the table, all possible variants of right agreement between two words have to be borne in mind. Only morpho-syntactic information about the words is available for describing the grammatical rules: we know the part of speech of every word and, if the word is inflected, its grammatical features (gender, number, person). The table is used for checking the agreement of two words as follows:

  • The part of speech of the former word determines the row of the table in which the cell with rules is situated;
  • The part of speech of the latter word determines the column of the table in which the cell with rules is situated.

The agreement is right if the grammatical features satisfy some rule from the list of rules belonging to this cell.

Automatic checking of agreement is possible for neighbouring words. For every word, information is extracted from the morphological dictionary and compared with the information stored in the rules of the table of agreement. If the agreement is wrong, hypotheses for right agreement are suggested.
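A table of agreement of this kind could be represented as a mapping from (row, column) part-of-speech pairs to lists of rules. The Python sketch below is illustrative only: the tag names, the two simplified adjective-noun rules and the treatment of empty cells are assumptions, not the rules of the actual system.

```python
from typing import Callable, Dict, List, Tuple

Word = Dict[str, str]   # e.g. {"pos": "adj", "gender": "fem", "number": "sg"}
Rule = Callable[[Word, Word], bool]

def singular_same_gender(first: Word, second: Word) -> bool:
    """Both words are singular and share the same gender."""
    return (first.get("number") == second.get("number") == "sg"
            and first.get("gender") == second.get("gender"))

def both_plural(first: Word, second: Word) -> bool:
    """Both words are plural."""
    return first.get("number") == second.get("number") == "pl"

# Row = part of speech of the former word, column = that of the latter word.
AGREEMENT_TABLE: Dict[Tuple[str, str], List[Rule]] = {
    ("adj", "noun"): [singular_same_gender, both_plural],
}

def check_agreement(first: Word, second: Word) -> bool:
    """Agreement is right if some rule of the selected cell is satisfied."""
    rules = AGREEMENT_TABLE.get((first["pos"], second["pos"]))
    if rules is None:
        return True   # assumption: an empty cell means nothing is checked
    return any(rule(first, second) for rule in rules)

nova = {"pos": "adj", "gender": "fem", "number": "sg"}     # нова
kniga = {"pos": "noun", "gender": "fem", "number": "sg"}   # книга
print(check_agreement(nova, kniga))
```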

3 LINGUA - integrated linguistic software system

LINGUA is a program which performs the basic linguistic processing tasks; it can be used to pre-process text before other linguistic tasks. After processing, SGML markers are inserted in the text; they provide basic linguistic information (sentence, paragraph and clause boundaries, part-of-speech tags, NP boundaries, section headings). SGML marking is optional and can be configured by the user.

The SGML output of LINGUA contains all the information gained in the process of analysis. It may be passed directly as input to other linguistic processing programs.

3.1 Tools for Linguistic Processing of Bulgarian Texts

A series of text processing tools was developed and implemented in “Lingua”. These tools are intended for processing texts in Bulgarian. Some of the tools perform best on texts of the genre “Technical manuals” and on formal texts such as instructions and abstracts.

The linguistic processing tools are a tokeniser, sentence splitter, section heading extractor, clause chunker, morphological disambiguator and POS tagger, and noun phrase extractor.

All the linguistic processing tools are implemented in the integrated system for linguistic processing LINGUA.

3.1.1 Linguistic processing pipeline

“Lingua” analyses text given as an input text file. The program automatically performs sentence splitting, tokenisation, morphological analysis with a computer dictionary of Bulgarian and procedures for approximate morphological analysis, morphological disambiguation through grammatical rules and heuristics, NP extraction with a simple NP grammar, and segmentation of complex sentences into simple ones (see figure 1).

“Lingua” processes text as a pipeline: every processing stage passes its output to the input of the next stage. The text is processed in small parts, which are kept in memory. In this way intermediate file output is avoided, which increases the speed.
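One way to picture such an in-memory pipeline is with chained Python generators, where each small part of the text flows through every stage without being written to a file. The two stages below are placeholders with invented behaviour; only the chaining idea corresponds to the description above.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def tokenise(parts: Iterable[str]) -> Iterator[List[str]]:
    """Placeholder stage: split each text part on whitespace."""
    for part in parts:
        yield part.split()

def tag(token_lists: Iterable[List[str]]) -> Iterator[List[Tuple[str, str]]]:
    """Placeholder stage: attach a dummy tag to every token."""
    for tokens in token_lists:
        yield [(token, "UNKNOWN") for token in tokens]

def run_pipeline(parts: Iterable[str], stages: List[Callable]) -> Iterator:
    """Chain the stages lazily; the output of one stage feeds the next."""
    stream: Iterable = parts
    for stage in stages:
        stream = stage(stream)
    return iter(stream)

result = list(run_pipeline(["Click OK", "Save"], [tokenise, tag]))
print(result)
```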

3.1.2 Tokenisation and Sentence splitting

Tokenisation is performed in two stages. First, simple tokens are extracted and temporarily saved in a queue. After that, token synthesis rules are used to form more complex tokens; e.g. expressions in Latin letters form one token (“Word for Windows 6.0” is considered one token). To improve tokenisation, a dictionary of abbreviations is implemented in “Lingua”.
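The two stages can be sketched in Python as follows. The regular expression for simple tokens and the single synthesis rule (merge a run of Latin-letter words, together with any numbers that follow them) are assumptions chosen to reproduce the “Word for Windows 6.0” example; the actual synthesis rules and the abbreviation dictionary are not modelled.

```python
import re
from collections import deque
from typing import Deque, List

# Stage 1: numbers such as "6.0", runs of letters, or single punctuation marks.
SIMPLE_TOKEN = re.compile(r"\d+(?:\.\d+)*|[^\W\d_]+|[^\w\s]")

def simple_tokens(text: str) -> Deque[str]:
    """Extract simple tokens and keep them in a queue."""
    return deque(SIMPLE_TOKEN.findall(text))

def is_latin_word(token: str) -> bool:
    return token.isascii() and token.isalpha()

def is_number(token: str) -> bool:
    return token[0].isdigit()

def synthesise(queue: Deque[str]) -> List[str]:
    """Stage 2: merge an expression in Latin letters into one complex token."""
    tokens = []
    while queue:
        token = queue.popleft()
        if is_latin_word(token):
            parts = [token]
            while queue and (is_latin_word(queue[0]) or is_number(queue[0])):
                parts.append(queue.popleft())
            token = " ".join(parts)
        tokens.append(token)
    return tokens

tokens = synthesise(simple_tokens("Работа с Word for Windows 6.0."))
print(tokens)
```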

Sentence splitting influences the precision of all stages of linguistic processing, because it is executed at the beginning of the pre-processing (see figure 1). The performance of the sentence splitter is therefore important. The text is divided into sentences through rules for identifying sentence boundaries.

Rule 1: <.> / <Text with capital letter>

Rule 2: <:> / <Text with capital letter>

Rule 3: <new line> / <Text with capital letter>

Nine end-of-sentence rules are implemented in the system.

On a text with 190 sentences, the precision is 92% and the recall is 99%.
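Rules 1 to 3 can be approximated with a single regular expression, as in the hedged Python sketch below. The expression is an assumption about how the rules might be encoded; the remaining six rules and the abbreviation dictionary are not covered, so an abbreviation followed by a capitalised word would be split incorrectly here.

```python
import re
from typing import List

# A boundary is whitespace after '.' or ':', or a line break,
# provided that the following text starts with a capital letter.
BOUNDARY = re.compile(r"(?:(?<=[.:])\s+|\n+)(?=[A-ZА-Я])")

def split_sentences(text: str) -> List[str]:
    """Split text at the boundaries described by rules 1 to 3."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

print(split_sentences("Open the file. Click OK"))
print(split_sentences("Heading\nNext sentence"))
```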

Figure 1. The data-flow diagram of “Lingua” (rectangles are algorithms).

3.1.3 Paragraph marking

The beginning and the end of a paragraph are identified by the left indent at the beginning of the paragraph, or by the space between the end of the last sentence of the previous paragraph and the right margin. An experiment on 268 paragraph marks shows 94% precision and 98% recall.

3.1.4 Morphological analysis

Every token is analyzed morphologically. First, precise analysis is performed through the BULMORPH analyser. Words in Latin letters beginning with a capital letter are considered nouns (“Word 97”, “Borland Pascal”, “MS DOS”, “Windows”, “C”).

If precise analysis fails, BULMORPH applies its own procedures for guessing unknown words.

3.1.5 Morphological disambiguation

Morphological disambiguation is performed through grammar rules and heuristics, which take into account the context of the word and choose only one morphological hypothesis.

An example of such a heuristic (specific to Bulgarian):

If X is a homonym

with two hypotheses, adverb or adjective,

and X is followed by an adjective or noun which is not of neuter gender or whose number is plural

Then

X is an adverb
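The heuristic can be written as a small Python function. The tag and feature names are assumptions made for this sketch. The rule is plausible because Bulgarian adverbs derived from adjectives coincide with the neuter singular form of the adjective, so an adjective reading of X could only agree with a following neuter singular word.

```python
from typing import Dict, Iterable, Optional

def resolve_adverb_or_adjective(hypotheses: Iterable[str],
                                next_word: Optional[Dict[str, str]]) -> Optional[str]:
    """Return "adverb" when the heuristic applies, otherwise None."""
    if set(hypotheses) != {"adverb", "adjective"}:
        return None                      # X is not this kind of homonym
    if next_word is None or next_word.get("pos") not in {"adjective", "noun"}:
        return None                      # the context condition is not met
    if next_word.get("gender") != "neuter" or next_word.get("number") == "plural":
        return "adverb"                  # X cannot agree as an adjective
    return None                          # still ambiguous; other rules decide

following = {"pos": "noun", "gender": "feminine", "number": "singular"}
print(resolve_adverb_or_adjective(["adverb", "adjective"], following))
```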

The program uses 33 heuristics and grammar rules for disambiguation. They were obtained manually. There are no large corpora of electronic texts in Bulgarian, so we had to abandon the idea of a stochastic tagger such as an HMM tagger; instead we studied manually how syntax is reflected in the linear order of the words in the sentence. Moreover, Atro Voutilainen claims in “A syntax-based part-of-speech tagger” that the precision of rule-based taggers is not lower than that of stochastic ones [Voutilainen’95].

The rules take into account the context of the ambiguous word, which begins one or two words before it and ends one or two words after it.

If the rules fail to resolve the ambiguity, a frequency table is used and the grammatical form which is used more frequently in the genre is preferred.

The precision of the part of speech tagger is about 95%.

3.1.6 NP extraction

The module for noun phrase extraction uses the following grammar:

NP’ -> (Dm pron) (Qu) (AdjP) N

AdjP -> (Adv) (по- | най-) Adj (Prep Pron)

NP -> N | (NP,)N NP и (and) NP | NP’ Prep NP

N - noun, Adj - adjective, Qu - quantifier: “many”, “little”, “some”, numeral.

Pron - pronoun

 

NPs are represented as trees, which are part of the structure representing the sentence.

Every word and NP in the sentence is represented through an attribute structure, which contains the type of the word (phrase), gender, number, definiteness and person (for verbs and pronouns).

The recall of NP extraction, measured against 352 NPs, is 77% and the precision is 63.5%.
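The simplest part of the grammar, the rule for NP’, can be matched over a sequence of part-of-speech tags with a regular expression, as in the Python sketch below. The tag names and one-letter codes are invented, “Dm pron” is read here as a demonstrative pronoun (an assumption), and AdjP is reduced to an optional adverb plus an adjective. Coordination, prepositional attachment and tree building are left out.

```python
import re
from typing import Dict, List, Tuple

# One-letter codes for the tags used by the NP' rule; tag names are assumptions.
TAG_CODE: Dict[str, str] = {
    "dem_pron": "D", "quantifier": "Q", "adverb": "R",
    "adjective": "A", "noun": "N",
}

# NP' -> (Dm pron) (Qu) (AdjP) N, with AdjP simplified to (Adv) Adj.
NP_PRIME = re.compile(r"D?Q?(?:R?A)?N")

def find_np_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs of NP' matches, end exclusive."""
    code = "".join(TAG_CODE.get(tag, "x") for tag in tags)
    return [match.span() for match in NP_PRIME.finditer(code)]

print(find_np_spans(["verb", "prep", "dem_pron", "adjective", "noun"]))
```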

3.1.7 Clause identification in simple sentences

A heuristic algorithm for identifying simple-sentence boundaries is implemented in “Lingua”. Complex sentences are divided into simple sentences in the following way. First, finite verb forms are sought: every finite verb identifies one simple sentence. Then the boundaries between the simple sentences are sought. First, complex demarcating phrases such as “за да” (“in order to”) and “тъй като” (“since”) are sought, as well as k-words (the Bulgarian equivalent of wh-words) such as “защо” (“why”) and “кой” (“who”). If no such forms are found between two finite verbs, conjunctions are searched for, excluding “и” and “или” (“and”, “or”), as these take part in NP formation. After that, “и”, “или” and the comma are looked for as a boundary. If the boundary search is still unsuccessful, slashes and adverbs are sought, which can also stand on the boundary of two simple sentences.

On a text with 97 clause boundaries, the precision is 71% and the recall is 81%.
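The cascade of boundary tests can be sketched in Python as an ordered list of predicates tried between each pair of finite verbs. The tag names are assumptions, complex demarcating phrases are assumed to arrive as single tokens, and taking the first matching token between two verbs is a simplification of this sketch.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]   # (word, tag); tag names are assumptions

DEMARCATORS = {"за да", "тъй като"}   # assumed to be single complex tokens
NP_CONJUNCTIONS = {"и", "или"}

# Boundary tests in the order of priority described above.
TESTS: List[Callable[[Token], bool]] = [
    lambda t: t[0] in DEMARCATORS or t[1] == "k_word",
    lambda t: t[1] == "conj" and t[0] not in NP_CONJUNCTIONS,
    lambda t: t[0] in NP_CONJUNCTIONS or t[0] == ",",
    lambda t: t[0] == "/" or t[1] == "adverb",
]

def clause_boundaries(tokens: List[Token]) -> List[int]:
    """Return one boundary index per pair of neighbouring finite verbs."""
    verbs = [i for i, (_, tag) in enumerate(tokens) if tag == "finite_verb"]
    boundaries = []
    for left, right in zip(verbs, verbs[1:]):
        for test in TESTS:
            hit = next((i for i in range(left + 1, right) if test(tokens[i])), None)
            if hit is not None:
                boundaries.append(hit)
                break
    return boundaries

sentence = [("Щракнете", "finite_verb"), ("върху", "prep"), ("бутона", "noun"),
            (",", "punct"), ("за да", "conj"), ("затворите", "finite_verb"),
            ("прозореца", "noun")]
print(clause_boundaries(sentence))
```

In the example the demarcating phrase wins over the comma because its test comes earlier in the list.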

3.1.8 Section heading extraction

“Lingua” uses two heuristics for recognizing section captions:

1. A single sentence in a paragraph without finite verb forms is a caption (specific to Bulgarian technical manuals). Most section headings in technical manuals are verbless sentences (“Работа с Word 97”, literally “Work with Word 97”); actions are expressed through verbal nouns (“работа” instead of “работя”, “създаване” instead of “създам”, etc.).

2. A single sentence in a paragraph with capital letters is a caption.
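Both heuristics fit into a short Python predicate. The tag name `finite_verb` is an assumption, and the second heuristic is read here as “written entirely in capital letters”, which is one possible interpretation of the rule.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]   # (word, tag) pairs; tag names are assumptions

def is_heading(paragraph: List[Sentence]) -> bool:
    """Apply the two caption heuristics to a paragraph given as sentences."""
    if len(paragraph) != 1:
        return False                       # a caption is a single sentence
    sentence = paragraph[0]
    no_finite_verb = all(tag != "finite_verb" for _, tag in sentence)
    words = [word for word, _ in sentence if any(ch.isalpha() for ch in word)]
    all_capitals = bool(words) and all(word.isupper() for word in words)
    return no_finite_verb or all_capitals

caption = [[("Работа", "noun"), ("с", "prep"), ("Word 97", "noun")]]
print(is_heading(caption))
```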

The information provided from the linguistic processing is inserted in the text through SGML markers:

Щракнете върху <NP n=1>картината в <NP n=2>третата колона на <NP n=3>третия ред</NP n=3></NP n=2></NP n=1> <EOS>

Click on the <NP n=1>picture in <NP n=2>the third column of <NP n=3>the third line</NP n=3></NP n=2></NP n=1> <EOS>

3.2 Software implementation

“Lingua” is a 32-bit Windows application; it runs under Windows 95/98.

The following dictionaries are used by “Lingua”:

1) Morphological dictionary of Bulgarian language

2) Dictionary of abbreviations with 59 abbreviations

The text for analysis must be a text file in ASCII or ANSI encoding. The program processes only “pure” text files. The user can choose the encoding of the input and output files. SGML marking is configured by the user through a dialog box. As the output can be configured according to the user's preferences and every processing stage can write its results to the output file, LINGUA works as an integrated set of linguistic tools. The table below lists the precision and recall of the tools and the SGML markers which every tool uses to insert the results of its analysis.

RESULTS:

Different language-processing stages in LINGUA have different precision. As the tools work together as a pipeline for language processing, the precision of the first level is crucial. The first level of processing is sentence splitting, paragraph splitting and part-of-speech tagging. Their precision is comparatively good, especially for the POS tagger (95%). Considering that the tagger works with manually extracted rules, 95% is a good beginning.

Tool                Precision  Recall  SGML tag
sentence splitter   92%        99%     <EOS>
paragraph splitter  94%        98%     <P>
clause chunker      71%        81%     <CLAUSE>
POS tagger          95%        -       </PosTag 71,1>
NP parser           63.5%      77%     <NP n=1> </NP n=1>

 

 

4 ChriEVAL - a tool for evaluation

ChriEVAL is a program which compares SGML-marked text files. The program takes as its input two SGML-marked texts: one is the SGML output of linguistic software and the other is a manually annotated SGML corpus.

The user chooses the SGML tags to be compared. Although many different kinds of tags may be used in both files, the program computes precision and recall only for the tag chosen by the user.

The program can process two basic classes of tags. The first class is single tags, such as the “end of sentence” tag <EOS>. An example of the second class is NP tags: every noun phrase is marked with two tags, one for the beginning of the phrase, <NP>, and one for its end, </NP>.

The tag is supplied by the user, its type is also chosen. After that the positions of the chosen tags are compared in both files.

In comparing the positions of the tags, ChriEVAL takes into account the number of symbols from the beginning of the file to the beginning of the tag, where spaces, end of line symbols and other SGML markers are ignored. This means that ChriEVAL evaluates properly even if the evaluated output is changed from the original through inserting spaces, new lines and SGML markers.
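The position measure can be sketched in Python as a character count that skips whitespace and every SGML marker. The regular expression for markers and the function name are assumptions of this sketch, not details of ChriEVAL.

```python
import re
from typing import List

SGML_TAG = re.compile(r"<[^>]*>")   # assumed shape of an SGML marker

def tag_positions(text: str, tag: str) -> List[int]:
    """Positions of `tag`, counted in characters that are neither
    whitespace nor part of any SGML marker."""
    positions = []
    count = 0
    index = 0
    while index < len(text):
        match = SGML_TAG.match(text, index)
        if match:
            if match.group() == tag:
                positions.append(count)
            index = match.end()          # markers never advance the count
        else:
            if not text[index].isspace():
                count += 1
            index += 1
    return positions

plain = "Ab c.<EOS> De.<EOS>"
reformatted = "<P>Ab  c.\n<EOS><NP>De</NP>.<EOS>"
print(tag_positions(plain, "<EOS>"), tag_positions(reformatted, "<EOS>"))
```

Both texts give the same positions although the second one contains extra spaces, a line break and other markers.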

If Nc is the number of the tags, whose positions coincide in both files, Na is the number of the tags in annotated corpora and No is the number of the tags in the file-output, then precision is calculated through

(1) Precision = Nc/ No

and recall is

(2) Recall = Nc/Na
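Formulas (1) and (2) translate directly into Python once the tag positions of both files are available. The function below is a sketch: it counts coinciding positions with a multiset intersection and returns 0.0 when a file contains no tags, which is an assumption rather than documented behaviour.

```python
from collections import Counter
from typing import Iterable, Tuple

def precision_recall(output_positions: Iterable[int],
                     annotated_positions: Iterable[int]) -> Tuple[float, float]:
    """Compute (precision, recall) from tag positions in the two files."""
    output = Counter(output_positions)
    annotated = Counter(annotated_positions)
    nc = sum((output & annotated).values())   # positions that coincide
    no = sum(output.values())                 # tags in the program output
    na = sum(annotated.values())              # tags in the annotated corpus
    precision = nc / no if no else 0.0
    recall = nc / na if na else 0.0
    return precision, recall

# Three tags in the output, two in the annotation, two coincide.
precision, recall = precision_recall([4, 7, 12], [4, 7])
print(precision, recall)
```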

ChriEVAL outputs the results on the screen right after the processing.

The performance of LINGUA was evaluated with ChriEVAL because its output is an SGML-marked file (see figure 1). The evaluation statistics in section 3 were obtained through ChriEVAL. ChriEVAL is a Windows application; it processes text files in ANSI/ASCII format.

This program can work jointly with the WinWord annotation tool, also with any SGML tagger.

 

References

Krushkov Hr., Krushkova M. “Methodology and Computational Tools for Compression and Searching in Machine Dictionaries”, Proceedings of XXIII-th spring conference of the Union of Bulgarian Mathematicians, April 1-4, 1994, Stara Zagora, pp. 388-394.

Krushkov Hr., Tanev Hr., Krushkova M., “Automatic extraction of a pattern and synthesis rules from the paradigm”, Proceedings of VII-th National Conference “Contemporary tendencies in the Development of the Fundamental and Applied Sciences”, 6-7 June 1996, Stara Zagora, Bulgaria, pp. 167-171.

Totkov G., Krushkov Hr. “Robust Morphological Analysis of Bulgarian Language”, Proceedings of conference Intelligent management systems, September, 1989, Varna-Druzhba, pp. 141-147.

Totkov G., Krushkov Hr., Krushkova M. “Formalisation of Bulgarian Language and the Development of a Linguistic Processor (Morphology)”, - Universite de Plovdiv, Travaux scientifiques, vol.26, fasc.3,1988-Mathematique.

Voutilainen Atro, “A syntax-based part-of-speech tagger”, Proceedings of the seventh conference of the European Chapter of Association for Computational Linguistics, Dublin 1995

Пенчев Й., “Български синтаксис. Управление и свързване.” [Bulgarian Syntax. Government and Binding], Пловдив, 1993.