Automatic checking of the syntactic agreement

Theoretical background

The main purpose of the theoretical investigation is to formalise the grammatical rules of syntactic agreement and to find appropriate data structures and efficient algorithms for checking it. This is a very difficult task for the Bulgarian language because the grammatical rules have many exceptions, which can lead to wrong results. Furthermore, at this stage of our investigation no semantic information is available, so the agreement of words can be checked only on the basis of their grammatical features. The words are treated as parts of speech.

For every pair of neighbouring words in the sentence, the syntactic agreement between them can be checked as right or wrong. A table of agreement is needed for this purpose. For every cell of the table a list of rules has been defined that describes right syntactic agreement between the two parts of speech related to the cell. For example, if the former word of the pair is an adjective and the latter word is a noun, the agreement rules are taken from the cell marked with '+' in the table below.

            1.    2.    3.       4.    5.    6.    7.     8.     9.
            Noun  Adj.  Pronoun  Num.  Verb  Adv.  Part.  Prep.  Conj.
1. Noun
2. Adj.     +
3. Pronoun
4. Num.
5. Verb
6. Adv.
7. Part.
8. Prep.
9. Conj.

When filling in the table we have to bear in mind all possible variants of right agreement of two words. Only morpho-syntactic information about the words is available for describing the grammatical rules. That means we know the part of speech of every word and, if the word is inflecting, its grammatical features (gender, number, person). The table is used for checking the agreement of two words as follows:

  1. The part of speech of the former word determines the row of the table in which the cell with rules is situated.
  2. The part of speech of the latter word determines the column of the table in which the cell with rules is situated.
  3. The agreement is right if the grammatical features of the two words satisfy some rule from the list of rules belonging to this cell.

An appropriate data structure is necessary so that this process can be performed quickly.
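The three lookup steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Word` record, the dictionary keyed by (row, column) and the single rule `same_gender_and_number` are hypothetical names, and treating an empty cell as "no agreement" is a choice made for the sketch, not something the text prescribes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Part-of-speech indices follow the numbering of the agreement table.
NOUN, ADJ = 1, 2


@dataclass(frozen=True)
class Word:
    pos: int                # part-of-speech index (row or column of the table)
    gender: Optional[str]   # "masc", "fem", "neut", or None if not applicable
    number: Optional[str]   # "sing", "pl", or None if not applicable


Rule = Callable[[Word, Word], bool]


def same_gender_and_number(first: Word, second: Word) -> bool:
    # Illustrative rule only: singular forms must share gender and number,
    # plural forms must share number.
    if first.number != second.number:
        return False
    return first.number == "pl" or first.gender == second.gender


# Sparse table: only filled cells are stored, keyed by (row, column).
AGREEMENT_TABLE: Dict[Tuple[int, int], List[Rule]] = {
    (ADJ, NOUN): [same_gender_and_number],
}


def agrees(first: Word, second: Word) -> bool:
    # Steps 1 and 2: the parts of speech select the row and the column.
    rules = AGREEMENT_TABLE.get((first.pos, second.pos), [])
    # Step 3: the agreement is right if some rule of the cell is satisfied.
    return any(rule(first, second) for rule in rules)
```

With a dictionary the cell is found in constant time, so the cost of a check is dominated by the length of the rule list in that cell.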

A general approach and realisation

The automatic checking of agreement is possible for neighbouring words. For every word, information is extracted from the base morphological dictionary and compared with the information stored in the rules of the table of agreement. The morphological analysis yields an inflectional type number and a word-form number for every word. For a word pair the following sequence is obtained:

f1, g1, f2, g2,

where f1, f2 are the inflectional type numbers of the words and g1, g2 are the word-form numbers of the words.

Let Ai (i=1..10) be the set of all possible pairs (f,g) related to the i-th part of speech (e.g. if i=1 the part of speech is a noun, if i=2 an adjective, etc.; if the word is not inflecting then g=1):

Ai = { (f,g) : f ∈ Ti, g ∈ Si }, i=1..10

where Ti consists of the inflectional type numbers of the i-th part of speech and Si comprises the word-form numbers of the i-th part of speech.

We define a binary function Agr : Ai x Ai --> { TRUE, FALSE },

where Ai x Ai is the Cartesian product of the set Ai with itself. The cell (i,j) of the agreement table consists of a list of all pairs of pairs ((fi,gi),(fj,gj)) for which

Agr((fi,gi),(fj,gj)) = TRUE

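For example the sequence

f1  g1  f2  g2
82   8  48   1

means that the first word has inflectional type 82 and word-form 8, the second word has inflectional type 48 and word-form 1 (according to the rule list given later in this section, an adjective in the feminine singular form with the definite article followed by a feminine singular noun), and the agreement between two neighbouring words with these features is right.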
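As a small illustration, the function Agr can be read as a membership test in the list stored in a cell. The sketch below is an assumption-laden reading, not the authors' implementation: the names `TABLE` and `agr` and the string keys "adj" and "noun" are hypothetical, and the cell holds only the single sequence from the example, so every other input yields FALSE here regardless of its linguistic status.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]            # (f, g): inflectional type, word-form number
Cell = Set[Tuple[Pair, Pair]]     # all pairs of pairs for which Agr is TRUE

# Hypothetical table with one cell that stores the example 82 8 48 1.
TABLE: Dict[Tuple[str, str], Cell] = {
    ("adj", "noun"): {((82, 8), (48, 1))},
}


def agr(row: str, column: str, first: Pair, second: Pair) -> bool:
    # Agr is TRUE exactly when the pair of pairs is listed in the cell.
    return (first, second) in TABLE.get((row, column), set())
```

Storing every admissible combination explicitly keeps the test trivial but makes the lists long, which motivates a more compact representation of the rules.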

The agreement table can be represented as a matrix in which every cell comprises a pointer to a list of right agreement rules, presented through sequences like the one above. These lists can be shortened if the rules use not single inflectional type numbers but sets of these numbers, related to the parts of speech. Because these sets consist of ordered numbers, the first and the last numbers belonging to a set fully represent it. In this way every rule can be described as the following sequence:

First fi1, Last fi1, gi1, First fj2, Last fj2, gj2,

where First fi1 and Last fi1 are the first and the last inflectional type numbers of the set for the first word (the i-th part of speech), gi1 is its word-form number, and First fj2, Last fj2 and gj2 are the corresponding values for the second word (the j-th part of speech).

Finally, we can present the agreement table as a matrix of pointers to lists of rules consisting of six-member sequences.

For example the marked cell (2,4) from the agreement table is fully described as follows:

First word                     Second word    First f21  Last f21  g21  First f42  Last f42  g42
masc., sing.                   masc., sing.   76         89        1    1          40        1
masc., sing., sh. def. art.    masc., sing.   76         89        5    1          40        1
masc., sing., full def. art.   masc., sing.   76         89        6    1          40        1
fem., sing.                    fem., sing.    76         89        3    41         53        1
fem., sing., def. art.         fem., sing.    76         89        8    41         53        1
neut., sing.                   neut., sing.   76         89        4    54         73        1
neut., sing., def. art.        neut., sing.   76         89        9    54         73        1
pl.                            pl.            76         89        2    1          75        4
pl., def. art.                 pl.            76         89        7    1          75        4
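A six-member rule can be sketched as a small record with a range test. The class name `Rule`, its field names and the method `matches` are hypothetical; the bounds are compared inclusively on the assumption that First and Last are themselves members of the set they delimit.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    first_f1: int   # first inflectional type number allowed for the first word
    last_f1: int    # last inflectional type number allowed for the first word
    g1: int         # required word-form number of the first word
    first_f2: int   # first inflectional type number allowed for the second word
    last_f2: int    # last inflectional type number allowed for the second word
    g2: int         # required word-form number of the second word

    def matches(self, f1: int, g1: int, f2: int, g2: int) -> bool:
        # Both inflectional types must lie in their ranges (bounds included)
        # and both word-form numbers must equal the ones fixed by the rule.
        return (self.first_f1 <= f1 <= self.last_f1 and g1 == self.g1
                and self.first_f2 <= f2 <= self.last_f2 and g2 == self.g2)


# One row of the table above: fem., sing., def. art. + fem., sing.
fem_def = Rule(76, 89, 8, 41, 53, 1)
print(fem_def.matches(82, 8, 48, 1))   # the example sequence 82 8 48 1
```

If every number in both ranges is in use, this single rule stands for up to 14 x 13 = 182 explicit four-member sequences.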

An algorithm for the agreement checking

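For a word pair with the sequence f1, g1, f2, g2, a rule

First fi1, Last fi1, gi1, First fj2, Last fj2, gj2

is looked up in the corresponding cell of the agreement table, where:

  1. First fi1 <= f1 <= Last fi1 AND g1 = gi1 AND
  2. First fj2 <= f2 <= Last fj2 AND g2 = gj2

If such a rule is found, the agreement is right. Otherwise a rule that is satisfied only in part shows which word-form is wrong:

  1. If only First fi1 <= f1 <= Last fi1 AND g1 = gi1 holds, g2 is wrong. It has to get the value of gj2.
  2. If only First fj2 <= f2 <= Last fj2 AND g2 = gj2 holds, g1 is wrong. It has to get the value of gi1.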
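The lookup and the two correction cases can be sketched as follows, using the nine rules listed for the adjective and noun cell. The names `ADJ_NOUN_RULES` and `check` are hypothetical. The sketch additionally assumes that a correction is proposed only when both inflectional type numbers fall into the ranges of the rule; the text does not state this explicitly, but without it a rule for a different gender could suggest a value the word already has.

```python
from typing import List, Tuple

# (First f1, Last f1, g1, First f2, Last f2, g2)
Rule = Tuple[int, int, int, int, int, int]

# The nine rules of the adjective + noun cell, as listed in the text.
ADJ_NOUN_RULES: List[Rule] = [
    (76, 89, 1, 1, 40, 1),
    (76, 89, 5, 1, 40, 1),
    (76, 89, 6, 1, 40, 1),
    (76, 89, 3, 41, 53, 1),
    (76, 89, 8, 41, 53, 1),
    (76, 89, 4, 54, 73, 1),
    (76, 89, 9, 54, 73, 1),
    (76, 89, 2, 1, 75, 4),
    (76, 89, 7, 1, 75, 4),
]


def check(f1: int, g1: int, f2: int, g2: int,
          rules: List[Rule]) -> Tuple[bool, List[Tuple[str, int]]]:
    """Return (agreement is right, suggested corrections)."""
    suggestions: List[Tuple[str, int]] = []
    for lo1, hi1, rg1, lo2, hi2, rg2 in rules:
        # Assumption of this sketch: skip rules whose ranges do not cover
        # both inflectional type numbers.
        if not (lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2):
            continue
        if g1 == rg1 and g2 == rg2:
            return True, []                  # both conditions hold
        if g1 == rg1:
            suggestions.append(("g2", rg2))  # case 1: g2 is wrong
        elif g2 == rg2:
            suggestions.append(("g1", rg1))  # case 2: g1 is wrong
    return False, suggestions
```

For instance, `check(82, 1, 48, 1, ADJ_NOUN_RULES)` reports wrong agreement and proposes the word-forms 3 and 8 for the first word, that is, the two feminine singular forms of the adjective.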

