The ALEP platform provides a Text Handling component, which is concerned with the SGML markup of input texts. After this pre-processing stage, the input for the parser is (simplified) as follows:
<P>
<S>
<W>John</W>
<W>loves</W>
<W>Mary</W>
<PT>.</PT>
</S>
</P>
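As a rough illustration of what this markup offers to later processing steps, the following Python sketch reads the simplified example and lists its word and punctuation tokens. It assumes that the simplified markup is also well-formed XML, which need not hold for real SGML output of the Text Handling component; the name read_tokens is hypothetical and not part of ALEP.

```python
import xml.etree.ElementTree as ET

# Simplified Text Handling output, copied from the example above.
MARKED_TEXT = """<P>
<S>
<W>John</W>
<W>loves</W>
<W>Mary</W>
<PT>.</PT>
</S>
</P>"""


def read_tokens(markup):
    """Return (tag, text) pairs for all <W> and <PT> elements, in document order.

    Assumption: the markup is well-formed XML, so the standard XML parser can
    be used. Real SGML (omitted end tags, etc.) would need an SGML parser.
    """
    root = ET.fromstring(markup)
    return [(el.tag, el.text) for el in root.iter() if el.tag in ("W", "PT")]


tokens = read_tokens(MARKED_TEXT)
for tag, text in tokens:
    print(tag, text)
```

The paragraph and sentence elements are only used as containers here; the tokens are what the mapping rules described next operate on.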
Before the marked-up text is processed by the parser, however, an intermediate step is defined, consisting of specialized mapping rules which associate the SGML-marked texts with linguistic descriptions formulated by the grammar writers. The importance of this mechanism for the efficient processing of grammar descriptions in ALEP has been discussed, among others, in [Declerck and Maas (1997)] and [Theofilidis 1997]. For the default case, rules of this kind, the so-called tsls-rules (Text Structures to Linguistic Structures), can have the form displayed in figure 1,
Figure 1: Correspondence established between the partial linguistic description and the <W> markup of text
where a correspondence is established between a certain text structure (the words) and a class of linguistic descriptions (the type ld). It is also possible to enrich the list of features associated with the SGML tag (here the <W> markup), which in figure 1 is empty.
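The default correspondence can be pictured with a small Python sketch. It is not ALEP's rule formalism: the class TslsRule, its fields and the dictionary used for the linguistic description are hypothetical names chosen for illustration. Only the type name ld and the <W> tag come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TslsRule:
    """Hypothetical stand-in for a tsls-rule: SGML tag -> linguistic description."""
    tag: str                                      # SGML tag the rule applies to
    features: dict = field(default_factory=dict)  # required tag features (empty by default)
    ld_type: str = "ld"                           # class of linguistic descriptions

    def matches(self, tag, attrs):
        # The rule fires if the tag agrees and every required feature is present.
        return tag == self.tag and all(
            attrs.get(name) == value for name, value in self.features.items()
        )

    def apply(self, tag, attrs, text):
        # Return a partial linguistic description, or None if the rule does not apply.
        if not self.matches(tag, attrs):
            return None
        return {"type": self.ld_type, "string": text}


# Default case: the feature list associated with <W> is empty.
default_rule = TslsRule(tag="W")
description = default_rule.apply("W", {}, "John")
print(description)
```

With an empty feature list the rule applies to every <W> element, which corresponds to the default case described above; a non-empty feature list restricts it to tokens carrying those features.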
For example, we wrote small taggers for the recognition of messy details and fixed phrases. On the basis of the output of those taggers, enriched tsls-rules can be written. Suppose, for example, that the tagger recognizes and marks currency expressions as follows:
<W TYPE="CURRENCY_MEASURE" ORIG="Dreiundvierzig Millionen Dollar">
Dreiundvierzig_Millionen_Dollar</W>
The corresponding tsls-rule then looks like the one shown in figure 2,
Figure 2: Correspondence established between the partial linguistic description and the enriched <W> markup of text
where a value-sharing between a text feature and a feature of the linguistic description is defined. The lexicon entry referred to by the linguistic description has the particularity that it is not accessed by its realization, but by the class of words it belongs to (see the value-sharings). This is possible because ALEP supports the definition of generic lexicon entries, as shown in figure 3. Generic entries make it possible to include in the ALEP lexicon, in a very compact manner, classes of expressions which usually cause serious problems for the coverage of the grammar and for the performance of the parser.
Figure 3: Generic lexical entry for currency expressions
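The idea of accessing a lexicon entry by word class rather than by surface form can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary GENERIC_LEXICON, its placeholder content and the function apply_enriched_rule are hypothetical and do not reproduce the ALEP rule or entry of figures 2 and 3. The TYPE and ORIG attributes and their values are taken from the currency example above.

```python
# Hypothetical generic lexicon: one entry per class of expressions,
# keyed by the class name rather than by any surface string.
# The entry content is a placeholder, not the ALEP entry of figure 3.
GENERIC_LEXICON = {
    "CURRENCY_MEASURE": {"entry": "generic_currency_expression"},
}


def apply_enriched_rule(attrs, text, lexicon):
    """Map an enriched <W> token to a partial linguistic description.

    The TYPE feature of the tag is shared with the description and is the
    only key used for lexicon access; the surface form is merely carried along.
    """
    word_class = attrs.get("TYPE")
    entry = lexicon.get(word_class)
    if entry is None:
        return None  # no generic entry for this class
    return {
        "type": "ld",
        "class": word_class,                 # value shared with the tag feature
        "string": attrs.get("ORIG", text),   # original wording, not a lookup key
        "lex": entry,
    }


token_attrs = {
    "TYPE": "CURRENCY_MEASURE",
    "ORIG": "Dreiundvierzig Millionen Dollar",
}
result = apply_enriched_rule(token_attrs, "Dreiundvierzig_Millionen_Dollar", GENERIC_LEXICON)
print(result)
```

Because the lookup key is the class, any other currency expression marked by the tagger would reach the same single entry, which is what keeps the lexicon compact.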
This strategy has been extended to the output of a PoS-tagger. Here again, the information delivered by an external tool has been integrated into the grammar processing of ALEP via tsls-rules. So if, for example, the following PoS information is delivered by a tagger,
STRING,CAT,STEM
Grosse,a,gross
Bereiche,n,bereich
...
this information can be integrated into the grammar processing via the tsls-rule displayed in figure 4.
Figure 4: Value-sharing of a feature of the text structure tag <W> and the CAT feature of the linguistic description
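A minimal Python sketch of this integration step is given below. It assumes that the tagger output is available as comma-separated text in exactly the STRING,CAT,STEM layout shown above; the function name descriptions_from_tagger and the dictionary representation of the linguistic description are hypothetical. Only the two rows given in the text are used, since the remaining rows are elided there.

```python
import csv
import io

# The two tagger rows quoted in the text; further rows are elided in the source.
TAGGER_OUTPUT = """STRING,CAT,STEM
Grosse,a,gross
Bereiche,n,bereich
"""


def descriptions_from_tagger(output):
    """Turn tagger rows into partial linguistic descriptions.

    The CAT value delivered by the tagger is copied into the 'cat' feature
    of the description, mimicking the value-sharing of figure 4.
    """
    rows = csv.DictReader(io.StringIO(output))
    return [
        {"type": "ld", "string": row["STRING"], "cat": row["CAT"], "stem": row["STEM"]}
        for row in rows
    ]


descriptions = descriptions_from_tagger(TAGGER_OUTPUT)
for d in descriptions:
    print(d["string"], d["cat"], d["stem"])
```

The point of the sketch is that the category is already instantiated in the description before any parsing starts.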
The importance of this additional information can be seen in the improved processing times of the parsers of the ALEP platform. For the sentence ``Große Bereiche der Dasa leiden unter dem Rückgang des einst lukrativen Rüstungsgeschäfts.'' (Large areas of the DASA are suffering from the decline of the once lucrative arms trade.) the parsing time was 40.550 and 6.260 CPU for the basic and the record parser of ALEP, respectively, without the integration of PoS information, and 13.820 and 4.850 with the integration of such information, since a lot of information is instantiated before the parsers start their job.
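The size of the improvement can be made explicit with a short calculation. The Python sketch below uses only the four timings reported above; the function names are chosen for illustration, and the result does not depend on the unit in which the CPU time is expressed.

```python
# Parsing times reported in the text, in the CPU units used there.
TIMES = {
    "basic": {"without_pos": 40.550, "with_pos": 13.820},
    "record": {"without_pos": 6.260, "with_pos": 4.850},
}


def speedup(without_pos, with_pos):
    """Factor by which parsing is faster once PoS information is integrated."""
    return without_pos / with_pos


def reduction_percent(without_pos, with_pos):
    """Share of the original parsing time that is saved, in percent."""
    return 100.0 * (without_pos - with_pos) / without_pos


summary = {
    name: (speedup(**t), reduction_percent(**t)) for name, t in TIMES.items()
}
for name, (factor, percent) in summary.items():
    print(f"{name}: {factor:.2f}x faster, {percent:.1f}% less parsing time")
```

For this single sentence the basic parser is thus roughly 2.9 times faster (about 66% less time), while the record parser gains a factor of roughly 1.3 (about 23% less time).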
The improvement in performance is partly due to the fact that, after the segmentation of the input words has been achieved, the grammar is first concerned with the reconstruction of the words, enriched with linguistic information contained in the morpheme lexicon of the grammar. For this process, affixes have been described as the parsing heads. Since affixes (in German) are highly ambiguous (for example the affix en of the word leiden above), there are several entries for some of them in the morpheme lexicon, which implies that the process of word construction runs several times, even when this is not necessary. But once the PoS is known before the process of word construction starts, the homograph affixes that do not correspond to the PoS are no longer considered, which reduces the search space for the parser. The same holds for ambiguities at the word level.
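The reduction of the search space can be illustrated with a toy Python example. The morpheme lexicon below is invented for illustration and is not the ALEP morpheme lexicon: the three readings of the affix en, the category labels and the function candidate_affix_entries are assumptions made for this sketch.

```python
# Toy morpheme lexicon (hypothetical): several homograph entries for "en".
MORPHEME_LEXICON = {
    "en": [
        {"cat": "v", "function": "verbal ending"},
        {"cat": "n", "function": "nominal ending"},
        {"cat": "a", "function": "adjectival ending"},
    ],
}


def candidate_affix_entries(affix, lexicon, pos=None):
    """Return the affix entries that word construction has to try.

    Without PoS information every homograph entry is a candidate; with a
    known PoS only the entries of the matching category remain.
    """
    entries = lexicon.get(affix, [])
    if pos is None:
        return entries
    return [entry for entry in entries if entry["cat"] == pos]


without_pos = candidate_affix_entries("en", MORPHEME_LEXICON)
with_pos = candidate_affix_entries("en", MORPHEME_LEXICON, pos="v")
print(len(without_pos), "candidate entries without PoS information")
print(len(with_pos), "candidate entry with PoS = v")
```

Since each remaining entry acts as a parsing head for one attempt at word construction, filtering the entries by PoS directly lowers the number of attempts in this toy setting.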
Our strategy will therefore be profitable wherever the input words are built with ambiguous affixes. In any case, the basic parser will be (considerably) faster, whereas the record parser (which already shows very good performance) will improve only when there are really strong ambiguities at both the morpheme and the word level.
We also expect another important improvement in performance once a fast external morphological analysis has been integrated, thus relieving the ALEP parser of the task of building word units out of the results of the current Two-Level component of ALEP.