
The Integration of LE Tools into the ALEP Platform

The ALEP platform provides a Text Handling component, which is concerned with the SGML markup of input texts. After this pre-processing stage, the input for the parser is (simplified) as follows:

       <P>
         <S>
           <W>John</W>
           <W>loves</W>
           <W>Mary</W>
           <PT>.</PT>
         </S>
       </P>
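
As a rough illustration of what this marked-up input provides to later stages, the following Python sketch reads the simplified sample and lists its word and punctuation tokens. It is not part of ALEP: the function name `read_tokens` is hypothetical, and the sketch assumes that the simplified markup is well-formed enough to be read with a standard XML parser.

```python
import xml.etree.ElementTree as ET

# The simplified Text Handling output shown above.
SAMPLE = """
<P>
  <S>
    <W>John</W>
    <W>loves</W>
    <W>Mary</W>
    <PT>.</PT>
  </S>
</P>
"""


def read_tokens(markup):
    """Return (tag, text, attributes) for every <W> and <PT> element, in document order.

    Hypothetical helper for illustration only; ALEP has its own text handling.
    """
    root = ET.fromstring(markup.strip())
    tokens = []
    for elem in root.iter():
        if elem.tag in ("W", "PT"):
            tokens.append((elem.tag, (elem.text or "").strip(), dict(elem.attrib)))
    return tokens


tokens = read_tokens(SAMPLE)
for tag, text, attrs in tokens:
    print(tag, text, attrs)
```

The attribute dictionary is empty for every token here, which corresponds to the default case in which no further features are attached to the tags.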

Before the input is processed by the parser, however, an intermediate step is defined. It consists of specialized mapping rules which associate the SGML-marked texts with linguistic descriptions formulated by the grammar writers. The importance of this mechanism for the efficient processing of grammar descriptions in ALEP has been discussed, among others, in [Declerck and Maas (1997)] and [Theofilidis 1997]. In the default case, rules of this kind, the so-called tsls-rules (Text Structures to Linguistic Structures), can have the form displayed in figure 1,

 
Figure 1:  Correspondence established between the partial linguistic description and the <W> markup of text

where a correspondence is established between a certain text structure (the words) and a class of linguistic descriptions (the type ld). It is also possible to enrich the list of features associated with the SGML tag (here the <W> markup), which is empty in figure 1. For example, we wrote small taggers for the recognition of messy details and fixed phrases. On the basis of the output of those taggers, enriched tsls-rules can be written. Suppose, for example, that the tagger recognizes and marks currency expressions:

 
      <W TYPE="CURRENCY_MEASURE" ORIG="Dreiundvierzig Millionen Dollar">
         Dreiundvierzig_Millionen_Dollar</W>

The corresponding tsls-rule then has the form shown in figure 2,

 
Figure 2:  Correspondence established between the partial linguistic description and the enriched <W> markup of text

where a value-sharing between a text feature and a feature of the linguistic description is defined. The lexicon entry referred to by the linguistic description has the particularity that it is accessed not by its realization but by the class of words it belongs to (see the value-sharings). This is possible because ALEP supports the definition of generic lexicon entries, as shown in figure 3. Generic entries make it possible to include in the ALEP lexicon, in a very compact manner, classes of expressions which usually cause serious problems for the coverage of the grammar and the performance of the parser.

 
Figure 3:  Generic lexical entry for currency expressions
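
The idea of accessing a lexicon entry by word class rather than by realization can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ALEP rule formalism: the dictionary `GENERIC_LEXICON`, its placeholder feature values, and the functions `tsls_map` and `lookup` are all hypothetical, and linguistic descriptions are reduced to plain dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical generic lexicon: one entry per class of expressions,
# keyed by the class name instead of by any surface form.
# The feature values are placeholders, not actual ALEP lexicon content.
GENERIC_LEXICON = {
    "CURRENCY_MEASURE": {"entry": "generic_currency_expression"},
}


def tsls_map(w_elem):
    """Map one <W> element to a partial linguistic description (a dict here)."""
    ld = {"string": (w_elem.text or "").strip()}
    word_type = w_elem.get("TYPE")
    if word_type is not None:
        # Stand-in for value-sharing: the TYPE feature of the tag
        # reappears as a feature of the linguistic description.
        ld["type"] = word_type
    return ld


def lookup(ld, lexicon=GENERIC_LEXICON):
    """Return the description merged with its generic entry, or None.

    None signals that no class is known, so an ordinary lookup by
    realization would be needed instead.
    """
    entry = lexicon.get(ld.get("type"))
    if entry is None:
        return None
    return {**ld, **entry}


marked = ET.fromstring(
    '<W TYPE="CURRENCY_MEASURE" ORIG="Dreiundvierzig Millionen Dollar">'
    "Dreiundvierzig_Millionen_Dollar</W>"
)
analysis = lookup(tsls_map(marked))
plain = lookup(tsls_map(ET.fromstring("<W>John</W>")))
print(analysis)
print(plain)
```

In this sketch a single generic entry covers every expression the tagger marks as CURRENCY_MEASURE, which is the compactness that generic entries are meant to provide.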

This strategy has been extended to the output of a PoS-tagger. Here again, the information delivered by an external tool has been integrated into the grammar processing of ALEP via some tsls-rules. So if, for example, the following PoS information is delivered by a tagger,

       STRING,CAT,STEM
       Grosse,a,gross
       Bereiche,n,bereich
       ...

this information can be integrated into the grammar processing via the tsls-rule displayed in figure 4.

 
Figure 4:  Value-sharing of a feature of the text structure tag <W> and the CAT feature of the linguistic description
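
A minimal sketch of this integration step, written in Python rather than in the ALEP formalism, might look as follows. The names `read_tagger_output` and `tsls_with_cat` are hypothetical, only the two tagger rows given above are used, and the linguistic description is again reduced to a plain dictionary.

```python
import csv
import io

# The two rows of tagger output given in the text (the elided rows are not guessed).
TAGGER_OUTPUT = """STRING,CAT,STEM
Grosse,a,gross
Bereiche,n,bereich
"""


def read_tagger_output(text):
    """Parse the comma-separated tagger output into one dict per word."""
    return list(csv.DictReader(io.StringIO(text)))


def tsls_with_cat(row):
    """Build a partial linguistic description whose cat feature is shared
    with the CAT feature carried by the <W> tag (hypothetical stand-in
    for a tsls-rule with value-sharing)."""
    w_features = {"CAT": row["CAT"], "STEM": row["STEM"]}
    return {"string": row["STRING"], "cat": w_features["CAT"]}


descriptions = [tsls_with_cat(row) for row in read_tagger_output(TAGGER_OUTPUT)]
for ld in descriptions:
    print(ld)
```

The point of the sketch is only that the category is already instantiated in the description before any parsing takes place.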

The importance of this additional information can be seen in the improved processing times of the parsers of the ALEP platform. For the sentence "Große Bereiche der Dasa leiden unter dem Rückgang des einst lukrativen Rüstungsgeschäfts." (Large areas of DASA are suffering from the decline of the once lucrative arms trade.), the parsing time without the integration of PoS information was 40.550 and 6.260 CPU for the basic and the record parser of ALEP respectively. With the integration of such information it was 13.820 and 4.850, a speed-up of roughly 2.9 for the basic parser and 1.3 for the record parser, since a lot of information is already instantiated before the parsers start their job.

The improvement in performance is partly due to the fact that, once the input words have been segmented, the grammar is first concerned with the reconstruction of the words, enriched with the linguistic information contained in the morpheme lexicon of the grammar. For this process, affixes have been described as the parsing heads. Since affixes (in German) are highly ambiguous (for example the affix en of the word leiden above), the morpheme lexicon contains several entries for some of them, so that the process of word construction runs several times, even when this is not necessary. But once the PoS is known before word construction starts, the homograph affixes that do not correspond to the PoS are no longer considered, which reduces the search space for the parser. The same holds for ambiguities at the word level. Our strategy is therefore profitable wherever the input words are built with ambiguous affixes. In any case, the basic parser will be (considerably) faster, whereas the record parser (which already performs very well) will improve only when there are really strong ambiguities at both the morpheme and the word level.
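
The reduction of the search space can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the three homograph entries for the affix en are invented placeholders rather than the actual morpheme lexicon of the grammar, the category label "v" is assumed by analogy with the tagger labels "a" and "n" shown earlier, and `affix_candidates` is a hypothetical function.

```python
# Hypothetical morpheme lexicon with several homograph entries for one affix.
# The entries are placeholders, not the real ALEP morpheme lexicon.
MORPHEME_LEXICON = {
    "en": [
        {"affix": "en", "cat": "v", "reading": "placeholder verbal ending"},
        {"affix": "en", "cat": "n", "reading": "placeholder nominal ending"},
        {"affix": "en", "cat": "a", "reading": "placeholder adjectival ending"},
    ],
}


def affix_candidates(affix, pos=None, lexicon=MORPHEME_LEXICON):
    """Return the affix entries that word construction would have to try.

    Without a PoS every homograph entry is a candidate; with a PoS only
    the entries of the matching category remain.
    """
    entries = lexicon.get(affix, [])
    if pos is None:
        return list(entries)
    return [entry for entry in entries if entry["cat"] == pos]


without_pos = affix_candidates("en")
with_pos = affix_candidates("en", pos="v")
print(len(without_pos), "candidate entries without PoS information")
print(len(with_pos), "candidate entries once the PoS is known")
```

With these placeholder entries, word construction for a form ending in en would be attempted once per entry without PoS information, and only for the matching entry once the PoS is known.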

We also expect another important improvement in performance once a fast external morphological analysis has been integrated, thus relieving the parser of ALEP of the task of building word units out of the results of the current Two-Level component of ALEP.



Thierry Declerck
Sat Sep 6 17:29:07 MET DST 1997