Treebank-Based Rapid Induction of Wide-Coverage Lexical-Functional Grammar Resources

Josef van Genabith

Abstract

Rich or deep (as opposed to "shallow") grammars map text to information and are at the core of many NLP applications.

The last 20 years have seen the development of a number of rich unification/constraint-based computational grammar formalisms, prominent among them Lexical-Functional Grammar (LFG) and Head-Driven Phrase Structure Grammar (HPSG). Developing grammars in these formalisms is a highly knowledge-intensive task, and grammars are typically hand-crafted. Scaling such grammars beyond small fragments to unrestricted, naturally occurring, real text is very time-consuming and expensive, involving, as it does, person-years of expert labour. The situation is familiar from other knowledge-intensive engineering tasks in traditional rule-based, "rationalist" approaches in AI and NLP: it is an instance of the famous knowledge acquisition bottleneck.

At the same time, much recent work in NLP is corpus-based, following what has been referred to as an "empiricist" research tradition: as one example, treebanks are available for an increasing number of languages, and treebank-based, probabilistic grammar induction and parsing is a cutting-edge research paradigm. Such approaches are attractive as they achieve wide coverage, robustness and good performance on 'real' data while incurring very low grammar development cost. With a number of notable exceptions, however, most of the induced grammars are "shallow", i.e. they do not map text to information, and even the exceptions are substantially less detailed than current unification/constraint-based grammars such as LFG and HPSG in the rationalist paradigm.

This situation poses a research question: is it possible to combine rationalist and empiricist research methods to induce rich, wide-coverage unification grammars from treebanks? In this tutorial we show how wide-coverage, probabilistic LFG grammatical resources can be induced from automatically f-structure-annotated treebanks, following [Cahill et al., 2002; Burke et al., 2004; Cahill et al., 2004; O'Donovan et al., 2004].
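
To make the annotation idea concrete, the Python sketch below illustrates the kind of step involved: walk a Penn-Treebank-style constituent tree, pick a head daughter in each local subtree, and assign LFG functional equations to the head and to its left and right context. The Node class, the tiny HEADS table, the three annotation rules and the annotate() helper are illustrative assumptions for this sketch only; the actual annotation algorithms discussed in the tutorial use much richer head-finding rules and left-right context principles.

# Minimal sketch of automatic f-structure annotation of a constituent tree.
# The categories, head table and rules below are simplifying assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str                          # Penn Treebank category, e.g. "S", "NP", "VP"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None          # set on terminal (leaf) nodes only
    annotation: Optional[str] = None    # LFG functional equation, e.g. "up SUBJ = down"

# Hypothetical head table: which daughter category heads each mother category.
HEADS = {"S": "VP", "VP": "VBD", "NP": "NN"}

def annotate(node: Node) -> Node:
    """Assign an f-structure equation to each daughter of `node`, recursively."""
    if node.word is not None:                      # terminal node: nothing below to annotate
        return node
    head_cat = HEADS.get(node.label)
    head_seen = False
    for child in node.children:
        if child.label == head_cat and not head_seen:
            child.annotation = "up = down"         # head daughter shares the mother's f-structure
            head_seen = True
        elif node.label == "S" and child.label == "NP" and not head_seen:
            child.annotation = "up SUBJ = down"    # NP left of the head of S -> subject
        elif node.label == "VP" and child.label == "NP" and head_seen:
            child.annotation = "up OBJ = down"     # NP right of the head of VP -> object
        annotate(child)
    return node

# Toy tree for "John saw Mary".
tree = Node("S", [
    Node("NP", [Node("NN", word="John")]),
    Node("VP", [Node("VBD", word="saw"),
                Node("NP", [Node("NN", word="Mary")])]),
])
annotate(tree)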

We will outline f-structure annotation algorithms and the extraction of grammatical and lexical resources, and present parsing results on the WSJ section of the Penn-II Treebank that are equal to or better than those achieved by the best state-of-the-art hand-crafted grammars. We relate our approach to traditional, manual grammar development. We outline how the method can be applied to rapid, multilingual LFG grammar induction for German, Spanish and Chinese, and how it can be used to bootstrap treebank construction. We also compare our approach to similar treebank-based approaches for Combinatory Categorial Grammar [Hockenmaier and Steedman, 2002] and Head-Driven Phrase Structure Grammar [Miyao et al., 2004]. Time permitting, we will demo some of the LFG systems.
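
As a complementary illustration of the lexical-resource side, the following sketch reads subcategorisation frames ("semantic forms") off resolved f-structures: for each local PRED, it records which governable grammatical functions occur alongside it. The nested-dictionary encoding of f-structures and the GOVERNABLE set are simplifying assumptions for this sketch, not the tutorial's actual extraction software.

# Minimal sketch of extracting subcategorisation frames from f-structures
# encoded as nested dictionaries (an assumed representation).

GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}

def extract_frames(fstructure: dict, frames=None) -> dict:
    """Collect PRED -> set of subcat frames from a nested dict f-structure."""
    if frames is None:
        frames = {}
    pred = fstructure.get("PRED")
    if pred is not None:
        local_functions = tuple(sorted(f for f in fstructure if f in GOVERNABLE))
        frames.setdefault(pred, set()).add(local_functions)
    for value in fstructure.values():
        if isinstance(value, dict):                # recurse into embedded f-structures
            extract_frames(value, frames)
    return frames

# Toy f-structure for "John saw Mary."
fs = {
    "PRED": "see",
    "TENSE": "past",
    "SUBJ": {"PRED": "John", "NUM": "sg"},
    "OBJ": {"PRED": "Mary", "NUM": "sg"},
}
print(extract_frames(fs))   # {'see': {('OBJ', 'SUBJ')}, 'John': {()}, 'Mary': {()}}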

Prerequisites: working familiarity with Lexical-Functional Grammar (or a similar constraint-based formalism).

References
