Strong Domain Variation and Treebank-Induced LFG Resources

John Judge, Michael Burke, Aoife Cahill, Ruth O'Donovan, Josef van Genabith, and Andy Way

Abstract

Proceedings of LFG05; CSLI Publications On-line

Probabilistic, treebank-based parsing resources (Collins 1999; Charniak 2000; Bikel 2002) are of high quality and can be rapidly induced from appropriate treebank material. However, treebank- and machine learning-based grammatical resources reflect the characteristics of their training data and generally underperform on test data that differs substantially from it. In this paper we investigate the effects of strong domain variation on the treebank-induced, ``deep'', probabilistic Lexical-Functional Grammar (LFG) resources of Cahill et al. (2004) and show how these resources can be adapted to handle strong domain variation. In our experiments, we use the Wall Street Journal (WSJ) newspaper sections of the Penn-II treebank (Marcus 1994) and the ATIS corpus (Hemphill 1990), a resource of transcribed spoken-language airline reservation dialogues. For the Penn-II trained LFG resources of Cahill et al. (2004), the domain change from Penn-II WSJ to ATIS results in a markedly stronger drop in performance, on both the trees and the f-structures, than the drop observed by Gildea (2001) in the Penn-II WSJ vs. Brown domain variation experiments with Collins' (1997) parser.

This poses a research question: is the observed performance drop of the LFG resources of Cahill et al. (2004) due to a decrease in the quality of c-structure parsing, to a lack of coverage in the f-structure annotation algorithm (ibid.), or to both? We report on experiments to answer this question. The main result is that, while the Penn-II trained c-structure component of Cahill et al. (2004) requires retraining, the f-structure annotation algorithm (originally designed for Penn-II WSJ data) requires no changes or extensions. The linguistic information encoded in the f-structure annotation algorithm is already complete with respect to strong domain variation as exemplified by the Penn-II WSJ and ATIS corpora. This is surprising, as Penn-II WSJ data represents a markedly different text domain from that of ATIS. A possible explanation is that, compared to c-structure, f-structure is a more abstract and ``normalised'' level of representation in the LFG architecture and is therefore less affected by domain variation.