Syntactic parsing

Parsing is the process of identifying the syntactic structure of a given sentence. A natural language parser is computer software that automatically performs parsing and outputs the structural description of a given character string relative to a given grammar. A grammar specifies how each sentence is constructed from parts. The output of a parser is a parse representing the structure of the analyzed language fragment.
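To make the idea concrete, below is a minimal sketch of what a parser does: given a grammar and a sentence, it produces a bracketed structural description. The toy grammar, lexicon, and sentence are invented for illustration; real parsers use far larger grammars or statistical models.

```python
# Toy context-free grammar and lexicon (illustrative only).
TOY_GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "chased": "V",
}

def parse(symbol, words, pos):
    """Top-down parse of `words` starting at index `pos`.
    Returns (bracketed tree, next position) or None on failure."""
    if symbol in LEXICON.values():  # preterminal: match one word
        if pos < len(words) and LEXICON.get(words[pos]) == symbol:
            return f"({symbol} {words[pos]})", pos + 1
        return None
    for rule in TOY_GRAMMAR.get(symbol, []):
        children, p = [], pos
        for child in rule:
            result = parse(child, words, p)
            if result is None:
                break
            tree, p = result
            children.append(tree)
        else:  # every child of this rule parsed successfully
            return f"({symbol} {' '.join(children)})", p
    return None

words = "the dog chased a cat".split()
tree, end = parse("S", words, 0)
print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N cat))))
```

The bracketed string printed at the end is one common way to represent a parse: each pair of parentheses corresponds to a constituent licensed by a grammar rule.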

The amount of available textual information has grown explosively in recent decades, largely because of the Internet. Consequently, there has been an increasing demand for automatic processing of the information conveyed by natural languages.

Parser evaluation

Evaluation has a crucial role in NLP. Evaluation methods and tools are needed to allow developers and users to assess and enhance NLP systems. In recent years, NLP evaluation has attracted increasing interest. It has become clear that in order to advance the current state of the art, standardized, wide-coverage evaluations and system comparisons must be conducted.

Despite its great importance to the development of parsing systems, the task of evaluating the performance of a syntactic parser of natural language is poorly defined. A parser evaluation framework has to address the following four questions:
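One widely used answer to the scoring part of parser evaluation is PARSEVAL-style bracket matching: the constituents of the parser's output tree are compared against those of a gold-standard tree, and precision, recall, and F1 are reported. The sketch below illustrates the computation; the constituent sets are invented examples, whereas real evaluations draw gold trees from treebank annotations.

```python
def f1_scores(gold, predicted):
    """Bracket precision, recall, and F1 over sets of constituents."""
    correct = len(gold & predicted)
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Constituents as (label, start index, end index) spans.
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
pred = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}

p, r, f = f1_scores(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.75 R=0.75 F1=0.75
```

Here the parser mislabels one span (PP instead of NP), so three of its four brackets are correct, giving 0.75 for all three scores.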

RobSet (download)

RobSet is a robustness evaluation test set for English. It consists of 1,362 test items, each with one to three misspelled words.

As the level of noise in the input increases, the performance of a parser degrades. The extent to which this occurs can be measured by increasing the number of errors in the input sentences and observing the effect on the parser's performance. Such evaluations can be carried out by using RobSet as a source of pairs of correct and erroneous sentences.
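A sketch of how such correct/erroneous pairs could be used: parse both versions of each sentence and report how often the parser still produces the same analysis. The `toy_parse` function and the test pairs below are placeholders, not the actual RobSet data or format.

```python
def robustness(pairs, parse_fn):
    """Fraction of (correct, noisy) pairs for which the noisy sentence
    receives the same analysis as its correct counterpart."""
    same = sum(parse_fn(correct) == parse_fn(noisy)
               for correct, noisy in pairs)
    return same / len(pairs)

# Stand-in "parser": tags each word with a trivial suffix heuristic,
# so a misspelling can change the analysis.
def toy_parse(sentence):
    return ["V" if w.endswith("ed") else "N" for w in sentence.split()]

pairs = [
    ("the dog chased a cat", "the dog chasd a cat"),  # analysis changes
    ("the cat slept", "teh cat slept"),               # analysis unchanged
]
print(robustness(pairs, toy_parse))  # 0.5
```

With a real parser, the same loop would quantify how gracefully its output degrades as misspellings are introduced.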

RobSet is free of charge for research purposes. If you use RobSet in your work, please cite the following:
Kakkonen, T.: Developing Parser Evaluation Resources for English and Finnish. Proceedings of the 3rd Baltic Conference on Human Language Technologies. Kaunas, Lithuania, 2007.