qats2016 by qats2016

Motivation

By organising a shared task, we wish to bring together researchers working on automatic evaluation and quality estimation of machine translation (MT) output and invite them to adapt their metrics to the closely related task of automatic text simplification (ATS). This would provide an opportunity to establish metrics for automatic evaluation of ATS systems, enable direct comparison of systems in terms of the quality of their output, and make the assessment of each ATS system less time-consuming.

Description

The shared task consists in developing a method/metric for automatically assessing the quality of automatically simplified English sentences. Human evaluators have assigned each sentence to one of the following three classes: "good", "ok" and "bad". These labels will be used as reference classes.

Ideally, the metric should work as a classifier which automatically assigns each sentence to one of the three classes. Alternatively, it is also possible to submit the raw scores of the metric for each sentence.

Data sets

The provided original sentences are from the news domain and Wikipedia. Their corresponding automatically simplified sentences were obtained by various automatic text simplification systems and thus cover different simplification phenomena (only lexical simplification, only syntactic simplification, a mixture of lexical and syntactic simplification, content reduction, etc.).

Human evaluators assigned each simplified sentence one of the three classes ("good", "ok", "bad") for each of the following four aspects:

Grammaticality (bad - ungrammatical, OK - somewhat ungrammatical but the mistakes do not impede understanding, good - completely grammatically correct)
Meaning preservation (bad - no meaning at all or completely opposite meaning from the original, OK - somewhat changed nuance of meaning or missing an unimportant part, good - preserved original meaning)
Simplicity (bad - difficult to understand, OK - somewhat difficult to understand, good - easy to understand)
Overall - an overall score that combines the previous three scores and penalises problems with meaning preservation and simplicity more heavily than problems with grammaticality. It should help to decide whether the automatically simplified sentence is ready to be presented to a final user (good), needs post-editing (OK), or should be discarded and either simplified using a different technique or left in its original form (bad).

The training set can be downloaded here:

training set

It consists of 505 sentence pairs and four human scores: original sentence (Original), simplified sentence (Simplified), Grammaticality score (G), Meaning preservation score (M), Simplicity score (S), and Overall score (Overall).
The Grammaticality score (G) and Simplicity score (S) are assigned only on the basis of the simplified sentence (Simplified).
The Meaning preservation score (M) and the Overall score (Overall) take into account both the original sentence (Original) and its automatically simplified version (Simplified).
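
The field layout described above could be read along the following lines. This is only a sketch under stated assumptions: the page does not specify the file format, so the tab-separated layout and the header row with the column names Original, Simplified, G, M, S and Overall are assumptions, the function names `read_pairs` and `label_counts` are hypothetical, and the two sample rows are invented for illustration rather than taken from the training set.

```python
import csv
import io
from collections import Counter

ASPECTS = ("G", "M", "S", "Overall")
LABELS = {"good", "ok", "bad"}


def read_pairs(stream):
    """Read sentence pairs from a tab-separated stream with a header row.

    Assumption: columns are named Original, Simplified, G, M, S, Overall.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        for aspect in ASPECTS:
            # Normalise case so that "OK" and "ok" are treated alike.
            label = row[aspect].strip().lower()
            if label not in LABELS:
                raise ValueError(f"unexpected label {row[aspect]!r} for {aspect}")
            row[aspect] = label
        pairs.append(row)
    return pairs


def label_counts(pairs, aspect):
    """Count how often each class occurs for one aspect."""
    return Counter(pair[aspect] for pair in pairs)


if __name__ == "__main__":
    # Invented rows, used only to show the assumed layout.
    sample = (
        "Original\tSimplified\tG\tM\tS\tOverall\n"
        "The feline reposed.\tThe cat rested.\tgood\tgood\tgood\tgood\n"
        "He was inebriated.\tHe drunk was.\tbad\tOK\tok\tbad\n"
    )
    pairs = read_pairs(io.StringIO(sample))
    print(label_counts(pairs, "Overall"))
```

If the real file uses a different delimiter or different column names, only the `delimiter` argument and the `ASPECTS` tuple would need to change.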

The test set can be downloaded here:

test set

Human evaluation results are available here:

manual annotations for test set

The test set consists of 126 sentence pairs. The proportions of sentences from the different ATS systems are the same in the training and test sets.

Tracks

There are two tracks: constrained and unconstrained. In the constrained track, you are allowed to use only the dataset provided by us (505 sentence pairs for training). In the unconstrained track, you can additionally use any other data for training.

Submission

The test set was released on 20th January 2016.
Participants should submit their metrics on 3rd February 2016 by email to maja.popovic@hu-berlin.de and stajner.sanja@gmail.com.
The submission should be in the following format:

metric-name    sentence-number    class-value    score-value
my-metric      1                  good           1.78446
my-metric      2                  ok             5.89343
...

It is possible to submit both classes and scores.
Participants submitting only one of the value types should assign "x" to the whole column for the other value type.
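
A submission in this format could be generated roughly as follows. This is a minimal sketch, not an official tool: the helper name `format_submission` is hypothetical, and the tab separator, the absence of a header line, and the five decimal places (chosen to match the example above) are assumptions, since the page does not state them explicitly.

```python
def format_submission(metric_name, classes=None, scores=None):
    """Return one submission line per sentence.

    Columns: metric name, sentence number (starting at 1), class, score.
    A missing value type is filled with "x", as the instructions require.
    Assumption: columns are tab-separated and no header line is written.
    """
    if classes is None and scores is None:
        raise ValueError("provide classes, scores, or both")
    if classes is not None and scores is not None and len(classes) != len(scores):
        raise ValueError("classes and scores must have the same length")

    n = len(classes) if classes is not None else len(scores)
    lines = []
    for i in range(n):
        class_value = classes[i] if classes is not None else "x"
        score_value = f"{scores[i]:.5f}" if scores is not None else "x"
        lines.append("\t".join([metric_name, str(i + 1), class_value, score_value]))
    return lines


if __name__ == "__main__":
    # Both value types, using the two example rows from the format above.
    for line in format_submission("my-metric", ["good", "ok"], [1.78446, 5.89343]):
        print(line)
    # Scores only: the class column is filled with "x".
    for line in format_submission("my-metric", scores=[1.78446, 5.89343]):
        print(line)
```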

Each participating team can send at most three different methods/metrics for each of the four aspects (G, M, S, Overall). The metrics will be evaluated separately for each of the four aspects.

Evaluation

For participants submitting classes, the metrics will be evaluated using accuracy (F-score), i.e. the percentage of sentences with a correctly predicted class out of the total number of sentences.
For participants submitting raw scores, the metrics will be evaluated using Pearson's and Spearman's correlation coefficients.
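
The measures named above can be sketched with the standard library alone. This is an illustrative approximation, not the organisers' evaluation script. In particular, the numeric mapping of the reference classes used in the usage example (bad = 1, ok = 2, good = 3) is an assumption made here so that raw scores can be correlated with human labels; the page does not say how this is done. The example scores and labels are invented.

```python
import math


def accuracy(predicted, reference):
    """Fraction of sentences whose predicted class equals the reference class."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def pearson(xs, ys):
    """Pearson's correlation coefficient between two numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's correlation: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Assumed mapping of classes to numbers; not specified by the task.
    class_to_number = {"bad": 1, "ok": 2, "good": 3}
    human = ["good", "ok", "bad", "good"]   # invented reference labels
    predicted = ["good", "bad", "bad", "ok"]  # invented class predictions
    raw_scores = [0.9, 0.4, 0.1, 0.7]         # invented metric scores

    reference_numbers = [class_to_number[label] for label in human]
    print("accuracy:", accuracy(predicted, human))
    print("pearson: ", pearson(raw_scores, reference_numbers))
    print("spearman:", spearman(raw_scores, reference_numbers))
```

Because the human labels take only three values, ties are frequent, which is why the rank function assigns tied items their average rank before Spearman's coefficient is computed.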

Results

Shared task results are available here:

Accuracy, mean absolute error, root mean squared error, weighted F-score:

Pearson's correlation coefficients:

LREC 2016 Workshop & Shared Task on

Quality Assessment for Text Simplification (QATS)

28th May 2016