By organising a shared task, we wish to bring together researchers working on automatic evaluation and quality estimation of machine translation (MT) output and invite them to adapt their metrics to the closely related task of automatic text simplification (ATS). This would provide an opportunity to establish metrics for the automatic evaluation of ATS systems, enabling direct comparison of the quality of their output as well as less time-consuming assessment of each ATS system.
The shared task consists of developing a method or metric for automatically assessing the quality of automatically simplified English sentences. Human evaluators have assigned each sentence to one of the following three classes: "good", "ok" and "bad". These labels will be used as reference classes.
Ideally, the metric should work as a classifier which automatically assigns each sentence to one of the three classes. Alternatively, it is also possible to submit the raw scores of the metric for each sentence.
The provided original sentences are from the news domain and Wikipedia. Their corresponding automatically simplified sentences were obtained by various automatic text simplification systems and thus cover different simplification phenomena (only lexical simplification, only syntactic simplification, a mixture of lexical and syntactic simplification, content reduction, etc.).
For each simplified sentence, human evaluators assigned one of the three classes ("good", "ok", "bad") to each of the following four aspects:
- Grammaticality (bad - ungrammatical, OK - somewhat ungrammatical but the mistakes do not impede understanding, good - completely grammatically correct)
- Meaning preservation (bad - no meaning at all or completely opposite meaning from the original, OK - somewhat changed nuance of meaning or missing an unimportant part, good - preserved original meaning)
- Simplicity (bad - difficult to understand, OK - somewhat difficult to understand, good - easy to understand)
- Overall - an overall score that combines the previous three scores, penalising problems with meaning preservation and simplicity more heavily than problems with grammaticality. It should help decide whether the automatically simplified sentence is ready to be presented to a final user (good), needs post-editing (OK), or should instead be discarded and either simplified using a different technique or left in its original form (bad).
The training set can be downloaded here:
The Grammaticality score (G) and Simplicity score (S) are assigned only on the basis of the simplified sentence (Simplified).
The Meaning preservation score (M) and the Overall score (Overall) take into account both the original sentence (Original) and its automatically simplified version (Simplified).
The test set can be downloaded here:
Human evaluation results available here:
There are two tracks: constrained and unconstrained. In the constrained track, you are allowed to use only the dataset provided by us (505 sentence pairs for training). In the unconstrained track, you can additionally use any other data for training.
The test set was released on 20th January 2016.
The participants should submit their metrics on 3rd February 2016 by mail to email@example.com and firstname.lastname@example.org.
The submission should be in the following format:
metric-name sentence-number class-value score-value
my-metric 1 good 1.78446
my-metric 2 ok 5.89343
...
It is possible to submit both classes and scores.
Participants submitting only one of the value types should assign "x" to the whole column for the other value type.
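The submission format described above can be sketched as a small helper. This is only an illustration: the function name `format_submission`, the single-space separator, and the five-decimal score formatting are assumptions made here, not requirements stated by the organisers. Sentence numbers start at 1, as in the example lines.

```python
def format_submission(metric_name, classes=None, scores=None):
    """Build submission lines: metric-name sentence-number class-value score-value.

    Hypothetical helper. A missing value type is filled with "x" for the
    whole column, as the task description requires.
    """
    if classes is None and scores is None:
        raise ValueError("provide classes, scores, or both")
    if classes is not None and scores is not None and len(classes) != len(scores):
        raise ValueError("classes and scores must have the same length")

    n = len(classes) if classes is not None else len(scores)
    lines = []
    for i in range(n):
        cls = classes[i] if classes is not None else "x"
        # Five decimals is an assumption, chosen to match the example lines.
        score = f"{scores[i]:.5f}" if scores is not None else "x"
        lines.append(f"{metric_name} {i + 1} {cls} {score}")
    return lines


if __name__ == "__main__":
    for line in format_submission("my-metric", ["good", "ok"], [1.78446, 5.89343]):
        print(line)
```

A team submitting only classes would call `format_submission("my-metric", classes=labels)` and get "x" in the score column of every line.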
Each participating team can send at most three different methods/metrics for each of the four aspects (G, M, S, Overall). The metrics will be evaluated separately for each of the four aspects.
For participants submitting classes, the metrics will be evaluated using accuracy (F-score), i.e. the percentage of sentences with a correctly predicted class, normalised over the total number of sentences.
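As a rough illustration of the accuracy computation described above, the following sketch counts the sentences whose predicted class matches the reference class and divides by the total. The function name `accuracy` and the list-based inputs are assumptions for this example. It is not the organisers' evaluation script.

```python
def accuracy(predicted, reference):
    """Fraction of sentences whose predicted class equals the reference class."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one sentence is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


if __name__ == "__main__":
    pred = ["good", "ok", "bad", "ok"]
    gold = ["good", "bad", "bad", "ok"]
    print(accuracy(pred, gold))  # three of four labels match
```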
For participants submitting raw scores, the metrics will be evaluated using Pearson's and Spearman's correlation coefficients.
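The two correlation coefficients can be sketched in plain Python as follows. The task description does not say how the human classes are turned into numbers for this comparison, so the mapping `CLASS_TO_VALUE` (bad=0, ok=1, good=2) is purely an assumption of this example, as are the function names. Spearman's coefficient is computed here as Pearson's coefficient over average ranks, which handles the many ties that three-class labels produce.

```python
import math

# Assumed numeric encoding of the human labels (not specified by the task).
CLASS_TO_VALUE = {"bad": 0, "ok": 1, "good": 2}


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient over average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    human = [CLASS_TO_VALUE[c] for c in ["good", "ok", "bad", "ok", "good"]]
    metric_scores = [1.8, 0.9, 0.1, 1.1, 2.3]
    print(pearson(metric_scores, human), spearman(metric_scores, human))
```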
Shared task results are available here:
- Accuracy, mean absolute error, root mean squared error, weighted F-score: