CSI5386, Fall 2017

 

Assignment 2

Due Oct 20, extended to Oct 27, 22:00

 

Text Entailment and Semantic Relatedness [100 points]

 

Note: You will work in groups of two students. 

 

In this assignment, you will work with pairs of sentences and classify each pair according to whether the first sentence (the premise) entails the second sentence (the hypothesis), contradicts it, or bears no relation to it. The three classes will therefore be: ENTAILMENT, CONTRADICTION, and NEUTRAL. An entailment relation holds if the meaning of the premise sentence covers the meaning of the hypothesis sentence. Read more about the two tasks at http://alt.qcri.org/semeval2014/task1/

You will also calculate the semantic relatedness of pairs of sentences.

 

You will use the SICK dataset, which consists of 10,000 pairs of sentences annotated for semantic relatedness and entailment. The SICK data set was built starting from two existing paraphrase sets: the 8,000 ImageFlickr data set (http://nlp.cs.illinois.edu/HockenmaierGroup/data.html)

and the SEMEVAL-2012 Semantic Textual Similarity Video Descriptions data set (http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data).

 

Each sentence pair is annotated for relatedness in meaning and for the entailment relation between the two elements.

The files are tab-separated text files with the following fields (a short Python loading sketch follows the examples below):

- sentence pair ID

- sentence A (premise)

- sentence B (hypothesis)

- semantic relatedness gold label (on a 1-5 continuous scale)

- textual entailment gold label (NEUTRAL, ENTAILMENT, or CONTRADICTION)

For example:

77    People wearing costumes are gathering in a forest and are looking in the same direction Masked people are looking in the same direction in a forest      4.4   ENTAILMENT

85    People wearing costumes are gathering in a forest and are looking in the same direction A little girl in costume looks like a woman     2     NEUTRAL

88    There is no biker jumping in the air   A lone biker is jumping in the air 4.2   CONTRADICTION
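
For instance, the file can be read with a few lines of Python. This is only a sketch; it assumes the file has a header row with the column names used below (pair_ID, sentence_A, sentence_B, relatedness_score, entailment_judgment), so adjust the names if your copy of the data differs.

import csv

def load_sick(path):
    """Read a SICK-style tab-separated file into a list of dictionaries."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            pairs.append({
                "id": row["pair_ID"],
                "premise": row["sentence_A"],
                "hypothesis": row["sentence_B"],
                "relatedness": float(row["relatedness_score"]),
                "entailment": row["entailment_judgment"],
            })
    return pairs

train_pairs = load_sick("SICK_train.txt")  # hypothetical file name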

 

You will use Machine Learning (ML) algorithms to train classifiers for both tasks. I recommend using deep learning classifiers from packages such as TensorFlow, Theano, Keras, Torch, or others. In addition, you can use any other ML method for comparison. Feel free to use any tools and libraries, but you should cite them and explain how you used them.

 

The training data is available here. Use it for training classifiers.

Some trial data is available here. Use it as development data to tune the parameters of your deep learning models. The test data is available here. Use it to produce the Results.txt file. Here is a version of the test data with the expected solutions. You can use it to compute evaluation measures for your report. You can write your own evaluation script or use the provided evaluation script (in R).

 

Perform the following experiments:

 

1. [30 marks] Task 1: Text entailment.

Train a deep learning (DL) classifier to solve this task. Here is starter code in Theano that you can use, unless you prefer to write your own in another DL platform. Here is a diagram of the starter code.
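
If you choose Keras instead of the Theano starter code, a sentence-pair classifier can be sketched roughly as follows. This is only an illustration, not the starter code: the shared-LSTM architecture, vocabulary size, sequence length, and other hyperparameters are assumptions that you would tune on the trial data.

# Illustrative Keras sketch (not the provided starter code): a shared LSTM
# encodes the premise and the hypothesis, and a softmax layer predicts one
# of the three entailment classes. All sizes are assumed placeholders.
from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

MAX_LEN, VOCAB_SIZE, EMB_DIM = 30, 5000, 100   # assumed preprocessing choices

shared_embedding = Embedding(VOCAB_SIZE, EMB_DIM)
shared_lstm = LSTM(128)

premise_in = Input(shape=(MAX_LEN,), name="premise")
hypothesis_in = Input(shape=(MAX_LEN,), name="hypothesis")
premise_vec = shared_lstm(shared_embedding(premise_in))
hypothesis_vec = shared_lstm(shared_embedding(hypothesis_in))

output = Dense(3, activation="softmax")(concatenate([premise_vec, hypothesis_vec]))

model = Model(inputs=[premise_in, hypothesis_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit([X_premise, X_hypothesis], y_one_hot, epochs=10,
#           validation_data=([X_premise_dev, X_hypothesis_dev], y_dev))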

For comparison, a very simple baseline classifier, computed with the script compute_overlap_baseline.py from the SemEval webpage, achieves an accuracy of 56.15%, as computed with the R evaluation script. (This baseline method chooses the class ENTAILMENT if the overlap score between the two sentences is higher than a threshold, the class CONTRADICTION if it is lower than another threshold, and the class NEUTRAL if it is between the two thresholds.)
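
The idea behind that baseline can be sketched in a few lines of Python (this is an illustration only, not the provided compute_overlap_baseline.py script; the two thresholds are hypothetical and would need tuning on the trial data):

# Sketch of the word-overlap baseline idea (not the provided script).
def overlap_score(sent_a, sent_b):
    """Jaccard-style word overlap between two sentences."""
    tokens_a, tokens_b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)

def overlap_label(sent_a, sent_b, low=0.3, high=0.7):
    """Map the overlap score to a class; the thresholds are hypothetical."""
    score = overlap_score(sent_a, sent_b)
    if score > high:
        return "ENTAILMENT"
    if score < low:
        return "CONTRADICTION"
    return "NEUTRAL"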

The main evaluation measure will be the classification accuracy on the test data. Other measures you can look at are the confusion matrices, as well as the Precision, Recall, and F-measure for each class.
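
For example, these measures can be computed with scikit-learn (a sketch, assuming the gold labels and your predictions for the test set are available as two parallel lists):

# Sketch of the suggested Task 1 evaluation measures using scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

LABELS = ["ENTAILMENT", "CONTRADICTION", "NEUTRAL"]

def report_entailment_metrics(gold, predicted):
    """Print accuracy, confusion matrix, and per-class P/R/F for Task 1."""
    print("Accuracy: {:.4f}".format(accuracy_score(gold, predicted)))
    print(confusion_matrix(gold, predicted, labels=LABELS))
    p, r, f, _ = precision_recall_fscore_support(gold, predicted, labels=LABELS)
    for name, pi, ri, fi in zip(LABELS, p, r, f):
        print("{}: P={:.3f} R={:.3f} F={:.3f}".format(name, pi, ri, fi))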

 

2. [30 marks] Task 2: Semantic relatedness.

Train a deep learning classifier to solve this task. You can modify your code for task 1.

For comparison, a very simple baseline that computes the word overlap between the two sentences leads to a Pearson correlation of 0.62. These baseline overlap scores were produced with the script compute_overlap_baseline.py, and the correlation was computed with the R evaluation script.

The main evaluation measure will be the Pearson correlation between your scores for the test data and the gold standard ratings. Additional evaluation measures can be the mean squared error (computed on standardized scores) and the Spearman correlation.
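
These measures can be computed with NumPy and SciPy (a sketch, assuming your system scores and the gold ratings for the test set are available as two parallel arrays):

# Sketch of the Task 2 evaluation measures.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def report_relatedness_metrics(system, gold):
    """Print Pearson, Spearman, and MSE on standardized scores for Task 2."""
    system = np.asarray(system, dtype=float)
    gold = np.asarray(gold, dtype=float)
    print("Pearson correlation:  {:.4f}".format(pearsonr(system, gold)[0]))
    print("Spearman correlation: {:.4f}".format(spearmanr(system, gold)[0]))
    # Mean squared error on standardized (z-scored) values, as suggested above
    zs = (system - system.mean()) / system.std()
    zg = (gold - gold.mean()) / gold.std()
    print("MSE (standardized):   {:.4f}".format(np.mean((zs - zg) ** 2)))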

 

[20 marks] Write a report in a file named Report (.pdf, .doc, or .txt)

Explain what you did for task 1 and for task 2.

For task 1, report the accuracy of the classification on the test set for all the experiments that you ran.

For task 2, report Pearson correlation scores.

Discuss what classifier and what features led to your best results.

 

[20 marks] Results.txt

Submit the predictions of your best classifier on the test data in a file named Results.txt

Your file must contain the following 3 tab-delimited columns (a short Python writing sketch follows the list):

- pair_ID (the IDs from the test data),

- entailment_judgment (predictions of your system for task 1; possible values: ENTAILMENT, CONTRADICTION, NEUTRAL) and

- relatedness_score (numerical predictions of your system for task 2).
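
For example, the file can be written with a few lines of Python (a sketch; it assumes your predictions are stored in three parallel lists, and it writes a header row mirroring the column names above, which you can drop if the evaluation script does not expect one):

# Sketch for writing Results.txt from three parallel lists of predictions.
def write_results(path, pair_ids, entailment_preds, relatedness_preds):
    with open(path, "w", encoding="utf-8") as f:
        f.write("pair_ID\tentailment_judgment\trelatedness_score\n")
        for pid, label, score in zip(pair_ids, entailment_preds, relatedness_preds):
            f.write("{}\t{}\t{:.4f}\n".format(pid, label, score))

# write_results("Results.txt", ids, entailment_labels, relatedness_scores)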

 

Submission instructions:

   - Submit your report, and your best results (predictions for each sentence pair) in a file named Results.txt.

In the report, include:

         * the names and student numbers of the students in the group, and how the tasks were divided,

         * the methods and ML algorithms you tried and the data representations (features) you used,

         * a discussion of which methods led to the best results,

         * a detailed note about the functionality of your programs,

         * complete instructions on how to run them.

 

   - Submit your assignment as a zip file, including your programs, the Report file, and the Results.txt file, through the Virtual Campus or by email. Only one partner in a team needs to submit.

 

Have fun!!!