CSI4107, Fall 2013
Assignment 2
Due Monday April 1, extended till April 8, 22:00
Sentiment Analysis in Twitter Messages [100 points]
Note: You will work in groups of two students.
In this assignment, you will classify Twitter messages as expressing a positive opinion, a negative opinion, a neutral opinion, or no opinion (objective).
Read more about the task at:
http://www.cs.york.ac.uk/semeval-2013/task2/
We will focus on Task B: Message Polarity Classification: Given a message, classify it into one of the 4 classes: positive, negative, neutral, or objective.
For messages conveying both a positive and a negative sentiment, whichever is the stronger sentiment should be chosen (while neutral could mean that they are equal). Objective means that no opinion is expressed, but rather a fact is stated.
The training data consists of 5894 tweets and it is available here. The class labels will be used for training classifiers. Features need to be extracted from the training data (and their values put in the arff file).
The test data consists of 2131 tweets and it is available here. The trained classifiers will label the test data, and the labels from the file will be used only for evaluation, to compare the predicted labels with the expected labels.
The same features as in the training data should be used, except that their values need to be computed on the test data. Note that the data is not quite identical to the Semeval data and that the task was slightly modified.
The format of the data is, for each line:
<SID><tab><UID><tab><TOPIC><tab><positive|negative|neutral|objective><tab><TWITTER_MESSAGE>
Example of one line:
100032373000896513 15486118 lady gaga "positive" Wow!! Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to her music!!!! WOW!!!
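The line format above can be read with a few lines of code. The following is only a minimal sketch, assuming the files are UTF-8 text, that the fields are separated by single tab characters, and that the label may appear in double quotes as in the example line. The names Tweet, parse_line and read_tweets are hypothetical and not part of the assignment.

```python
from typing import List, NamedTuple

LABELS = ("positive", "negative", "neutral", "objective")


class Tweet(NamedTuple):
    sid: str
    uid: str
    topic: str
    label: str
    text: str


def parse_line(line: str) -> Tweet:
    """Split one tab-separated data line into its five fields."""
    # maxsplit=4 keeps any tab inside the message as part of the text.
    sid, uid, topic, label, text = line.rstrip("\r\n").split("\t", 4)
    # Assumption: the label may be wrapped in double quotes, as in the example.
    label = label.strip().strip('"')
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return Tweet(sid, uid, topic, label, text)


def read_tweets(path: str) -> List[Tweet]:
    """Read all non-empty lines of a data file (the path is up to you)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

A line with fewer than five fields raises a ValueError, which makes malformed lines easy to spot before feature extraction.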
You will use Machine Learning (ML) algorithms from a tool named Weka. First you will need to install Weka. It is written in Java. See more details and documentation about Weka. It can be used through its graphical user interface (Explorer) or directly from Java programs through its API.
You need to write a program that extracts features from the tweets and saves them in an .arff file (one file for the training data and one file for the test data). After that, you can open the arff file in Weka's GUI and run any machine learning algorithms that are appropriate for your task.
An example of a possible format for an .arff file is the following:
@RELATION example_rel
@ATTRIBUTE a1 STRING
@ATTRIBUTE a2 {Y,N}
@ATTRIBUTE a3 NUMERIC
@ATTRIBUTE a4 NUMERIC
@ATTRIBUTE class {C1, C2, C3}
@DATA
Str1,Y,1.4,0.2,C1
Str2,N,1.4,0.2,C2
Str3,Y,1.3,0.2,C1
Str1,N,1.5,0.2,C1
Str4,Y,1.4,0.2,C3
...
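A file in this format can be produced with a small helper. The sketch below is one possible way to do it, not a required design: the names arff_quote, render_arff and write_arff are hypothetical, attribute names are assumed to be simple identifiers without spaces, and string values are single-quoted with backslash escapes.

```python
def arff_quote(value):
    """Single-quote a string value, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def render_arff(relation, attributes, rows):
    """Build the text of an .arff file.

    attributes: list of (name, kind) pairs, where kind is "NUMERIC",
                "STRING", or a list of nominal values such as ["Y", "N"].
    rows:       list of value lists, in the same order as the attributes.
    """
    lines = [f"@RELATION {relation}"]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@ATTRIBUTE {name} {spec}")
    lines.append("@DATA")
    for row in rows:
        cells = []
        for (_name, kind), value in zip(attributes, row):
            # Only STRING values are quoted; numeric and nominal ones are not.
            cells.append(arff_quote(value) if kind == "STRING" else str(value))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


def write_arff(path, relation, attributes, rows):
    """Write the rendered text to a file chosen by the caller."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_arff(relation, attributes, rows))
```

Calling write_arff once for the training data and once for the test data, with the same attribute list, keeps the two files compatible.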
Try at least three classifiers from
Weka. The main ones to try are SVM (SMO in Weka) because it tends to get the
best results, Naive Bayes because it works well with texts, and Decision Trees
(J48 in Weka) because you can see the tree that is learnt.
Perform the following experiments:
1. [30 marks]
Train a classifier using the bag-of-words (BOW) representation. This means using the words from the messages as features in the arff file. You can eliminate stop words, rare words, punctuation, etc. in order to reduce the dimension of the vector space.
2. [30 marks]
Add more features and train more classifiers, in order to try to improve the classification results. For example, using the emoticons from the texts as features should help. Using punctuation marks such as !, !!, !!!, ??, ???, and other elongations could help. Other features can be the number of positive words in the message and the number of negative words in the message (you can use lists of positive and negative words in order to count these kinds of words). Try at least one of the resources listed below. (If you use more than one, you can use separate features for the number of positive/negative words from the messages that are found in each resource individually.)
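The two kinds of features described in these steps could be computed along the following lines. This is a rough sketch under several assumptions: the stop word list and the positive/negative word lists are tiny placeholders (not taken from any of the resources named in this assignment), the regular expressions cover only a few common emoticons, and all function names are hypothetical.

```python
import re
from collections import Counter

# Placeholder lists for illustration only; replace them with real resources.
STOP_WORDS = {"a", "an", "and", "at", "is", "of", "the", "to"}
POSITIVE_WORDS = {"good", "great", "love", "wow"}
NEGATIVE_WORDS = {"awful", "bad", "hate"}

TOKEN_RE = re.compile(r"[a-z']+")
POS_EMOTICON_RE = re.compile(r"[:;=]-?[)D]")   # e.g. :) ;-) =D
NEG_EMOTICON_RE = re.compile(r"[:=]-?\(")      # e.g. :( :-(
REPEATED_PUNCT_RE = re.compile(r"[!?]{2,}")    # e.g. !! ??? !?


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def build_vocabulary(training_texts, min_count=2):
    """Collect words from the training data, dropping rare ones."""
    counts = Counter(tok for text in training_texts for tok in tokenize(text))
    return sorted(word for word, n in counts.items() if n >= min_count)


def bow_vector(text, vocabulary):
    """Binary bag-of-words vector: 1 if the word occurs in the message."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def extra_features(text):
    """Counts of lexicon words, emoticons and repeated punctuation."""
    tokens = tokenize(text)
    return {
        "n_positive_words": sum(t in POSITIVE_WORDS for t in tokens),
        "n_negative_words": sum(t in NEGATIVE_WORDS for t in tokens),
        "n_positive_emoticons": len(POS_EMOTICON_RE.findall(text)),
        "n_negative_emoticons": len(NEG_EMOTICON_RE.findall(text)),
        "n_repeated_punct": len(REPEATED_PUNCT_RE.findall(text)),
    }
```

The vocabulary is built from the training tweets only and then reused unchanged for the test tweets, so that both arff files contain the same attributes.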
[20 marks] Write a report in a file named Report (.pdf, .doc, or .txt).
Explain what you did for step 1, and what extra features you computed in step 2.
Report the accuracy of the classification on the test set for all the experiments that you ran, for the three classifiers (SVM, NB, DT), together with the confusion matrices and the Precision, Recall, and F-measure for each of the classes, as calculated by Weka.
Discuss what classifier and what
features led to your best results.
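The numbers to report come from Weka, but it can help to see how they relate to a confusion matrix. The sketch below is only a sanity check under the assumption that rows are the actual classes and columns are the predicted classes; the function names are hypothetical.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f_measure)}.

    matrix[i][j] is the number of instances whose actual class is
    labels[i] and whose predicted class is labels[j].
    """
    results = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f_measure)
    return results
```

If the values computed this way disagree with the ones in Weka's output, the matrix was probably read with rows and columns swapped.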
[20 marks] Results.txt
Submit the predictions for the test set in a file named Results.txt, as calculated by Weka on the test set (select the option Output predictions). The format should be the one produced by Weka. You can copy and paste Weka's results into the Results.txt file.
Resources: You may use any resources you want. Include in the Report file explanations of how you used them.
Here are resources that include
lists of positive and negative words: General Inquirer, LIWC, List of Adjectives
with semantic orientation (from Maite Taboada), Polarity lexicon (from Theresa
Wilson), SentiWordNet, list from Kim and Hovy (automatically produced but
contains a lot more words than the manually produced lists), etc.
Submission instructions:
- Submit your report and your best results on the test set (in a file Results.txt).
In the report include:
* the names and student numbers of the students in the group, and how the tasks were divided,
* an explanation of what you did for steps 1 and 2, what ML algorithms you tried, and what data representations (features) you used,
* a discussion of which classification method and feature representation led to the best results,
* a detailed note about the functionality of your programs that extract features,
* complete instructions on how to run them.
- Submit your assignment, including the programs, the Report file, and the Results.txt file, through Blackboard Learn.
Have fun!!!