CSI 5386: NLP
Project Description
Project Proposal (2-3 pages) Due: After the reading week
Final Project Report Due: At the end of the exam period
Project presentation: Last class
Demo (optional during presentation)
Introduction
In this project, you are expected (1) to select a particular area
of NLP that interests you, (2) to conduct a literature search on this area, (3)
to focus on a specific problem in the area you selected, and (4a) to design and
implement a novel learning scheme or (4b) to extend an existing scheme to deal
with the problem you have identified. Alternatively (4c), you can compare the
performance of different existing schemes on the specific problem you
identified in steps (1) to (3), using several corpora.
It is important to start working on this project early. I suggest that you
start reading the textbook, some of its suggested follow-up material,
conference proceedings, journals, and papers available from the Web, early
enough to settle quickly on a subject of interest to you. I will be available
for discussions both before the project proposal is due and after that, during
the development of your research.
To help you select a topic, here is a list of project suggestions, though you
are more than welcome to propose your own idea.
Sources of datasets and project ideas:
· SemEval
· CLEF
· Kaggle (search for text data)
· TREC
Other project suggestions
- Build neural language models for different applications and compare them
to other types of language models.
- Extract information from medical texts (patient data or scientific
articles).
- Detect topics, events, opinions, or user profiles from social media texts.
Apply deep learning techniques.
- Implement a system for automatic classification and information extraction
from medical articles. Apply deep learning techniques.
- Implement a system for automatic classification of poems by themes or
styles.
- Compare the performance of several terminology extraction systems on
several corpora. Describe the strengths and weaknesses of each of them.
- Compare the performance of several tools for extracting multi-word
expressions.
- Compare the performance of various machine learning tools on different
representations of the REUTERS text categorization data set (e.g., document
embeddings, word embeddings, bag-of-words representation, keyword
representation, bag-of-words representation of summaries of the text, etc.).
- Design a method for establishing the degree of similarity between two
documents in different languages, for example using word embeddings or
neural topic models.
- Design a system that makes use of a bilingual corpus or Wikipedia pages to
perform word sense disambiguation (or compare WSD systems).
- Design a system that improves an existing machine translation system in
some respect (such as word order, verb tense, choice of preposition, or word
sense disambiguation).
- Design a system that detects proper nouns and/or geographical entities in
text (or other kinds of entities and relations between entities).
- Compare the performance of several part-of-speech taggers, especially
taggers based on deep learning, on social media texts versus newspaper texts.
- Compare the performance of several parsers or chunkers, especially those
based on deep learning, on social media texts versus newspaper texts.
- Develop any of the above systems for languages other than English. French
is of special interest.
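
To make the idea of "different representations" in the text categorization suggestion above more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names (tokenize, bag_of_words, keyword_representation, cosine), the two example sentences, and the choice of "the k most frequent tokens" as a keyword representation are all hypothetical and are not prescribed by the course. A real project would use an actual corpus and proper tools.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Bag-of-words representation: every token mapped to its count."""
    return Counter(tokenize(text))


def keyword_representation(text, k=3):
    """Assumed keyword representation: keep only the k most frequent tokens."""
    return Counter(dict(bag_of_words(text).most_common(k)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Two made-up example sentences, compared under both representations.
doc1 = "Oil prices rose as oil exporters cut output."
doc2 = "Exporters cut oil output and prices rose."
print(cosine(bag_of_words(doc1), bag_of_words(doc2)))
print(cosine(keyword_representation(doc1), keyword_representation(doc2)))
```

The same pair of documents can score differently depending on the representation, which is the kind of effect a comparison project would measure systematically across classifiers and corpora.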