CSI5386, Winter 2024

 

Assignment 2

Due Sunday March 24, 10pm

 

Machine-Generated Text Detection [100 points]

 

Note: You will work in your previously formed groups of students.

 

In this assignment, you will classify texts as having been generated by a Large Language Model (LLM) or not.

 

The data to use is part of SemEval 2024 Task 8, Multidomain, Multimodel and Multilingual Machine-Generated Text Detection: https://github.com/mbzuai-nlp/SemEval2024-task8

We will focus only on Subtask A, Binary Human-Written vs. Machine-Generated Text Classification: given a full text, determine whether it is human-written or machine-generated. Use only the monolingual data for Subtask A, that is, only the English texts. Please ignore the multilingual data.

Download the training data, the file subtaskA_train_monolingual.jsonl, from the Google Drive link for Subtask A: https://drive.google.com/drive/folders/1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc (there is also a dev data file there; using it is optional). Subtask A (monolingual) contains 119,757 texts for training and 5,000 texts as dev data, with labels, in JSONL format; each object has the following fields:

{

  id -> identifier of the example,

  label -> label (human text: 0, machine text: 1),

  text -> text generated by a machine or written by a human,

  model -> model that generated the data,

  source -> source (Wikipedia, Wikihow, Peerread, Reddit, Arxiv) for English, or language (Arabic, Russian, Chinese, Indonesian, Urdu, Bulgarian, German) for the multilingual data

}
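
For reference, here is a minimal sketch of loading the training file in Python. This is only an illustration, not required code; the file path is an assumption and should point to wherever you downloaded the data.

import json

train_path = "subtaskA_train_monolingual.jsonl"  # adjust to your local path

# Each line of the .jsonl file is one JSON object with the fields listed above.
with open(train_path, encoding="utf-8") as f:
    train = [json.loads(line) for line in f]

texts = [example["text"] for example in train]
labels = [example["label"] for example in train]   # 0 = human, 1 = machine
print(len(texts), "training examples")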

 

The test data is available in the file subtaskA_monolingual.jsonl at

https://drive.google.com/drive/folders/10DKtClzkwIIAatzHBWXZXuQNID-DNGSG

The expected solution (gold standard) is available, in a file with the same name, at

https://drive.google.com/drive/folders/13aFJK4UyY3Gxg_2ceEAWfJvzopB1vkPc

 

Your system should produce a prediction file, one single JSONL file for all the texts in the test data. The entry for each text must include the fields "id" and "label".

There is a format checker if you want to verify that your prediction file complies with the expected format. It is located in the format_checker module in the subtask A directory.
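
As an illustration, the following sketch writes a prediction file in the expected format; test_examples and predictions are assumed to be your test data (loaded the same way as the training data above) and your model's 0/1 predictions, in the same order.

import json

# test_examples: list of dicts loaded from the test .jsonl file
# predictions: your model's 0/1 predictions, aligned with test_examples
with open("Results.jsonl", "w", encoding="utf-8") as f:
    for example, pred in zip(test_examples, predictions):
        f.write(json.dumps({"id": example["id"], "label": int(pred)}) + "\n")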

 

This is a mini-competition to see which team can achieve the best accuracy. There is a scorer script provided (get the one for subtask A).

The main evaluation metric is accuracy. However, the scorer also reports macro-F1 and micro-F1. The scorer is run from the command line with two files as parameters (the expected solution and your solution), something like:

 

python3 subtaskA/scorer/scorer.py --gold_file_path=<path_to_gold_labels> --pred_file_path=<path_to_your_results_file>

 

Please submit a file named Results.jsonl for your best model, for the test data.

Please compute the scores (for the test data) with the provided evaluation script for all the models you trained and put them in your report. Then explain which one achieved your best accuracy.

 

You can use any ML/DL algorithms, tools, or existing software, as long as you explain what you used in your report and give credit properly. If you use existing code that solves the problem, make sure you add something to it.

 

Perform the following experiments:

 

1. [20 marks] Train a simple classifier as a baseline. It could be a traditional classifier (SVM, Random Forest, NB, or other) or one that uses pre-trained deep learning models (pre-trained word embeddings, text embeddings, or other models, fine-tuned or not). In fact, two baselines based on transformers are provided; you can run at least one of them and explain in your report what method was used and what accuracy you obtained.
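
As one possible traditional baseline (distinct from the provided transformer baselines), the sketch below trains a TF-IDF plus linear SVM classifier with scikit-learn. It is only an illustration; texts and labels are the training data loaded as shown earlier, and test_texts would be the test texts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hold out 10% of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels)

# Word and bigram TF-IDF features fed into a linear SVM.
baseline = make_pipeline(
    TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),
    LinearSVC())
baseline.fit(X_train, y_train)

print("validation accuracy:", accuracy_score(y_val, baseline.predict(X_val)))
# test_predictions = baseline.predict(test_texts)  # then write Results.jsonl as shown above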

 

2. [40 marks] Train at least two advanced classifiers based on deep learning: for the first method, fine-tune a type of BERT model (though not the version used in the baseline from part 1); for the second method, use a recent type of generative LLM (such as Llama or something equivalent). Use part of the training data for validation (or use the dev data for validation) when building your models, and keep the test data aside for the final testing. (Alternatively, for the second method you can try prompt-based learning with LLMs.)
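
For the first method, one possible starting point is fine-tuning a BERT-type model with the Hugging Face Transformers Trainer. The sketch below is illustrative only (the roberta-base checkpoint and the hyperparameters are assumptions, not requirements); texts/labels and dev_texts/dev_labels are assumed to come from the training and dev files loaded as above.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"   # illustrative choice; any BERT-type checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# texts/labels: training data; dev_texts/dev_labels: dev data or a held-out split.
train_ds = Dataset.from_dict({"text": texts, "label": labels})
dev_ds = Dataset.from_dict({"text": dev_texts, "label": dev_labels})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
dev_ds = dev_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert_detector",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds,
                  tokenizer=tokenizer)   # passing the tokenizer enables dynamic padding
trainer.train()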

 

3. [20 marks] Write a report in a file named Report (.pdf, .doc, or .txt).

Explain what you did for 1) and 2).

Report the accuracy of the classification on the test set for all the experiments that you ran. Also report macro-F1 and micro-F1. Discuss which classifier/method led to your best results. Put the accuracies of your two best models (plus the accuracy of the baseline) in a table like the one below.

 

Model                   | Accuracy
------------------------|---------
Baseline (what model)   |
Model 1                 |
Model 2                 |

4. [20 marks] Results.jsonl

Submit the predictions of your best classifier on the test set in a file named Results.jsonl. The format should be the one specified above.

 

Submission instructions:

   - Submit your report and your best results for each test text in a file Results.jsonl:

In the report include:

         * the names and student numbers of the students in the group, and specify how the tasks were divided,

         * explain what you did in your experiments, what ML/DL algorithms you tried

         * discuss what classification method and text representation led to the best results

         * a detailed note about the functionality of your programs

         * complete instructions on how to run them

   - Submit your assignment as a zip file, including your programs, the Report file, and the Results.jsonl file, through BrightSpace.

 

Have fun!!!