GitHub - mtlynch3/language-id: Performs a language identification task using n-grams

Repository contents:
.DS_Store
README.txt
bigrams.py
dicts.py
hw2.py
lang_test
output_letter.txt
output_word.txt
solutions
train_english
train_french
train_italian

Melissa Lynch
LING 406: Introduction to Computational Linguistics
Spring 2017

Assignment 2: Language models applied to the task of language identification

FILES
bigrams.py
Given. I modified the file to give unigrams as well as bigrams.

dicts.py
Given.

hw2.py
Main file. Builds both unigram and bigram counts for each language from that language's training data file. For each line in the test data, the program compares the line's bigram frequencies against the bigram frequencies in each language's training data. The language with the highest probability is chosen.
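
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code in hw2.py: the function names, the toy training strings, and the use of add-one smoothing with log probabilities are all assumptions made for the example, since the README does not say how unseen bigrams are handled.

```python
import math
from collections import Counter

def letter_bigrams(text):
    """Return the adjacent character pairs in text, in order."""
    return list(zip(text, text[1:]))

def train(text):
    """Count letter unigrams and letter bigrams in a training string."""
    return Counter(text), Counter(letter_bigrams(text))

def log_prob(line, unigrams, bigrams, vocab_size):
    """Log probability of line under a bigram model.

    Add-one smoothing is an assumption of this sketch; it keeps
    unseen bigrams from getting zero probability.
    """
    total = 0.0
    for first, second in letter_bigrams(line):
        numerator = bigrams[(first, second)] + 1.0
        denominator = unigrams[first] + vocab_size
        total += math.log(numerator / denominator)
    return total

def identify(line, models, vocab_size):
    """Return the language whose model gives line the highest score."""
    def score(lang):
        unigrams, bigrams = models[lang]
        return log_prob(line, unigrams, bigrams, vocab_size)
    return max(models, key=score)

# Toy training data (hypothetical, far smaller than the real files).
TRAINING = {
    "English": "the cat sat on the mat with the hat",
    "French": "le chat est sur le tapis avec le chapeau",
    "Italian": "il gatto sta sul tappeto con il cappello",
}
MODELS = dict((lang, train(text)) for lang, text in TRAINING.items())
# Vocabulary size: distinct characters seen in any training text.
VOCAB_SIZE = len(set("".join(TRAINING.values())))

guess = identify("the hat", MODELS, VOCAB_SIZE)
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long lines. The same functions would work for word bigrams if each line were split into a list of words first.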

To run, use Python 2.7. The program takes four file arguments followed by one option. The first argument is the English training data file, the second is the French training data file, the third is the Italian training data file, and the fourth is the test data file. The option is required and must be one of two values: "-l" runs the program using letter bigrams, and "-w" runs the program using word bigrams. The "-l" option directs output to the file "output_letter.txt", and the "-w" option directs output to the file "output_word.txt".

Example:

python hw2.py english_train.txt french_train.txt italian_train.txt test_data.txt -l
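
One way the argument handling described above might look is sketched here. The helper name parse_args and the returned dictionary keys are hypothetical and are not taken from hw2.py; only the argument order, the two options, and the two output file names come from this README.

```python
def parse_args(argv):
    """Split a hw2.py-style argument list into its parts.

    argv is the argument list without the script name, i.e. four
    data files followed by "-l" or "-w".
    """
    if len(argv) != 5:
        raise ValueError("expected 4 data files followed by -l or -w")
    english, french, italian, test, option = argv
    # Output file names as documented in this README.
    outputs = {"-l": "output_letter.txt", "-w": "output_word.txt"}
    if option not in outputs:
        raise ValueError("option must be -l or -w, got %r" % option)
    return {
        "train": {"English": english, "French": french, "Italian": italian},
        "test": test,
        "unit": "letter" if option == "-l" else "word",
        "output": outputs[option],
    }

settings = parse_args(
    ["english_train.txt", "french_train.txt", "italian_train.txt",
     "test_data.txt", "-l"]
)
```

In a real script the list would come from sys.argv[1:].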

ANALYSIS
The letter bigrams produced more accurate output than the word bigrams. The letter bigram model misclassified one line (line 22), and the word bigram model misclassified three (line 44, line 244, line 262). More than 86% of the English word bigrams have a frequency less than 3; compare this to only 22% of the English letter bigrams. Because most word bigrams are seen only once or twice in the training data, their counts are poor estimates, and the frequencies for the letter bigrams are therefore a better representation of their frequencies in the language as a whole. For example, it's very possible that in our data, the bigram ('speak', 'English') has a frequency of 1; the bigram ('green', 'elephant') could also have a frequency of 1. This would mean that the probability of ('speak', 'English') occurring is the same as the probability of ('green', 'elephant') occurring. However, for English in general, the bigram ('speak', 'English') is significantly more likely than ('green', 'elephant').
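
The sparsity comparison above can be reproduced in miniature with a sketch like the one below. It assumes the percentages are taken over distinct bigram types rather than over tokens, which the analysis does not state. The sample sentence and the function names are made up for illustration, so the numbers it produces are not the 86% and 22% reported above.

```python
from collections import Counter

def bigrams(items):
    """Return adjacent pairs from a string (letters) or a list (words)."""
    return list(zip(items, items[1:]))

def share_below(counts, threshold=3):
    """Fraction of distinct bigrams whose count is below threshold."""
    rare = sum(1 for count in counts.values() if count < threshold)
    return float(rare) / len(counts)

# Hypothetical sample text standing in for a training file.
text = "the cat sat on the mat and the cat ate the rat"

word_counts = Counter(bigrams(text.split()))
letter_counts = Counter(bigrams(text))

word_share = share_below(word_counts)
letter_share = share_below(letter_counts)
```

Even on this tiny sample, every word bigram occurs fewer than three times, while several letter bigrams such as ('t', 'h') and ('a', 't') repeat often enough to clear the threshold.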