Homework 2: Language Models

In this assignment, we'll explore the statistical approach to language modeling using n-gram language models.

Learning objectives:

Deliverables:

  1. lm.py
  1. hw2-my-test.txt (containing your out-of-domain test set)

Submit your work via Gradescope as a Python script (note! Do not submit an IPython/Jupyter notebook file). We will be running unit tests on lm.py. You should not have any code outside of classes, functions, or an if __name__ == "__main__": conditional guard. We will be running your code with the command:

python3 lm.py berp-training.txt hw2-test.txt hw2-my-test.txt

💡
Important Gradescope notes:

  1. If you use a library that we don't have installed on the autograder, it will cause the autograder to fail. Just write us a private message on Piazza and we'll install it for you, though that may take a little bit of time.
  1. Some of the packages we use aren't supported alongside Python 3.8 on the autograding backend, so the autograder will run Python 3.6.9.

Materials

Instructions

Your program will instantiate two language models, one unigram model and one bigram model, both using Laplace smoothing. With each model, you will do the following tasks:

  1. Display 50 generated sentences from this model using Shannon’s method. (see clarification about generation + Laplace smoothing)
  1. Score the probabilities of the test sentences and display the average and standard deviation of these scores.
    1. once for the provided test set
    1. once for the test set that you curate

Do both tasks with both models. See the later instructions for detailed notes on both parts.

Grading Details

Your program will have 60 auto-graded points that judge the correct implementation of part 1. The remaining 40 points will be manually graded, split as follows:

Part 1: Build an n-gram language model (60 points)

You need to develop an n-gram model that can model any order of n-gram, but that we'll be using specifically to look at unigrams and bigrams. You'll write code that builds this language model from the training data and provides functions that take in a sentence (formatted the same as in the training data) and return the probability assigned to that sentence by your model.

For unknown words, you should use the technique described in class and in the book that does not depend on having a vocabulary beforehand. From your training data, replace any token with a frequency of 1 with the token <UNK>; collect statistics on that token as with any other word and use those counts whenever an unknown word is encountered during the evaluation phase.

For smoothing counts, your model has the option to use Laplace/add-1 smoothing. Note that add-1 smoothing happens after the <UNK> pre-processing. Therefore, you should have counts for things like (eat, <UNK>) and (<UNK>, on) and the like. These will come into play when you encounter unknown words during evaluation.
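The two preprocessing steps above can be sketched roughly as follows. This is an illustrative example, not part of the required starter code; the function and variable names are our own.

```python
from collections import Counter

UNK = "<UNK>"

def replace_singletons(sentences):
    """Replace every token whose corpus frequency is 1 with <UNK>."""
    freqs = Counter(tok for sent in sentences for tok in sent)
    return [[tok if freqs[tok] > 1 else UNK for tok in sent]
            for sent in sentences]

# Example: "b" occurs only once in the corpus, so it becomes <UNK>.
train = [["<s>", "a", "b", "</s>"], ["<s>", "a", "a", "</s>"]]
processed = replace_singletons(train)

# Bigram counts are collected AFTER the <UNK> replacement, so pairs like
# ("a", "<UNK>") get counts just like any other bigram.
bigrams = Counter((s[i], s[i + 1]) for s in processed for i in range(len(s) - 1))
```

Add-1 smoothing is then applied to these counts at scoring time, not stored in the counts themselves.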

You'll be implementing a LanguageModel class in python. We’ve left skeletons of the methods that you must implement in the starter code. Your goal is to implement the __init__, train, and score methods. You are always welcome to add more methods to the class or outside helper functions as appropriate. We recommend taking a look at previous work that you’ve done in lecture notebooks and quizzes in particular.
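As one point of reference for what score must compute, here is a minimal sketch of an add-1 smoothed bigram probability, where V is the vocabulary size (including <UNK> and sentence boundary tokens). The function name and count structures are illustrative assumptions, not the starter-code API.

```python
from collections import Counter

def bigram_score(sentence, bigram_counts, unigram_counts, vocab_size):
    """Add-1 smoothed bigram probability of a tokenized sentence:
    P(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + V).
    """
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        prob *= (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
    return prob

# Toy counts over the vocabulary {<s>, a, <UNK>, </s>}, so V = 4.
unigrams = Counter({"<s>": 2, "a": 3, "<UNK>": 1, "</s>": 2})
bigrams = Counter({("<s>", "a"): 2, ("a", "a"): 1, ("a", "</s>"): 1,
                   ("a", "<UNK>"): 1, ("<UNK>", "</s>"): 1})
p = bigram_score(["<s>", "a", "</s>"], bigrams, unigrams, vocab_size=4)
# (2+1)/(2+4) * (1+1)/(3+4) = 1/7
```

In practice you may prefer to accumulate log-probabilities and exponentiate at the end, since products of many small probabilities underflow quickly.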

Part 2: Implement Sentence Generation (15 points)

In this part, you’ll implement sentence generation for your Language Model using Shannon’s method. Start by generating the <s> token, then sampling from the n-grams beginning with <s>. Stop generating words when you hit an </s> token.

Notes:

  1. When generating sentences for unigrams, do not count the pseudoword <s> as part of the unigram probability mass after you've chosen it as the beginning token in a sentence.
  1. All unigram sentences that you generate should start with one <s> and end with one </s>
  1. For n-grams larger than 1, the sentences that you generate should start with n-1 <s> tokens and end with n-1 </s> tokens. Note that for n > 2, there should be no chance of generating a word other than </s> after your first </s> token. You may “tack these on” at the end if you would like.
  1. You should not generate novel (unseen) n-grams using this method of sentence generation.
  1. You should expect to see sentences that are just <s> </s> in the unigram case.
  1. You should expect to see sentences with <UNK>.

Part 3: Create an out-of-domain test set  & measure performance (10 points)

Find or curate an out-of-domain test set of 100 sentences. You’ll need to make sure they are formatted consistently with the provided test set: each sentence should begin with <s> and end with </s>, and should not end with punctuation. You may need to do some of this curation by hand. For each test set, score all sentences in the test set, compute the average and the standard deviation of these scores, and report these numbers to the user. Feel free to use your favorite Python library to calculate these numbers.
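For example, the statistics module in the standard library covers both numbers; the scores below are made-up placeholder values. Note that statistics.stdev computes the sample standard deviation, while statistics.pstdev computes the population version; the assignment does not mandate one, so pick one and be consistent.

```python
import statistics

scores = [0.12, 0.05, 0.08]  # illustrative sentence probabilities
mean = statistics.mean(scores)
stdev = statistics.stdev(scores)  # sample standard deviation
print(f"Average: {mean:.4f}  Std dev: {stdev:.4f}")
```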

Part 4: Flexible model (up to +3 extra-credit for CS4120)

We will only test your models on unigrams and bigrams; however, you should always strive to write flexible and extensible code. Make your LanguageModel class work for any integer value of n >= 1. We have provided training data for trigrams and 4-grams, but you may want to produce your own training data for higher-order n-grams.

💡
If you are in CS 6120, not implementing a flexible model will result in a 15 point deduction. We will be testing your models against higher-order n-grams. If your model times out the autograder, this will result in an automatic 5 point deduction, even if we are able to run the full tests locally.

Part 5: Perplexity (up to +5 extra-credit for CS4120)

Add a method with the following signature to your LanguageModel class:

def perplexity(self, test_sequence):
    """
    Measures the perplexity for the given test sequence with this trained model.
    As described in the text, you may assume that this sequence may consist of many sentences "glued together".

    Parameters:
      test_sequence (string): a sequence of space-separated tokens to measure the perplexity of
    Returns:
      float: the perplexity of the given sequence
    """

Modify your program so that, at the end, it prints out the perplexity of the first 10 sentences (glued together) of each of the two passed-in testing files: hw2-test.txt and hw2-my-test.txt
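As a reminder of the quantity involved: perplexity is the inverse probability of the test sequence, normalized by the number of tokens N, i.e. PP(W) = P(W)^(-1/N). Computing it in log space avoids underflow; the helper name below is our own illustration, not part of the required signature.

```python
import math

def perplexity_from_logprob(log_prob, num_tokens):
    """PP(W) = P(W)^(-1/N), computed in log space:
    PP = exp(-(1/N) * log P(W))."""
    return math.exp(-log_prob / num_tokens)

# A sequence of N = 4 tokens, each assigned probability 1/2, has
# log P = 4 * log(1/2), so its perplexity is exactly 2.
pp = perplexity_from_logprob(4 * math.log(0.5), 4)
```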

E.g.

# Perplexity for 1-grams:
`hw2-test.txt`: [your calculated number here]
`hw2-my-test.txt`: [your calculated number here]

# Perplexity for 2-grams:
`hw2-test.txt`: [your calculated number here]
`hw2-my-test.txt`: [your calculated number here]

💡
If you are in CS 6120, not implementing perplexity correctly will result in a 7.5 point deduction.

Clarifications/hints

Based on questions during office hours, we've added the following clarifications/hints.

Test Failed: 0.1 != 0.06666666666666667 within 3 places : tests probability of <s> flamingo, trained on iamsam2.txt