In this assignment you will implement the TF-IDF formula and use it to study the topics in State of the Union speeches given every year by the U.S. president.
1. Download the source data file state-of-the-union.csv. This is a standard CSV file with one speech per row. There are two columns: the year of the speech and the text of the speech. You will write a Python program that reads this file, turns it into TF-IDF document vectors, and then prints out some information. Here is how to read a CSV in Python. You may need to add a line to the top of your program to be able to read this large file.
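One way to read the file might look like the sketch below. The function name `read_speeches` is my own choice, and the exact field-size value is an assumption; the key point is that Python's `csv` module caps field sizes by default, and some speeches exceed that cap.

```python
import csv
import sys

# Some speeches are longer than csv's default field size limit,
# so raise it before reading (the exact value here is a guess;
# any sufficiently large number works).
csv.field_size_limit(min(sys.maxsize, 2**31 - 1))

def read_speeches(path="state-of-the-union.csv"):
    """Return a list of (year, text) pairs, one per speech."""
    speeches = []
    with open(path, newline="", encoding="utf-8") as f:
        for year, text in csv.reader(f):
            speeches.append((year, text))
    return speeches
```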
2. Tokenize the text of each speech to turn it into a list of words. As we discussed in class, we’re going to tokenize using a simple scheme:
- convert all characters to lowercase
- remove all punctuation characters
- split the string on spaces
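The three bullets above can be sketched as a single function (the name `tokenize` is mine; note that `str.split()` with no argument splits on any whitespace, which for this data amounts to splitting on spaces):

```python
import string

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    # Delete every punctuation character via a translation table.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()
```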
3. Compute a TF (term frequency) vector for each document. This is simply how many times each word appears in that document. You should end up with a Python dictionary from terms (strings) to term counts (numbers) for each document.
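A minimal sketch of step 3, using the standard library's `Counter` to do the counting (the helper name `term_frequency` is my own):

```python
from collections import Counter

def term_frequency(tokens):
    """Map each term in one document to how many times it appears."""
    return dict(Counter(tokens))
```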
4. Count how many documents each word appears in. This can be done after computing the TF vector for each document: for every word that appears in a document’s TF vector, increment that word’s document count by one. After reading all the documents you should have a dictionary from each term to the number of documents that term appears in.
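Step 4 might be sketched like this, assuming the per-document TF dictionaries from step 3 are collected in a list (the name `document_frequency` is my own):

```python
def document_frequency(tf_vectors):
    """Count, for each term, how many documents it appears in."""
    df = {}
    for tf in tf_vectors:
        for term in tf:  # each distinct term counts once per document
            df[term] = df.get(term, 0) + 1
    return df
```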
5. Turn the final document counts into IDF (inverse document frequency) weights by applying the formula IDF(term) = log(total number of documents / number of documents that term appears in).
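The formula in step 5 translates directly into code (the helper name `idf_weights` is mine; a term appearing in every document gets weight log(1) = 0):

```python
import math

def idf_weights(df, num_documents):
    """IDF(term) = log(total documents / documents containing the term)."""
    return {term: math.log(num_documents / count) for term, count in df.items()}
```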
6. Now multiply the TF vectors for each document by the IDF weights for each term, to produce TF-IDF vectors for each document.
7. Then normalize each vector so that the sum of its squared weights is 1. You can do this by dividing each component of the document vector by the vector’s length, using the usual Euclidean formula: the length is the square root of the sum of the squared components.
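Steps 6 and 7 together might look like the sketch below (the function names are my own; the all-zero case is guarded so we never divide by zero):

```python
import math

def tfidf_vector(tf, idf):
    """Multiply each term count by that term's IDF weight."""
    return {term: count * idf[term] for term, count in tf.items()}

def normalize(vec):
    """Scale the vector to unit length (Euclidean norm)."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    if length == 0:
        return dict(vec)  # all-zero vector; nothing to scale
    return {term: w / length for term, w in vec.items()}
```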
8. Congratulations! You have a set of TF-IDF vectors for this corpus. Now it’s time to see what they say. Take the speech you were assigned in class, and print out the highest weighted 20 terms, along with their weights. What do you think this particular speech is about? Write your answer in at most 200 words.
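Printing the top terms for step 8 only requires sorting a vector by weight; a sketch (the helper name `top_terms` is my own):

```python
def top_terms(vec, n=20):
    """Return the n highest-weighted (term, weight) pairs, heaviest first."""
    return sorted(vec.items(), key=lambda item: item[1], reverse=True)[:n]
```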
9. Your task now is to see if you can understand how the topics changed since 1900. For each decade since 1900, do the following:
- sum all of the TF-IDF vectors for all speeches in that decade
- print out the top 20 terms in the summed vector, and their weights
Now take a look at the terms for each decade. What patterns do you see? Can you connect the terms to major historical events? (wars, the Great Depression, assassinations, the civil rights movement, Watergate…) Write up what you see in narrative form, no more than 500 words, referring to the terms for each decade.
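The per-decade summing in step 9 might be sketched as follows, assuming each speech is a (year, normalized TF-IDF vector) pair; the name `decade_sums` and the pre-1900 filtering are my own choices:

```python
def decade_sums(years_and_vectors):
    """Sum the TF-IDF vectors of all speeches within each decade since 1900."""
    sums = {}
    for year, vec in years_and_vectors:
        if int(year) < 1900:
            continue  # the assignment only covers decades since 1900
        decade = (int(year) // 10) * 10
        total = sums.setdefault(decade, {})
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return sums
```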
10. Hand in by email, before class next week:
- your code
- the printout and analysis from step 8
- the printout and narrative from step 9.
You will be marked on two things: 1) the correctness of your code (but not code style, I don’t care about that in this class) and 2) the quality of your writeup. I am looking for you to connect the patterns in the data with the historical context. The resulting writeup must be interesting and informative to someone who does not know or care about the data.