Week 2: Clustering

A vector of numbers is a fundamental data representation which forms the basis of very many algorithms in data mining, language processing, machine learning, and visualization. This week we will looked in depth at two things: representing objects as vectors, and clustering them, which might be the most basic thing you can do with this sort of data. This requires a distance metric and a clustering algorithm — both of which involve editorial choices. In journalism we can use clusters to find groups of similar documents, analyze how politicians vote together, or automatically detect groups of crimes.

Slides for week 2 are here.

See also a discussion of (and files for reproducing) the UK House of Lords voting analysis we did in class, this one:

Here are the main references on this material (from the syllabus):

And here are some of the other things we looked at today:

Other uses of clustering in journalism:

Comments are closed.