The course is a hands-on, research-level introduction to the areas of computer science that are directly relevant to journalism and to the broader project of producing an informed and engaged public. We study two big ideas: the application of computation to produce journalism (such as data science for investigative reporting), and journalism about areas that involve computation (such as the analysis of credit scoring algorithms).
Along the way we will touch on many topics: information recommendation systems but also filter bubbles, principles of statistical analysis but also the human processes which generate data, network analysis and its role in investigative journalism, visualization techniques and the cognitive effects involved in viewing a visualization.
Assignments will require programming in Python, but the emphasis will be on clearly articulating the connection between the algorithmic and the editorial.
Research-level computer science material will be discussed in class, but the emphasis will be on understanding the capabilities and limitations of this technology. Students with a CS background will have the opportunity for algorithmic exploration and innovation. However, the primary goal of the course is thoughtful, application-oriented research and design.
Format of the class, grading and assignments.
This is a fourteen-week, six-point course for CS & journalism dual degree students. (It is a three-point course for cross-listed students, who also do not have to complete the final project.) The class is conducted in a seminar format. Assigned readings and computational techniques will form the basis of class discussion. The course will be graded as follows:
- Assignments: 40%. There will be five homework assignments.
- Final project: 40%. Dual degree students will complete a medium-sized final project (for other students, this 40% comes from assignments).
- Class participation: 20%
Assignments will involve experimentation with fundamental computational techniques. Some assignments will require intermediate level coding in Python, but the emphasis will be on thoughtful and critical analysis. As this is a journalism course, you will be expected to write clearly. The final project can be either a piece of software (especially a plugin or extension to an existing tool), a data-driven story, or a research paper on a relevant technique.
Dual degree students will also have a final project. This will be either a research paper, a computationally-driven story, or a software project. The class is conducted on a pass/fail basis for journalism students, in line with the journalism school’s grading system. Students from other departments will receive a letter grade.
Week 1: Introduction and Clustering – 9/8
First we ask: where do computer science and journalism intersect? CS techniques can help journalism in two main ways: using computation to do journalism, and doing journalism about computation. We’ll spend most of our time on the former: data-driven reporting, story presentation, information filtering, and effect tracking. Then we jump right into clustering and the document vector space model, which we’ll need to study filtering.
- Computational Journalism, Cohen, Turner, Hamilton
- TF-IDF is about what matters, Aaron Schumacher
- Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.
- How ProPublica’s Message Machine reverse engineers political microtargeting, Jeff Larson
Viewed in class
- A full-text visualization of the Iraq war logs, Jonathan Stray
- Using clustering to analyze the voting blocs in the UK House of Lords
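The document vector space model introduced this week can be sketched in a few lines of Python. This is a minimal illustration, not the method of any of the readings: it uses one common TF-IDF variant (term frequency times log inverse document frequency), and the three-sentence corpus is invented for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, invented for illustration.
docs = [
    "the senate passed the budget bill".split(),
    "the house debated the budget bill".split(),
    "the team won the championship game".split(),
]
vectors = tf_idf_vectors(docs)
sim_politics = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(sim_politics, sim_mixed)
```

Because "the" appears in every document, its weight is zero, so the two political sentences are similar only through "budget" and "bill", and the sports sentence shares nothing with them. Clustering algorithms operate on exactly this kind of pairwise similarity.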
Week 2: Filtering Algorithms – 9/15
The filtering algorithms we will discuss this week are used in just about everything: search engines, document set analysis, figuring out when two different articles are about the same story, finding trending topics. The main topics are matrix factorization, probabilistic topic modeling (à la LDA) and more general plate-notation graphical models, and word embeddings. To bring this into practice, we will look at Columbia Newsblaster (a precursor to Google News) and the New York Times recommendation engine.
- Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown et al
- Topic modeling by hand, Shawn Graham
- How Reddit Ranking Algorithms Work, Amir Salihefendic
- Matrix Factorization Techniques for Recommender Systems, Koren et al
- Probabilistic Topic Models, David M. Blei
- Building the Next New York Times Recommendation Engine, Alexander Spangher
- Word2Vec tutorial: The Skip-gram Model, Chris McCormick
Discussed in class
- Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.
Assignment: LDA analysis of State of the Union speeches.
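As a rough illustration of the item-based collaborative filtering idea discussed in class, the sketch below predicts a missing rating from a user's ratings of similar items. It is a simplification under stated assumptions: it uses plain cosine similarity over co-rating users (the literature also uses other similarity measures), and the users, articles, and ratings are all hypothetical.

```python
import math

# Hypothetical ratings: user -> {article: rating on a 1-5 scale}.
ratings = {
    "ana":   {"budget": 5, "election": 4, "playoffs": 1},
    "ben":   {"budget": 4, "election": 5, "playoffs": 2},
    "chris": {"budget": 1, "election": 2, "playoffs": 5},
    "dana":  {"budget": 5, "playoffs": 1},  # has not rated "election"
}

def item_similarity(ratings, item_a, item_b):
    """Cosine similarity between two items over users who rated both."""
    common = [u for u, r in ratings.items() if item_a in r and item_b in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    norm_a = math.sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(ratings, user, target):
    """Similarity-weighted average of the user's ratings of other items."""
    num = den = 0.0
    for item, rating in ratings[user].items():
        sim = item_similarity(ratings, target, item)
        num += sim * rating
        den += abs(sim)
    return num / den if den else None

sim_budget = item_similarity(ratings, "election", "budget")
sim_playoffs = item_similarity(ratings, "election", "playoffs")
prediction = predict(ratings, "dana", "election")
print(sim_budget, sim_playoffs, prediction)
```

The "election" article is rated more like "budget" than like "playoffs", so dana's high rating of "budget" pulls her predicted rating for "election" upward. Note the editorial consequence: the system shows people more of what resembles what they already liked.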
Week 3: Filters as Editors – 9/22
We’ve studied filtering algorithms, but how are they used in practice — and how should they be? We will study the details of several algorithmic filtering approaches used by social networks, and effects such as polarization and filter bubbles.
- Who should see what when? Three design principles for personalized news, Jonathan Stray
- How Facebook’s Foray into Automated News Went from Messy to Disastrous, Will Oremus
- Can an algorithm be wrong?, Tarleton Gillespie
- Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter, Liu et al.
- What Happens to #Ferguson Affects Ferguson: Net Neutrality, Algorithmic Filtering and Ferguson, Zeynep Tufekci
- How does Google use human raters in web search?, Matt Cutts
Viewed in class
- Israel, Gaza, War & Data: social networks and the art of personalizing propaganda, Gilad Lotan
- What is Twitter, a Social Network or a News Media?, Haewoon Kwak et al.
Week 4: Computational Journalism Platforms – 9/29
We introduce the Overview document mining system and the Computational Journalism Workbench. Then we develop pitches for final projects, which may include writing plugins for these systems.
Guest Speaker: Alex Spangher, New York Times
Assignment: Design a filtering algorithm for an information source of your choosing.
Week 5: Quantification, Counting, and Statistics – 10/6
Every journalist needs a basic grasp of statistics. We won't focus on t-tests and the like, but on something more grounded. Where does data come from at all? How do we know we’re measuring the right thing, and measuring it properly? Then we build a solid understanding of the concepts that come up most in journalism: relative risk, conditional probability, regression and control variables, and the use of statistical models generally.
- The Quartz Guide to Bad Data, Christopher Groskopf
- The Curious Journalist’s Guide to Data: Quantification, Jonathan Stray
- Why Not to Trust Statistics, Ben Orlin
- Statistics for Decision Makers: Base Rate Fallacy, Bernard Szlachta
- Solve Every Statistics Problem with One Weird Trick, Jonathan Stray
- Operationalizing, or the function of measurement in modern literary theory, Franco Moretti
- The Curious Journalist’s Guide to Data: Prediction, Jonathan Stray
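Two of the concepts named this week, relative risk and conditional probability, reduce to short calculations. The sketch below is only an illustration: the function names are our own, and every number is hypothetical.

```python
def relative_risk(exposed_cases, exposed_total, control_cases, control_total):
    """Risk in the exposed group divided by risk in the control group."""
    return (exposed_cases / exposed_total) / (control_cases / control_total)

def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test), by Bayes' theorem."""
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_positive

# Hypothetical study: 30 of 1000 exposed people fall ill vs. 10 of 1000 unexposed.
rr = relative_risk(30, 1000, 10, 1000)

# Hypothetical screening test: 1% base rate, 90% sensitivity,
# 5% false positive rate.
p = posterior(prior=0.01, true_positive_rate=0.90, false_positive_rate=0.05)

print(f"relative risk: {rr:.1f}")
print(f"P(condition | positive): {p:.1%}")
```

With these made-up numbers the exposed group has three times the risk, yet the absolute risk rises only from 1% to 3%. Likewise, a positive result from a test that is "90% accurate" implies only about a 15% chance of having the condition, because the condition is rare. This is the base rate fallacy in miniature.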
Week 6: Inference and Persuasion – 10/13
This week is all about using data to report on ambiguous, complex, charged issues. It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis. This week includes: statistical testing and statistical significance, Bayesianism in theory and practice, determining causality, p-hacking and reproducibility, analysis of competing hypotheses.
- The Curious Journalist’s Guide to Data: Analysis, Jonathan Stray
- I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How, John Bohannon
- Why most published research findings are false, John P. A. Ioannidis
- If correlation doesn’t imply causation, then what does?, Michael Nielsen
- The Psychology of Intelligence Analysis, chapter 8, Richards J. Heuer
- The Introductory Statistics Course: a Ptolemaic Curriculum, George W. Cobb
Viewed in class
Week 7: Discrimination and Algorithmic Accountability – 10/20
Two topics this week. Discrimination is an important topic for reporters and for society, but analyzing discrimination data is more subtle and complex than it might seem. Algorithmic accountability is the study of the algorithms that regulate society, from high frequency trading to predictive policing. We’re at their mercy, unless we learn how to investigate them.
- Sex Bias in Graduate Admissions: Data from Berkeley, P. J. Bickel, E. A. Hammel, J. W. O’Connell
- Testing for Racial Discrimination in Police Searches of Motor Vehicles, Simoiu et al
- How the Journal Tested Prices and Deals Online, Jeremy Singer-Vine, Ashkan Soltani and Jennifer Valentino-DeVries
- How We Analyzed the COMPAS Recidivism Algorithm, Larson et al.
- Big Data’s Disparate Impact, Barocas and Selbst
- How Algorithms Shape our World, Kevin Slavin
Assignment: Analyze NYPD stop and frisk data for racial discrimination.
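One reason discrimination data is subtler than it looks is that aggregate rates can reverse once the data is broken down by subgroup (Simpson's paradox). The sketch below shows the effect with hypothetical admissions counts. These are invented numbers, NOT the Berkeley data from the reading, chosen only to make the reversal easy to see.

```python
# Hypothetical (applicants, admitted) counts per department and group.
# Invented for illustration; NOT the Berkeley data.
applications = {
    "Dept A": {"men": (100, 80), "women": (20, 18)},
    "Dept B": {"men": (20, 4), "women": (100, 30)},
}

def admission_rate(applied, admitted):
    return admitted / applied

def overall_rate(applications, group):
    """Admission rate for one group, pooled across all departments."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admission_rate(applied, admitted)

by_department = {
    name: {group: admission_rate(*counts) for group, counts in dept.items()}
    for name, dept in applications.items()
}
overall = {group: overall_rate(applications, group) for group in ("men", "women")}

for name, rates in by_department.items():
    print(name, {g: f"{r:.0%}" for g, r in rates.items()})
print("Overall", {g: f"{r:.0%}" for g, r in overall.items()})
```

In this toy data, women are admitted at a higher rate than men in each department, yet at a much lower rate overall, because most women applied to the department that admits few applicants. Which comparison is the right one depends on the causal story, which is a reporting question as much as a statistical one.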
Week 8: Visualization, Network Analysis – 10/27
Visualization helps people interpret information. We’ll look at design principles from user experience considerations, graphic design, and the study of the human visual system. Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, and increasingly in journalism.
- Visualization, Tamara Munzner
- Network Analysis in Journalism: Practices and Possibilities, Stray
- 39 Studies about Human Perception in 30 minutes, Kennedy Elliott
- Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists, Brehmer et al.
- Visualization Rhetoric: Framing Effects in Narrative Visualization, Hullman and Diakopoulos
- Analyzing the Data Behind Skin and Bone, ICIJ
- Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes.
- The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
- Simmelian Backbones: Amplifying Hidden Homophily in Facebook Networks. A sophisticated and sociologically-aware network analysis method.
- The network of global corporate control, Vitali et al.
- Galleon’s Web, Wall Street Journal
Assignment: Compare different centrality metrics in Gephi.
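To see why different centrality metrics can disagree, it helps to compute two of them by hand on a small graph. The sketch below uses only the Python standard library and is independent of Gephi. The network is hypothetical: two hubs, each with three contacts, joined by a single broker.

```python
from collections import deque

# Hypothetical undirected network, invented for illustration.
edges = [
    ("Hub1", "P1"), ("Hub1", "P2"), ("Hub1", "P3"),
    ("Hub2", "P4"), ("Hub2", "P5"), ("Hub2", "P6"),
    ("Hub1", "Broker"), ("Broker", "Hub2"),
]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def closeness_centrality(graph):
    """(n - 1) / sum of shortest-path distances; assumes a connected graph."""
    n = len(graph)
    scores = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        scores[source] = (n - 1) / sum(dist.values())
    return scores

graph = build_graph(edges)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
most_central = max(closeness, key=closeness.get)
print(degree["Hub1"], degree["Broker"])
print(closeness["Hub1"], closeness["Broker"], most_central)
```

By degree, the hubs matter most. By closeness, the broker does, even though it has only two connections. Which metric is "right" depends on what kind of influence the story is about.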
Week 9: Knowledge Representation
How can journalism benefit from encoding knowledge in some formal system? Is journalism in the media business or the data business? And could we use knowledge bases and inferential engines to do journalism better? This gets us deep into the issue of how knowledge is represented in a computer. We’ll look at traditional databases vs. linked data and graph databases, entity and relation detection from unstructured text, and both probabilistic and propositional formalisms. Plus: NLP in investigative journalism, automated fact checking, and more.
- Identifying civilians killed by police with distantly supervised entity-event extraction, Keith et al.
- Extracting References from Political Speech Auto-Transcripts, Brandon Roberts
- A fundamental way newspaper websites need to change, Adrian Holovaty
- Relation extraction and scoring in DeepQA, Wang et al., IBM
- The State of Automated Fact Checking, Full Fact
- Storylines as Data in BBC News, Jeremy Tarling
- Building Watson: an overview of the DeepQA project
Viewed in class
- The next web of open, linked data – Tim Berners-Lee TED talk
- Connected China, Reuters/Fathom
Assignment: Text enrichment experiments using StanfordNER entity extraction.
Week 10: Truth and Trust
Computational propaganda. Structure of information operations. Fake news detection and tagging. Credibility schema. Systems to detect and combat abuse and harassment.
Speaker: Ed Bice, Meedan
- Information Operations and Facebook
- Now Anyone can Deploy Google’s Troll-fighting AI, Wired
- NYT comment moderation game
- Recent harassment research
Week 11: Privacy, Security, and Censorship
Who is watching our online activities? How do you protect a source in the 21st century? Who gets access to all of this mass intelligence, and what does the ability to survey everything all the time mean both practically and ethically for journalism? In this lecture we will talk about who is watching and how, and how to create a security plan using threat modeling.
- Digital Security for Journalists, Part 1 and Part 2, Jonathan Stray
- Hearst New Media Lecture 2012, Rebecca MacKinnon
- CPJ journalist security guide section 3, Information Security
- Global Internet Filtering Map, OpenNet Initiative
- Unplugged: The Show part 9: Public Key Cryptography
- Diffie-Hellman key exchange, ArtOfTheProblem
- Tor Project Overview
- Who is harmed by a real-names policy, Geek Feminism
Assignment: Use threat modeling to come up with a security plan for a given scenario.
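The Diffie-Hellman key exchange from the readings can be traced with toy numbers. This is a sketch for intuition only: the parameters are far too small to be secure, the secrets are hard-coded, and real systems use vetted cryptographic libraries rather than hand-rolled arithmetic.

```python
# Public parameters, agreed on in the open (toy sizes, insecure on purpose).
p = 23  # prime modulus
g = 5   # generator

# Private values, never transmitted (hypothetical, hard-coded for the example).
alice_secret = 4
bob_secret = 3

# Each side sends g^secret mod p over the network; an eavesdropper sees these.
alice_public = pow(g, alice_secret, p)
bob_public = pow(g, bob_secret, p)

# Each side raises the other's public value to its own secret.
alice_shared = pow(bob_public, alice_secret, p)
bob_shared = pow(alice_public, bob_secret, p)

print(alice_public, bob_public)   # values visible on the wire
print(alice_shared, bob_shared)   # identical, and never sent
```

Both sides arrive at the same shared value without ever transmitting it. An eavesdropper who sees only `p`, `g`, and the two public values would have to solve a discrete logarithm to recover it, which is believed to be infeasible at realistic key sizes.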
Week 12: Tracking Flow and Impact
How does information flow in the online ecosystem? What happens to a story after it’s published? How do items spread through social networks? We’re just beginning to be able to track ideas as they move through the network, by combining techniques from social network analysis and bioinformatics.
- Metrics, Metrics everywhere: Can we measure the impact of journalism?, Jonathan Stray
- Meme-tracking and the Dynamics of the News Cycle, Leskovec et al.
- How promotion affects pageviews on the New York Times website, Brian Abelson
- NewsLynx: A Tool for Newsroom Impact Measurement, Michael Keller, Brian Abelson
- The role of social networks in information diffusion, Eytan Bakshy et al.
- Defining Moments in Risk Communication Research: 1996–2005, Katherine McComas
- Chain Letters and Evolutionary Histories, Charles H. Bennett, Ming Li and Bin Ma
- Competition among memes in a world with limited attention, Weng et al.
- In the news cycle, memes spread more like a heartbeat than a virus, Zach Seward
- How hidden networks orchestrated Stop Kony 2012, Gilad Lotan
Final projects due 12/20 (dual degree Journalism/CS students only)