Assignment 3: Entity extraction | Computational Journalism

For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service. You’ll do this by building a text enrichment program, which takes plain text and outputs HTML with links to the detected entities. Then you will take five random news articles, enrich them, and manually count how many entities OpenCalais missed or got wrong.

1. Get an OpenCalais API key, from this page.

2. Install the python-calais module. This will allow you to call OpenCalais from Python easily. First, download the latest version of python-calais. To install it, you just need calais.py in your working directory. You will probably also need to install the simplejson Python module. Download it, then run “python setup.py install.” You may need to execute this as super-user.

Edit: python-calais no longer works.

Here’s the info on the new Open Calais API in Python. All I needed to know was here: http://www.opencalais.com/wp-content/uploads/2015/06/Thomson-Reuters-Open-Calais-Upgrade-Guide-v4.pdf

Here’s a little Python excerpt using the Requests library to make the call.

def analyze(content_string):
   header_content = {'Content-Type': 'text/raw',
'x-ag-access-token': API_KEY,
'outputFormat': 'application/json'}
   r = requests.post('https://api.thomsonreuters.com/permid/calais',
headers=header_content,
data=content_string)
   return r.json()

3. Call OpenCalais from Python. Make sure you can successfully submit text and get the results back, following these steps. The output you want to look at is in the entities array, which would be accessed as “results.entities” using the variable names in the sample code. In particular you want the list of occurrences for each entity, in the “instances” field.

>>> result.entities[0]['instances']
[{u'suffix': u' is the new President of the United States', u'prefix': u'of the United States of America until 2009.  ', u'detection': u'[of the United States of America until 2009.  ]Barack Obama[ is the new President of the United States]', u'length': 12, u'offset': 75, u'exact': u'Barack Obama'}]
>>> result.entities[0]['instances'][0]['offset']
75
>>>

Each instance has “offset” and “length” fields that indicate where in the input text the entity was referenced. You can use these to determine where to place links in the output HTML.

Edit: OpenCalais returns a different result now, more like this

for result[x] where x!="doc"
   for result[x]["_typeGroup"] == "entities"
      offset = result[x]["instances"][j]["offset"]
      length = result[x]["instances"][j]["length"]
      url = result[x]["resolutions"][j]["id"]

4. Read a text file, create hyperlinks, and write it out. Your Python program should read text from stdin and write HTML with links on all detected entities to stdout. There are two cases to handle, depending on how much information OpenCalais gives back.

In many cases, like the example in step 3, OpenCalais will not be able to give you any information other than the string corresponding to the entity, result.entities[x][‘name’]. In this case you should construct a Wikipedia link by simply appending to the name to a Wikipedia URL, converting spaces to underscores, e.g.

http://en.wikipedia.org/wiki/Barack_Obama

In other cases, especially companies and places, OpenCalias will supply a link to an RDF document that contains more information about the entity. For example.

>>> result.entities[0]{u'_typeReference': u'http://s.opencalais.com/1/type/em/e/Company', u'_type': u'Company', u'name': u'Starbucks', '__reference': u'http://d.opencalais.com/comphash-1/6b2d9108-7924-3b86-bdba-7410d77d7a79', u'instances': [{u'suffix': u' in Paris.', u'prefix': u'of the United States now and likes to drink at ', u'detection': u'[of the United States now and likes to drink at ]Starbucks[ in Paris.]', u'length': 9, u'offset': 156, u'exact': u'Starbucks'}], u'relevance': 0.314, u'nationality': u'N/A', u'resolutions': [{u'name': u'Starbucks Corporation', u'symbol': u'SBUX.OQ', u'score': 1, u'shortname': u'Starbucks', u'ticker': u'SBUX', u'id': u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'}]}
>>> result.entities[0]['resolutions'][0]['id']
u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'
>>>

Edit: change JSON paths to new output format. Also: remind to use offset/length to find entities in input text, don’t search as capitalization etc. may be different (also pronouns)

In this case the resolutions array will contain a hyperlink for each resolved entity, and this is where your link should go. The linked page will contain a series of triples (assertions) about the entity, which you can obtain in machine-readable from by changing the .html at the end of the link to .json. The sameAs: links are particularly important because they tell you that this entity is equivalent to others in dbPedia and elsewhere.

Here is more on OpenCalias’ entity disambiguation and use of linked data.

The final result should look something like below. Note that some links go to OpenCalais entity pages with RDF links on them (“London”), some go to Wikipedia (“politician”) and some are broken links when Wikipedia doesn’t have the topic (“Aarthi Ramachandran”) And of course Mr Gandhi is an entity that was not detected, three times.

The latest effort to “decode” Mr Gandhi comes in the form of a limited yet rather well written biography by a political journalist, Aarthi Ramachandran. Her task is a thankless one. Mr Gandhi is an applicant for a big job: ultimately, to lead India. But whereas any other job applicant will at least offer minimal information about his qualifications, work experience, reasons for wanting a post, Mr Gandhi is so secretive and defensive that he won’t respond to the most basic queries about his studies abroad, his time working for a management consultancy in London, or what he hopes to do as a politician.

Don’t worry about producing a fully valid HTML document with headers and a <body> tag, just wrap each entity with <a href=”…”> and </a>. Your browser will load it fine.

5. Pick five random news stories and enrich them. Pick an English-language news site with many stories on the home page, or a section of such a site (business, sports, etc.) Then generate five random numbers from 1 to the number of stories on the page. Cut and paste the text of each article into a separate file, and save as plain text (no HTML, no formatting.)

Edit: go through articles by hand counting entities. Really need to drive the point home that we can only compute precision/recall relative to a human ground truth.

6. Read the enriched documents and count to see how well OpenCalais did. You need to read each output document very carefully and count three things:

Entity references. Count each time there is a name of a person, place, or organization appears, or other references to these things (e.g. “the president.”)
Correctly detected entities. Count an entity as correctly detected if the link goes to the right page — OpenCalais RDF pages where possible, Wikipedia when not.
Incorrectly detected entities. Count as incorrect if the link goes to the wrong page, a disambiguation page in Wikipedia. Also, a broken link counts as an incorrect reference.

7. Turn in your work. Please turn in:

Your code
The enriched output from your documents
A brief report describing your results.
The totals as described below

The report should include a table of the three numbers — references, correct, incorrect — for each document, plus the totals of these three numbers across all documents. Also report on any patterns in the failures that your see. Where is OpenCalais most accurate? Where is it least accurate? Are there predictable patterns to the errors?

This assignment is due before class on Friday, November 20.