Module 8: Introduction to Data Journalism

Gathering the Data

Gathering Data

Accessing Data

So, you're all ready to get started on your first data journalism project. What now? First of all, you need some data.

You need to learn how to find data on the web, how to request it using freedom of information laws, how to use screen-scraping to gather data from unstructured sources, and how to use crowd-sourcing to collect your own datasets from your readers.

You also need to know what the law says about republishing datasets, and how to use simple legal tools to let others reuse your data.

Straight to the Source

Another trick for getting hold of data held by a public entity is to go directly to the data holder, rather than going through the public affairs office or filing an FOIA request.

Crafting an FOIA or public records request will start the wheels turning, but slowly.

However, if you successfully reach the person who handles data for that organization, you can ask questions about what data they keep on the subject and how they keep it.

Streamlining Your Search

While they may not always be easy to find, many databases on the web are indexed by search engines, whether the publisher intended this or not. Here are a few tips:

• When searching for data, make sure that you include search terms relating to the content as well as the format or source of the data

• Another popular trick is not to search for content directly, but for places where bulk data may be available

• You can also search by part of a URL. Googling for "inurl:downloads filetype:xls" will try to find all Excel files that have "downloads" in their web address.
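The search tips above can be sketched in a few lines of Python. This is only an illustration: the function names build_search_query and search_url are made up for this example, and the search address used is an assumption about where you would send the query.

```python
from urllib.parse import urlencode


def build_search_query(terms, inurl=None, filetype=None):
    """Combine content terms with the inurl: and filetype: operators."""
    parts = list(terms)
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Encode the query so it can be pasted into a browser address bar."""
    return f"{base}?{urlencode({'q': query})}"


# The example from the tip above: Excel files with "downloads" in the URL.
query = build_search_query([], inurl="downloads", filetype="xls")
print(query)
print(search_url(query))
```

Adding content terms, for example build_search_query(["school", "budget"], filetype="xls"), combines the first and third tips in one query.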


Data Sites and Services

Over the last few years, a number of dedicated data portals, data hubs, and other data sites have appeared on the web. These are a good place to get acquainted with the kinds of data that are out there. For starters, you might like to take a look at:

• The Data Hub: A community-driven resource that makes it easy to find, share, and reuse openly available sources of data

• ScraperWiki: An online tool for extracting useful bits of data

• UN Data Portal: Provides high-level indicators for all countries

• Research Data: National and disciplinary aggregators of research data

You can also check out the areas listed below.

Government IT

Understanding the technical and administrative context in which governments maintain their information is often helpful when trying to access data.

Find government organizational charts and look for departments or units with a cross-cutting function (e.g., reporting, IT services), then explore their websites. A lot of data is kept in multiple departments, and while one department may treat a particular database as its crown jewels, another may give it to you freely.

Look out for dynamic infographics on government sites. These are often powered by structured data sources (APIs) that can be used independently (e.g., flight tracking applets, weather forecast Java apps).


Mailing Lists

Mailing lists combine the wisdom of a whole community on a particular topic. For data journalists, the Data Driven Journalism List and the NICAR-L list are excellent starting points. Both of these lists are filled with data journalists and Computer-Assisted Reporting (CAR) geeks, who work on all kinds of projects.

Chances are that someone may have done a story like yours, and may have an idea of where to start, if not a link to the data itself. You could also try searching for mailing lists on the topic or in the region that you are interested in.


Ask an Expert

Professors, public servants, and industry people often know where to look. Call them. Email them. Accost them at events. Show up at their office. Ask nicely. "I'm doing a story on X. Where would I find this? Do you know who has this?"

Join Hacks/Hackers, a rapidly expanding international grassroots journalism organization with dozens of chapters and thousands of members across four continents. Its mission is to create a network of journalists (hacks) and technologists (hackers) who rethink the future of news and information. With such a broad network, you stand a strong chance of someone knowing where to look for the thing you seek.


Ask a Forum

Search for existing answers or ask a question at Get The Data or Quora.

Get The Data is a Q&A site where you can ask your data-related questions, including where to find data relating to a particular issue, how to query or retrieve a particular data source, what tools to use to explore a dataset in a visual way, how to cleanse data, or how to get it into a format you can work with.

If you've looked around and still can't get hold of the data you need, then you may wish to file a formal request.

Freedom of Information (FOI)

If you believe that a government body has the data you need, a Freedom of Information request may be your best tool.

Before you make a Freedom of Information (FOI) request, you should check to see if the data you are looking for is already available or has already been requested by others.

Plan Ahead

You will save time by submitting a request at the beginning of your research and carrying out other investigations in parallel.

Be prepared for delay: sometimes public bodies take a while to process requests, so it is better to expect this.

Keep it focused. A request for information held by one part of a public authority will be answered more quickly than one that requires a search across the whole authority.

Submit multiple requests. There is nothing to stop you from submitting the request to two, three, or more bodies at the same time.

Rules and Rights

Before you start submitting a request, check the rules about fees for either submitting requests or receiving information.

Find out what your rights are before you begin, so you know where you stand and what the public authorities are and are not obliged to do.

Keep a record. Make your request in writing and save a copy, so that in the future you are able to demonstrate that your request was sent.

Be specific. Before you submit your request, think: is it in any way ambiguous?

Freedom of Information (FOI)

If you want to analyze, explore, or manipulate data using a computer, then you should explicitly ask for data in an electronic, machine-readable format.

You may wish to clarify this by specifying, for example, that you require budgetary information in a format "suitable for analysis with accounting software." You may also wish to explicitly ask for information in disaggregated or granular form.

Exemption from FOI Laws

You may wish to find out about NGOs, private companies, religious organizations, and other organizations that are not required to release documents under FOI laws.

However, it is possible to find information about them by asking public bodies, which are covered by FOI laws. For example, you could ask a government department or ministry whether it has funded or dealt with a specific private company or NGO, and request supporting documents. If you need further help with making your FOI request, you can also consult the Legal Leaks toolkit for journalists.

Getting Data from the Web

When you have found data on the Web, but no download options are available and copy-paste has failed you, there may still be a way to get the data out. For example you can:

• Get data from web-based APIs
• Extract data from PDFs
• Screen scrape websites

With all those great technical options, don't forget the simple options: often it is worth spending some time searching for a file with machine-readable data, or calling the institution that holds the data you want.
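As a rough sketch of the first option, getting data from a web-based API usually means building a request address with query parameters and then parsing the structured response. The endpoint, parameter names, and response layout below are hypothetical and exist only for illustration; a real API will document its own.

```python
import json
from urllib.parse import urlencode


def build_api_url(base_url, **params):
    """Append query parameters to a base endpoint address."""
    return f"{base_url}?{urlencode(params)}"


def parse_records(json_text, key="results"):
    """Turn a JSON response body into a list of dictionaries."""
    payload = json.loads(json_text)
    return payload.get(key, [])


# Hypothetical endpoint: not a real service.
url = build_api_url("https://api.example.org/spending", year=2012, format="json")

# Made-up response body standing in for what such an API might return.
sample_response = (
    '{"results": ['
    '{"department": "Health", "amount": 1200}, '
    '{"department": "Transport", "amount": 800}]}'
)

records = parse_records(sample_response)
total = sum(record["amount"] for record in records)
print(url)
print(total)
```

In real use, the response text would come from fetching the URL (for example with urllib.request.urlopen) rather than from a hard-coded string.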

Machine-Readable Data

The goal of most retrieval methods is to get access to machine-readable data. Machine-readable data is created for processing by a computer, rather than for presentation to a human user. The structure of such data reflects the information it contains, not the way that information is eventually displayed.

Examples of easily machine-readable formats include CSV, XML, JSON, and Excel files, while formats like Word documents, HTML pages, and PDF files are more concerned with the visual layout of the information.

PDF, for example, is a language that talks directly to your printer; it's concerned with the position of lines and dots, rather than distinguishable characters.
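To illustrate why machine-readable formats matter, here is a minimal sketch that reads the same small table from CSV and from JSON using only Python's standard library. The region names and figures are invented sample values, not real data.

```python
import csv
import io
import json

# Invented sample data in two machine-readable formats.
csv_text = "region,year,population\nRegion A,2010,100\nRegion B,2010,250\n"
json_text = (
    '[{"region": "Region A", "year": 2010, "population": 100}, '
    '{"region": "Region B", "year": 2010, "population": 250}]'
)

# Each format maps directly onto rows and named fields.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# CSV values arrive as strings, so convert them before doing arithmetic.
csv_total = sum(int(row["population"]) for row in csv_rows)
json_total = sum(row["population"] for row in json_rows)

print(csv_total, json_total)
```

Neither format says anything about fonts or page positions; the structure describes the information itself, which is what makes the totals a one-line calculation.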

Getting Data from the Web

Everyone has done this: you go to a website, see an interesting table, and try to copy it over to Excel so you can add some numbers up or store it for later.

Yet this often does not really work, or the information you want is spread across a large number of websites.

The advantage of scraping is that you can do it with virtually any website, from weather forecasts to government spending, even if that site does not have an API for raw data access.

Scraping Limits

There are, of course, limits to what can be scraped. Some factors that make it harder to scrape a site include:

• Badly formatted HTML code with little or no structural information
• Authentication systems that are supposed to prevent automatic access
• Session-based systems that use browser cookies
• A lack of complete item listings and possibilities for wildcard search
• Blocking of bulk access by the server administrators

Legal barriers are another set of limitations: some countries recognize database rights, which may limit your right to reuse information that has been published online.



Web Scraper

Web scrapers are usually small pieces of code written in a programming language such as Python, Ruby, or PHP. Choosing the right language is largely a question of which community you have access to: if there is someone in your newsroom or city already working with one of these languages, then it makes sense to adopt the same language.

While some of the point-and-click scraping tools mentioned before may be helpful to get started, the real complexity involved in scraping a website is in addressing the right pages and the right elements within these pages to extract the desired information. These tasks aren't about programming, but about understanding the structure of the website and database.

When displaying a website, your browser will make use of two technologies: HTTP, to communicate with the server; and HTML, the language in which websites are composed.

Any HTML page is structured as a hierarchy of boxes (which are defined by HTML "tags"). There are many types of tags that perform different functions.

Tags can also have additional properties, such as unique identifiers, and can belong to groups (called classes) that make it possible to target and capture individual elements within a document.

Element Types

To scrape web pages, you'll need to learn a bit about the different types of elements that can be in an HTML document. For example:

• A <table> element wraps a whole table
• <tr> (table row) elements hold its rows
• <td> (table data) elements hold each cell

The most common element type you will encounter is <div>, which can basically mean any block of content. The easiest way to get a feel for these elements is by using the developer tools in your browser: they allow you to hover over any part of a web page and see what the underlying code is.
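As a minimal sketch of how a scraper uses the table, row, and cell elements described above, the following Python example walks through a small, made-up HTML fragment and collects each row's cells. The class name TableParser and the sample page are invented for illustration; it also picks up header cells (th), which sit alongside ordinary data cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page fragment standing in for a real website.
sample_html = """
<div id="budget">
  <table>
    <tr><th>Department</th><th>Amount</th></tr>
    <tr><td>Health</td><td>1200</td></tr>
    <tr><td>Transport</td><td>800</td></tr>
  </table>
</div>
"""

parser = TableParser()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

Real pages are rarely this tidy, which is why inspecting the structure in your browser first is usually the bigger part of the job.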

Images, Web Pages and Videos

Sometimes you're interested in the activity that's surrounding a particular story, rather than an entire website. The tools below give you different angles on how people are reading, responding to, copying, and sharing content on the web.

Sometimes you need to know the source of an image. A tool called TinEye offers a reverse image search: it takes the image you have and finds other pictures on the web that look similar.

It works even when a copy has been cropped, distorted, or compressed.

Bit.ly

Turn to bit.ly when you want to know how people are sharing a particular link with each other. To use it, enter the URL you're interested in. Then click on the Info Page+ link. That takes you to the full statistics page (choose "aggregate bit.ly link" first if you're signed in to the service).

This will give you an idea of how popular the page is, including activity on Facebook and Twitter, and below that you'll see public conversations about the link provided by backtype.com. This combination of traffic data and conversations is very helpful when I'm trying to understand why a site or page is popular, and who exactly its fans are.

Internet Archive's Wayback Machine

If you need to know how a particular page has changed over a longer time period, like months or years, the Internet Archive runs a service called The Wayback Machine that periodically takes snapshots of the most popular pages on the web.

Go to the site, enter the link you want to research, and if it has any copies, it will show you a calendar so you can pick the time you'd like to examine. It will then present a version of the page roughly as it was at that point. It will often be missing styling or images, but it's usually enough to understand what the focus of that page's content was then.
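If you already know roughly which date you want, you can also build the snapshot address yourself. The sketch below assumes the Wayback Machine's address pattern of web.archive.org/web/ followed by a timestamp and the original URL; the helper name wayback_url is made up for this example, and the service decides which stored snapshot is closest to the time you ask for.

```python
def wayback_url(page_url, timestamp):
    """Build a Wayback Machine address for a page near a given time.

    timestamp is a string of digits in YYYYMMDDhhmmss order; a shorter
    prefix such as YYYYMMDD is assumed to be accepted as well.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 20100615")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


# Example: look for a copy of a page from around mid-June 2010.
print(wayback_url("http://example.com/", "20100615"))
```

Opening the resulting address in a browser should lead to the stored copy nearest that date, if one exists.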

View Source

It's a bit of a long shot, but developers often leave comments or other clues in the HTML code that underlies any page. It will be on different menus depending on your browser, but there's always a "View source" option that will let you browse the raw HTML.

You don't need to understand what the machine-readable parts mean; just keep an eye out for the pieces of text that are often scattered amongst them. Even if they're just copyright notices or mentions of the authors' names, these can often give important clues about the creation and purpose of the page.
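On a long page, picking out developer comments by eye can be tedious. As a small sketch, Python's standard HTML parser can list them for you. The class name CommentCollector and the sample page are invented for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Gather the text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        text = data.strip()
        if text:
            self.comments.append(text)


# Made-up page source standing in for what "View source" would show.
sample_source = """<html><head><!-- Template by J. Doe, v2 --></head>
<body><p>Hello</p><!-- TODO: remove old figures --></body></html>"""

collector = CommentCollector()
collector.feed(sample_source)
comments = collector.comments
print(comments)
```

To use it on a real page, save the source from your browser and feed the file's text to the collector instead of the sample string.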

Google Cache

When a page becomes controversial, the publishers may take it down or alter it without acknowledgment. If you suspect you're running into this problem, the first place to turn is Google's cache of the page as it was when it did its last crawl. The frequency of crawls is constantly increasing, so you'll have the most luck if you try this within a few hours of the suspected changes.

Enter the target URL in Google's search box, and then click the triple arrow on the right of the result for that page. A graphical preview should appear, and if you're lucky, there will be a small "Cache" link at the top of it. Click that to see Google's snapshot of the page. You may want to take a screenshot or copy-paste any relevant content you do find, since it may be invalidated at any time by a subsequent crawl.

Using and Sharing Data

Obtaining data has never been easier. Before the widespread publishing of data on the Web, even if you had identified a data-set you needed, you'd need to ask whoever had a copy to make it accessible to you.

Now, your computer asks their computer to send it a copy.

This is conceptually similar, but you have a copy right now, and they (the creator or publisher) haven't done anything and probably have no idea that you have downloaded a copy.

Permission

A publisher of a database can remove restrictions by granting permission in advance. This can be done by releasing the database under a public license or public domain dedication, just as many programmers release their code under a free and open source license so that others can build on it.

There are lots of reasons for opening up your data. For example, your audience might create new visualizations or applications with it that you can link to. Datasets can be combined with other datasets to give you and your readers greater insight into a topic.

Things that others do with your data might give you leads for new stories, or ideas for new projects.

Restrictions

What if you intend to publish not just your analysis, including some facts or data points, but also the datasets or databases you used (and perhaps added to) in conducting your analysis? If you're using data collected by some other entity, there could be a hitch.

If you use or distribute a work without the copyright holder's permission, the copyright holder could force you to stop, unless the work is in the public domain or your use is covered by exceptions and limitations. Although facts are free, collections of facts can be restricted: a database can still be subject to copyright, the same as a creative work.