CIS 245: Open Data Sources

What is Open Data?

According to Opendefinition.org:

"A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike."

Open data is often associated with government datasets and APIs, but it can apply to any type of data created by anyone for any purpose. This data could be used to help answer research questions, such as "Does increased state funding improve school performance?" It could also be used to drive third-party apps, such as transit-tracking apps built on SEPTA's API.

This guide provides a range of data resources along with considerations for when you are searching for or using data.

(Definition from Univ. of British Columbia libraries. Image by Open Knowledge on Flickr, Creative Commons license CC BY 2.0.)

Evaluating Data

As with any research project, you must evaluate the sources you use.

Dates - make sure all the data you collect is dated.

  • Is it current enough for your needs?
  • Do you need a specific time period?

Author or Affiliated Organization

  • Who is the author or affiliated organization?
  • Are they considered an expert in the field?

Purpose

  • Why was the information published?
  • Is it to inform or persuade the reader?
  • Who is the intended audience?

Reliability

  • Are there references given for the information?
  • Is it an appropriate source for your assignment?

Data Limitations

Here are some limitations you may encounter as you search for data...

  • Data may not be available for the area you desire. For example: Not every police department has crime statistics at the street level. Cases of influenza are generally reported at the county level. You may have to adjust the scope of your research accordingly or find different sources.
  • Keep an eye out for mismatched geographies when comparing data sets. Business data may be available at the ZIP code level while the demographics are at the municipality or census block level. Similarly, some regions, including legislative districts and ZIP codes, criss-cross governmental boundaries.
  • Data may not be available for the time span desired. The data may only be collected every 5 to 10 years. Even then it can take time to process the data. For example, U.S. Census data is released over a period of 2 to 3 years.
  • Data may not be publicly available due to privacy, security or licensing issues. For example, U.S. Census records that include individual names are released to the public only 72 years after the census is taken; the 1950 census became available in 2022.
  • Consider the source. It is possible to "cherry-pick" data to support a biased argument, so look for data from a reputable source.
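
The mismatched-geography limitation above can be illustrated with a small Python sketch. It rolls ZIP-code-level business counts up to the municipality level so they can be compared with municipality-level population figures. All values, place names, and the ZIP-to-municipality lookup (often called a crosswalk) are made up for illustration. The sketch also assumes each ZIP code falls entirely within one municipality, which is often not true of real data.

```python
from collections import defaultdict

# Hypothetical business counts reported at the ZIP code level.
business_by_zip = {"19001": 120, "19002": 80, "19003": 45}

# Hypothetical population reported at the municipality level.
population_by_municipality = {"Town A": 20000, "Town B": 9000}

# Hypothetical crosswalk assigning each ZIP code to one municipality.
# Real ZIP codes can span several municipalities, so a real crosswalk
# would need allocation weights; this sketch assumes a clean assignment.
zip_to_municipality = {"19001": "Town A", "19002": "Town A", "19003": "Town B"}


def aggregate_to_municipality(values_by_zip, crosswalk):
    """Sum ZIP-level values into municipality-level totals.

    Returns the totals plus a list of ZIP codes missing from the crosswalk,
    so gaps in coverage are visible instead of silently dropped.
    """
    totals = defaultdict(int)
    unmatched = []
    for zip_code, value in values_by_zip.items():
        municipality = crosswalk.get(zip_code)
        if municipality is None:
            unmatched.append(zip_code)
            continue
        totals[municipality] += value
    return dict(totals), unmatched


business_by_municipality, unmatched_zips = aggregate_to_municipality(
    business_by_zip, zip_to_municipality
)

# Businesses per 1,000 residents, now that both datasets share a geography.
rate_per_1000 = {
    name: 1000 * business_by_municipality.get(name, 0) / population
    for name, population in population_by_municipality.items()
}

print(business_by_municipality)
print(rate_per_1000)
```

Reporting the unmatched ZIP codes separately is a deliberate choice: if part of the study area is missing from the lookup, that is a reason to narrow the scope of the research or find another source.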

Data Creators

Once you have chosen a research question, first consider the kinds of data that could answer your question and who might create this data. Here are examples of data creators.

  1. Governments and their agencies
    • International: UN, World Bank
    • Federal: Census, USGS (geography-geology), Bureau of Labor Statistics, EPA (environmental), Department of Agriculture, NOAA (weather), Fish and Wildlife Service.
    • State and Local: PennDOT and BMV (transportation), Department of Health, Game Commission (wildlife), Gaming Commission (gambling), police agencies.
  2. Quasi-Governmental Agencies
    • Watershed protection agencies
    • Transit, turnpike and port authorities
  3. Non-profits/think tanks/interest groups
    • Medical advocacy associations, Pew Research, PETA (animal cruelty), Brookings and Cato institutes (political-based research groups)
  4. Universities/Colleges
    • Research centers and agricultural outreach services.
  5. Commercial entities
    • Yelp, Google, movie box office