The course presentation consists of 5 modules, covering a range of topics,
from a government data overview to a discussion of Javascript visualization
principles.
- All course slides -- A complete deck of all ~350 slides presented during the course. (Big!)
- Module 1: Government Data: Its Power and Flaws -- This module gives a brief history and overview of government data, and discusses some critical problems with the current open government data model.
- Module 2: Data Wrangling Techniques -- This module serves as an introduction to several important modern data wrangling techniques, including web scraping, HTML parsing, and data cleaning. It also contains a brief introduction to using MongoDB.
- Module 3: The GovData Platform -- This module describes the GovData Platform, both in terms of the high-level principles by which it addresses some of the current limitations of government data and at the detailed software engineering level.
- Module 4: Introduction to Javascript Visualizations -- This module is an introduction to modern Javascript visualization techniques.
- Course Coding Projects -- These slides show several of the data parsing and visualization projects worked on in the hands-on coding sessions.
Description
Over the past several years, there has been an explosion of interest in
increasing transparent access to all types of government data. However, despite the large amount of
interest, there have remained a number of technical and organizational
challenges in effectively realizing key open government goals.
In response to these challenges, a joint team from Harvard and MIT has
developed the GovData Platform. The GovData Platform makes
advances in the online presentation of large datasets, and is
designed to be a game-changing contribution in the Open Government arena.
The one-week GovData winter course aims to develop participants' data parsing
and visualization abilities and offers a series of hands-on coding experiences in which
participants can get directly involved in GovData development. By the end
of the week, we aim to have assembled a motivated team of contributors who
will help launch the initiative going forward.
5-day Course Plan
Day 1 -- Government Data 101: A tour of the Open Government Initiative & GovData platform.
Day 2 -- High-powered data APIs with Python, MongoDB, GeoDjango, and Apache Solr.
Day 3 -- Hands-On Coding Experience: Parsers
Day 4 -- Interactive Visualizations with Javascript, HTML5 and processingJS.
Day 5 -- Hands-On Coding Experience: Visualizations
Dates & Times
The course will be offered twice:
@ MIT:
January 10-14
1-5pm daily
NOTE ROOM CHANGE:
Media Lab, Building E14, Rm 633 (Mon & Fri), Rm 240 (Tues-Wed) and Rm 525 (Thurs).
@ Harvard:
January 17-21
1-5pm daily
CGIS South, Knafel Building, Rooms K262 (Mon-Wed) & K354 (Thu-Fri).
Prerequisites
At least a year of general programming experience is required. Familiarity
with the Python programming language is STRONGLY recommended, and at least
some understanding of Javascript is helpful.
Instructors
- Daniel Yamins (PhD Applied Math, Harvard, 2008) yamins@fas.harvard.edu
- Doug Fritz (PhD Candidate, MIT Media Lab) doug@media.mit.edu
Syllabus
The course will take place over 5 days, with each session starting at 1 PM
and lasting approximately 4 hours.
- Government Data 101:
- What the data sets are like, how much there is, and where they are.
- A tour of the federal data "hierarchy."
- The Open Government Initiative
- A brief history of the Open Government movement.
- data.gov and the Federal CTO/CIO initiatives.
- State and municipal efforts
- Non-governmental projects
- The Harvard-MIT GovData Project: Principles & Goals
- The GovData Platform:
- The Data Pipeline "Back-backend."
- MongoDB backend: a flexible document store.
- Solr Slice Indexing: schema for natural language search.
- Javascript Framework for Visualization Frontends.
- Python Data Parser Scripting
- Python basics.
- Writing effective incremental data parsers with the StarFlow Workflow Management System.
- Plugging in to the GovData Parser Pipeline
- MongoDB
- MongoDB: A flexible, scalable, document store
- Mongo's Rich Query Language
- Mixed-type data arrays as documents
- Government Data in MongoDB, including geographic/temporal standardization
- The GovData /get API
- The GovData /sources API
- Solr Slice Indexing
- Basics of the Apache Solr framework
- Using Solr to Index MongoDB slices: unified natural language queries across databases
- The GovData /find API
- GeoDjango
- Standard US geographic boundary layers: a "US political ultra-gazetteer"
- PostGIS and GeoDjango Basics
- The GovData /geo API
- Integrating /geo with /get and /find.
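As a rough illustration of the "rich query language" and "mixed-type documents" items above, the following Python sketch evaluates a small subset of MongoDB-style query documents against in-memory records. It is not pymongo and not the GovData /get API: the `matches` helper, the field names, and the placeholder values are all assumptions made for this example. Only the operator names ($gt, $gte, $lt, $lte, $in) are taken from MongoDB's query syntax.

```python
from typing import Any

# A small subset of MongoDB's comparison operators, reimplemented here
# purely for illustration. Missing fields (None) never satisfy a range test.
OPERATORS = {
    "$gt": lambda value, arg: value is not None and value > arg,
    "$gte": lambda value, arg: value is not None and value >= arg,
    "$lt": lambda value, arg: value is not None and value < arg,
    "$lte": lambda value, arg: value is not None and value <= arg,
    "$in": lambda value, arg: value in arg,
}


def matches(document: dict[str, Any], query: dict[str, Any]) -> bool:
    """Return True if the document satisfies every condition in the query.

    Simplification: any dict-valued condition is treated as a set of
    operators, so exact sub-document matching is not supported here.
    """
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2009}}
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. {"state": "MA"}, means equality
            return False
    return True


# Hypothetical records with placeholder values. Documents in one collection
# need not share the same fields (note the extra "note" field).
records = [
    {"state": "MA", "year": 2008, "series": "example", "value": 1.0},
    {"state": "MA", "year": 2009, "series": "example", "value": 2.0, "note": "revised"},
    {"state": "NY", "year": 2009, "series": "example", "value": 3.0},
]

query = {"state": "MA", "year": {"$gte": 2009}}
results = [record for record in records if matches(record, query)]
```

The point of the sketch is that the same query document works across records with differing fields, which is what makes a document store convenient for heterogeneous government datasets.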
Participants will split into small groups to get hands-on experience building
example parsers. Possible data sources that we will work with include:
- Environmental Protection Agency
- US Census Bureau
- Bureau of Economic Analysis
- Bureau of Labor Statistics
- Bureau of Justice Statistics
- The (Mostly) Pure Javascript Approach
- Putting all the interactive rendering in JS
- Putting just the application logic, if any, in a front-end server
- Getting all data via flexible RESTful APIs
- Tools that make Javascript not suck:
- jQuery -- DOM manipulation and event handling
- Underscore.js -- Improved data structures, operations, and introspection.
- requireJS -- Module scoping, dependency management, and compression.
- jQuery.address -- deep linking, browsability & crawlability
- Tools for Javascript visualization:
- jQuery.ui -- a widget framework.
- HTML5 Canvas -- Bitmap graphics.
- HTML5 SVG -- vector graphics.
- processingJS and Raphael -- HTML5 libraries.
- The GovData Frontend:
- The Search Page -- visualizing the /find API.
- The Show Page -- visualizing the /get API (that is, the data itself).
Participants will split into small groups, each working on one aspect of a
visualization. The goal will be to create a dynamic hierarchical
visualization for directed browsing of all the federal data holdings, based
on the GovData /sources API. The visualization may include:
- dynamic tree clustering
- dendritic maps
- context-dependent filters
- sparkline-based quick data views
- natural-language autocompletion
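To give a feel for the "sparkline-based quick data views" item above, here is a minimal sketch of a text sparkline. It is written in Python for brevity, although the course frontend work is in Javascript. The `sparkline` function and its scaling rule are assumptions made for this example, not part of the GovData codebase.

```python
# Unicode block characters, from lowest to highest bar.
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values: list[float]) -> str:
    """Render a sequence of numbers as a one-line bar chart."""
    if not values:
        return ""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A flat series is drawn with the lowest bar throughout.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    # Scale each value linearly onto a bar index between 0 and top.
    return "".join(BARS[int((v - low) * top / span)] for v in values)


print(sparkline([1, 2, 3, 4, 5, 6, 7, 8]))  # prints ▁▂▃▄▅▆▇█
```

A view like this could sit next to each dataset in a browsing interface to hint at the shape of a time series before the full data is loaded.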
The GovData Project: A Platform for Next-Generation Open Government
Over the past several years, there has been an explosion of interest in
increasing transparent access to all types of government data, spurred in
part by Open Government initiatives from a variety of Federal, state and
local agencies. Despite the large amount of interest, however, there remain
a number of technical and organizational challenges to effectively realizing
key open government goals.
In response to these challenges, a joint team from Harvard and MIT has
developed the GovData Platform, a cutting-edge technology for capturing,
organizing and presenting government data on the web. This platform makes
major advances in the online presentation of large datasets, and is designed
to be a game-changing contribution in the Open Government arena.
The fundamental innovation is that GovData actually acquires and
hosts ALL the data locally.
Existing government data portals (like data.gov and data.gov.uk)
provide data catalogs: compendiums of web links to other, more area-specific
government data portals. Though the metadata in these catalogs is standardized,
the individual linked-to data sites are designed and maintained completely
independently, and the underlying data is not standardized across catalog
entries. These efforts therefore cannot support unified APIs for detailed
search and access to the data itself. Consequently, public-facing frontend
applications (like interactive data visualizations) must be written separately
for each dataset, requiring application designers to understand, compile, and
parse each data source one by one.
In contrast, the GovData project is a full-fledged data collection that
maintains updated versions of all the data on local servers. With a
complement of several hundred data parsing pipelines for acquiring and
processing data from a large range of data-producing federal agencies, GovData
centralizes and standardizes key pieces of Federal data holdings.
Leveraging the actual hosting of all the data, GovData is able to create
a number of key features that significantly advance Open Government goals:
- A standardized database and query language. Using the
scalable MongoDB document-based database system, all data is processed into
a lightly standardized form that supports a consistent query syntax for all
datasets. Date and location data components are strictly standardized so
that datasets can be joined along spatiotemporal axes.
- Fine-grained natural language search. Using the powerful
Solr search framework, GovData indexes all slices of all datasets in the
holdings, providing users a unified natural language search interface to hone
in on precise portions of datasets specifically relevant to their query.
- Automated incremental update and versioning. Using the
StarFlow workflow management system, GovData establishes automated incremental
updating schedules for each dataset, so that datasets remain up to date at
all times. This process also tracks versions for individual data records,
so revisions do not cause the loss of data reproducibility.
- Rich Web APIs. Building off the underlying database and
search indexes, the GovData backend provides a spectrum of powerful unified
web-based APIs for programmatic access to data search, extraction, and
manipulation.
- Cutting-edge interactive frontend. A public-facing website is
built on top of the GovData APIs, providing clean
search and visualization tools for key spatial and temporal data components.
This site is not only useful on its own terms, but also provides an
example illustrating the capabilities of the GovData APIs for third-party
application designers.
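The first feature above says that strictly standardized date and location components let datasets be joined along spatiotemporal axes. A minimal Python sketch of that idea follows. The helper names, field names, input date formats, and placeholder values are all assumptions for illustration and do not describe the actual GovData schema.

```python
from datetime import datetime


def normalize_key(location: str, when: str, date_format: str) -> tuple[str, str]:
    """Map a raw location and date string onto a shared (location, ISO date) key."""
    parsed = datetime.strptime(when, date_format).date()
    return location.strip().upper(), parsed.isoformat()


def join_on_space_time(left: dict, right: dict) -> dict:
    """Inner-join two {key: fields} mappings on the keys present in both."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


# Hypothetical rows from two sources that use different conventions for
# location and date. All values are placeholders.
source_a_rows = [
    ("ma", "01/2009", {"rate": 1.0}),
    ("ny", "01/2009", {"rate": 2.0}),
]
source_b_rows = [
    ("MA ", "2009-01", {"count": 100}),
]

# Once both sources share the same key form, the join is a dictionary lookup.
source_a = {normalize_key(loc, when, "%m/%Y"): fields for loc, when, fields in source_a_rows}
source_b = {normalize_key(loc, when, "%Y-%m"): fields for loc, when, fields in source_b_rows}

joined = join_on_space_time(source_a, source_b)
```

Here only the Massachusetts row appears in both sources, so `joined` holds a single combined record. Without the normalization step, "ma" and "MA " or "01/2009" and "2009-01" would never match.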