
Screen scraping census baby name data with Node.js, cheerio (jQuery) and promises

--> My Source Code <--

Goal

Screen scrape census baby names data for the 51 states and produce a single CSV file.

The data is published as HTML tables on pages such as

censusStateData2012

The script must convert the table data from all 51 states (51 requests) into a single CSV file made up of lines of the form

state,rank,gender,maleName,maleBirths
state,rank,gender,femaleName,femaleBirths

Tools

Design

  • Fire 51 POST requests, one per state.
  • In each request, parse the HTML to produce an array of JavaScript objects that represent the data collected.
  • Once all requests are done, loop through all of the data collected and generate the csv file.
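
As a rough illustration of the third step, the sketch below turns collected rows into CSV lines in the column order listed under Goal. The shape of the collected array, the 'M'/'F' gender codes, and the placeholder names and counts are all assumptions made for this example, not taken from my source code.

```javascript
'use strict';

// Hypothetical shape of the data collected in step 2: one entry per state,
// each holding the parsed table rows. Names and counts are placeholders.
var collected = [
  {
    state: 'AK',
    rows: [
      { rank: 1, maleName: 'NameA', maleBirths: 50, femaleName: 'NameB', femaleBirths: 40 }
    ]
  }
];

// Each table row yields two CSV lines, one per gender.
// 'M' and 'F' are assumed gender codes.
function toCsv(allStates) {
  var lines = [];
  allStates.forEach(function (entry) {
    entry.rows.forEach(function (row) {
      lines.push([entry.state, row.rank, 'M', row.maleName, row.maleBirths].join(','));
      lines.push([entry.state, row.rank, 'F', row.femaleName, row.femaleBirths].join(','));
    });
  });
  return lines.join('\n');
}

console.log(toCsv(collected));
```

The sketch assumes no field contains a comma. If the scraped birth counts include thousands separators, they would need to be stripped or the field quoted before joining.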

Main Concepts

This blog entry does a fantastic job explaining the programming used in my code.

Lexical Scope/IIFE/Closure

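You need to understand lexical scope in JavaScript to see why async calls inside a loop misbehave unless you wrap them in an IIFE. A variable declared with var is function-scoped, so every callback created in the loop shares the same variable, and by the time the callbacks run it holds the value from the last iteration. This is explained in detail in the blog entry.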

For example, the following code will NOT work correctly

for(var i = 0; i < states.length; i++) {
	var stateCode = states[i];
	request(url + stateCode, function(error, response, body){
		// wrong: every callback shares one stateCode, which holds the last state by the time this runs
	});
}

Wrapping the call in an IIFE gives each iteration its own copy of stateCode:

for(var i = 0; i < states.length; i++) {
	var stateCode = states[i];
	(function(stateCode) {
		request(url + stateCode, function(error, response, body){
			// right: stateCode is the value passed to this iteration's IIFE
		});
	})(stateCode);
}
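
To see the difference without making any network calls, here is a minimal sketch that runs both loop styles against a stand-in for request. The fakeRequest and runQueued helpers are hypothetical: they just queue the callbacks and run them after the loop has finished, which is roughly when a real HTTP callback would fire.

```javascript
'use strict';

var states = ['AK', 'AL', 'AR'];

// Hypothetical stand-in for request(): it stores the callback and runs it
// later, after the loop that issued the calls has already finished.
var queue = [];
function fakeRequest(url, callback) {
  queue.push(callback);
}
function runQueued() {
  while (queue.length > 0) {
    queue.shift()();
  }
}

// Broken: every callback shares the single function-scoped stateCode.
function withoutIife() {
  var seen = [];
  for (var i = 0; i < states.length; i++) {
    var stateCode = states[i];
    fakeRequest('/names/' + stateCode, function () {
      seen.push(stateCode);
    });
  }
  runQueued();
  return seen;
}

// Fixed: the IIFE parameter gives each callback its own stateCode.
function withIife() {
  var seen = [];
  for (var i = 0; i < states.length; i++) {
    (function (stateCode) {
      fakeRequest('/names/' + stateCode, function () {
        seen.push(stateCode);
      });
    })(states[i]);
  }
  runQueued();
  return seen;
}

console.log(withoutIife()); // [ 'AR', 'AR', 'AR' ]
console.log(withIife());    // [ 'AK', 'AL', 'AR' ]
```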

Promises

Promises are used to collect the data from all of the async calls.
The concept is similar to jQuery promises, for example:

// array of promises
var ajaxCalls = [];

// unknown number of ajaxCalls
ajaxCalls.push(
	$.ajax(....)
);

var group = $.when.apply($, ajaxCalls);
group.done(function() {
    // all ajax calls are done
});

Using promised-io, the syntax is:

// lib
var promiseIo = require("promised-io/promise");
var Deferred = promiseIo.Deferred;

var allStates = []; // array of promises
for(var i = 0; i < states.length; i++) {
	allStates[i] = new Deferred();
}

// runs once all of the async calls have resolved their promises
var group = promiseIo.all(allStates);
group.then(function(array){
	for(var i = 0; i < array.length; i++) {
		// array[i] contains the value returned by the promise
	}
});

// somewhere in an async call
... { ....
allStates[i].resolve('my promise value to return');
... }
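
For comparison, the same collect-everything pattern can be sketched with the native Promise built into current Node.js versions. This is a swapped-in alternative, not what my source code uses. The names fakeRequest, fetchState and fetchAllStates are hypothetical, and fakeRequest only stands in for a real HTTP call.

```javascript
'use strict';

// Hypothetical stand-in for an HTTP call: calls back on a later tick.
function fakeRequest(stateCode, callback) {
  setTimeout(function () {
    callback(null, '<html>' + stateCode + '</html>');
  }, 0);
}

// Wrap one callback-style call in a native Promise.
function fetchState(stateCode) {
  return new Promise(function (resolve, reject) {
    fakeRequest(stateCode, function (error, body) {
      if (error) {
        reject(error);
        return;
      }
      resolve({ state: stateCode, body: body });
    });
  });
}

// Promise.all resolves with the values in the same order as the input
// array, regardless of which call finishes first.
function fetchAllStates(stateCodes) {
  return Promise.all(stateCodes.map(fetchState));
}

fetchAllStates(['AK', 'AL']).then(function (results) {
  results.forEach(function (result) {
    console.log(result.state, result.body.length);
  });
});
```

Because each promise is created inside fetchState, the state code is captured as a function argument and no separate IIFE is needed.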

Execution

  • Install Node.js first
  • Then download or clone my source code
  • Follow the README instructions for
    npm install ...
    
  • Open censusBabyNamesState.js and update any variables
  • Run
    > node censusBabyNamesState.js
    

    and your CSV file will be generated.
