MVHub.com ZIP code sort
Thu Sep 6 19:51:11 EDT 2007
The ZIP code sort is now live on MVHub.com, so feel free to surf out there and have a look. The following is some background material on how ZIP code sorting works; it's a bit long-winded, so read or skip it at your leisure.

--John

When I started this a month ago, I had assumed that ZIP code information was in the public domain, and that ZIP codes corresponded roughly to geographical areas. Given that, we could download the public-domain ZIP info, calculate the center of each ZIP code, then do a little trig to calculate distances between ZIP codes. This is _roughly_ how things work.

The USPS created the Zone Improvement Plan (ZIP) codes back in the 60s to make mail delivery more efficient. ZIP codes are assigned based on a few things. The country is divided into ZIP code regions, with each region having a unique first digit (New England = 0, West Coast = 9, etc.). Inside each region, each state gets a range of ZIPs (MA is 01000-02799), not all of which are used. Pretty obvious so far.

Each individual ZIP code, however, is defined not by a geographical area, but by its carrier routes. This makes sense for the postmen, who can say "My route goes to the end of Westford Street," but when you have a set of streets that might look like:

    /-------/ / / / \ | \ | / ------------------

it's tough to define a unique geographic area, especially if not all the streets have addresses, or if there's a body of water involved.

The Census Bureau took on this task back in 2000, and defined ZIP Code Tabulation Areas (ZCTAs). This information is seven years old, though, and covers only regular (multi-address, non-P.O. Box) ZIP codes. The USPS has also defined areas for ZIP codes and sells this information for $50/state. Commercial companies have licensed the USPS data and sell it at much more reasonable rates (approx. $50 for the entire US). Most websites these days use this commercial data.
It's not too shocking that the Census Bureau and the USPS data don't quite match (pretty close, though), but it was news to me that Google's data doesn't always match the USPS's. For example, Google calculates the center of the Highlands neighborhood to be just southeast of Drum Hill, while Yahoo, the Census Bureau, and the USPS all put the center of the Highlands at about Stevens and Westford streets. The two locations are about a mile apart. Google Maps had a few other anomalous ZIP codes as well.

To do MVHub's ZIP code sorting, I had initially hoped to query Google Maps for a distance, then cache the distance in our database so we didn't have to query twice. Google changed their Maps API, however, so the Perl module I was using to query Google Maps (Geo::Google) broke. The Geo::Google developers (conscientious folks that they are) sent me a patch within an hour of my bug report, but I felt a bit uneasy about relying on Geo::Google (in this case, its dependency JSON::Parser) not to break on future Google API changes.

Combining that with the suggestion of perlmonks.org users that we have our own ZIP code database, we purchased ($40) a list of ZIP codes, towns, latitudes, and longitudes from zipcodedownload.com. Once we had the latitude and longitude for each ZIP code, finding the distance between two ZIPs was simply a matter of using a few lines of code from the Geo::Distance module. It was pretty straightforward to load the ZIP codes and distances into a database table, then, for each MVHub program result, query the database for the corresponding distance.

To sum up, I thought we could use Google to find ZIP information; this unexpectedly broke. Using an existing ZIP code -> latitude/longitude database was a far better choice, and of these databases, the USPS had the best data.
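
For readers who want to see the distance step spelled out, here is a minimal sketch of the approach described above. It is written in Python and is not the Perl/Geo::Distance code that runs on MVHub. It assumes a great-circle (haversine) formula, and the function names, the zip_distance table, and the sample coordinates are all made up for illustration (the coordinates are rough values, not taken from the purchased database). ZIP codes are kept as strings so that leading zeros survive.

```python
import math
import sqlite3

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def build_distance_table(conn, zip_coords):
    """Precompute and store the distance for every ordered pair of ZIP codes.

    zip_coords maps a ZIP string to a (latitude, longitude) tuple.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zip_distance ("
        "from_zip TEXT, to_zip TEXT, miles REAL, "
        "PRIMARY KEY (from_zip, to_zip))"
    )
    rows = [
        (a, b, haversine_miles(*zip_coords[a], *zip_coords[b]))
        for a in zip_coords
        for b in zip_coords
    ]
    conn.executemany("INSERT OR REPLACE INTO zip_distance VALUES (?, ?, ?)", rows)
    conn.commit()


def sort_by_distance(conn, user_zip, programs):
    """Sort (name, zip) pairs nearest-first; unknown ZIPs go last."""
    def miles(program):
        row = conn.execute(
            "SELECT miles FROM zip_distance WHERE from_zip = ? AND to_zip = ?",
            (user_zip, program[1]),
        ).fetchone()
        return row[0] if row else float("inf")

    return sorted(programs, key=miles)


if __name__ == "__main__":
    # Hypothetical, approximate coordinates for illustration only.
    coords = {
        "01851": (42.63, -71.33),
        "01852": (42.63, -71.30),
        "02108": (42.36, -71.06),
    }
    conn = sqlite3.connect(":memory:")
    build_distance_table(conn, coords)
    programs = [("Program A", "02108"), ("Program B", "01852")]
    print(sort_by_distance(conn, "01851", programs))
```

Storing every pair makes the lookup a single indexed query per result, at the cost of a table that grows with the square of the number of ZIP codes; for a regional site that covers only a few dozen ZIPs, that trade-off is usually acceptable.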