[Benchmarking] Using Tiger 2008 data

Sun Sep 27 16:09:13 EDT 2009

Hi Andrea,

Thanks for your thorough review of the TIGER 2008 merged dataset.  My 
comments are inline below:

Andrea Aime wrote:
> Hi,
> I've been looking a bit at the Tiger 2008 data for
> Texas that Jeff provided; here are some findings and impressions
> on the merged set (the non-merged set is only of interest to
> MapServer, I think).

True, the non-merged set is only of interest to MapServer right now 
(we'll test both for MapServer, in the hope that the numbers show 
MapServer users something).

> 
> EDGES_MERGE
> ----------------------------------------------------------------
> 
> The edges_merge.shp file contains both roads and water lines.
> The classification we had for roads is still applicable with
> the small changes Jeff already suggested on the wiki page here:
> http://wiki.osgeo.org/wiki/Texas_roads_styled
> One major difference between this set and the old one is that
> it does not contain only roads, so the style will perform a
> filtering on top of the data and display only roads (and only
> certain road classes, not all of them), whilst the old
> data set only had roads and displayed them all.

Yes this 'filtering' will be a great test of the response time from the 
mapping servers.  Usually I would pre-process the data so that each type 
of road is its own shapefile (e.g. interstates.shp, major-roads.shp), 
but I'm curious to see how we all do with having to filter these 
on-the-fly.
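
To make the two approaches concrete, here is a minimal sketch in plain 
Python. It is not MapServer or GeoServer code. The mapping of MTFCC 
codes to layer names and the sample records are assumptions made up for 
illustration; only the output names interstates and major-roads come 
from the paragraph above.

```python
from collections import defaultdict

# Assumed mapping of MTFCC codes to per-class layer names (illustrative only).
LAYER_FOR_MTFCC = {
    "S1100": "interstates",
    "S1200": "major-roads",
}


def split_by_class(features):
    """Pre-processing: group (mtfcc, geometry) records into one bucket per layer."""
    layers = defaultdict(list)
    for mtfcc, geom in features:
        layer = LAYER_FOR_MTFCC.get(mtfcc)
        if layer is not None:  # water lines and unstyled classes are dropped
            layers[layer].append(geom)
    return dict(layers)


def filter_on_the_fly(features, wanted):
    """Request time: scan every record and keep only the wanted classes."""
    return [geom for mtfcc, geom in features if mtfcc in wanted]


# Made-up records standing in for rows of the merged edges file.
sample = [
    ("S1100", "line-a"),
    ("H3010", "line-b"),
    ("S1200", "line-c"),
    ("S1100", "line-d"),
]

pre_split = split_by_class(sample)
on_the_fly = filter_on_the_fly(sample, {"S1100"})
```

With pre-split files the scan over unwanted records is paid once, at 
preparation time. With the merged file it is paid on every request 
unless the server can use an attribute index.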

> I guess this makes for an interesting comparison between
> spatial database and shapefile (and will also point out
> systems that are capable of indexing the attributes as well
> in a shapefile, provided there are any... maybe ArcGIS).
> I guess we'll want to index the mtfcc attribute in PostGIS
> to speed up searches.
> 

For the MapServer case, OGR does support attribute indexing for 
shapefiles ('ogrinfo edges_merge.shp -sql "CREATE INDEX ON edges_merge 
USING MTFCC"'), but I believe that attribute index would only help if 
we were using MapServer to query that 'MTFCC' field (Daniel, am I 
correct on this?)
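
As a rough illustration of why an attribute index matters for this kind 
of class filter, the sketch below uses SQLite from the Python standard 
library as a stand-in. It is not PostGIS, OGR, or MapServer, and the 
table layout, index name, and row counts are made up. It only shows the 
query planner switching from a full scan to an index lookup once the 
mtfcc column is indexed.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan description for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the attribute table of the merged edges layer.
conn.execute("CREATE TABLE edges_merge (id INTEGER PRIMARY KEY, mtfcc TEXT)")
conn.executemany(
    "INSERT INTO edges_merge (mtfcc) VALUES (?)",
    [("S1100",), ("S1200",), ("H3010",)] * 1000,
)

road_query = "SELECT id FROM edges_merge WHERE mtfcc = 'S1100'"

plan_before = query_plan(conn, road_query)  # no index yet: full table scan
conn.execute("CREATE INDEX idx_edges_mtfcc ON edges_merge (mtfcc)")
plan_after = query_plan(conn, road_query)   # planner can now use the index

print(plan_before)
print(plan_after)
```

Whether a given mapping server benefits in the same way depends on 
whether its data access path actually pushes the filter down to the 
index, which is the open question above.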

> POINTLM_MERGE
> ----------------------------------------------------------------
> 
> The pointlm_merge file contains point landmarks, and could
> be used to replace the gnis_names layer.
> I've tried to port the styling made by ESRI over to the
> pointlm file, but doing so resulted in losing half
> of the categories.

Are the ESRI gnis_names stylings (class, icon) posted on the wiki somewhere?

> I've also made a comparison of the data amounts and distributions,
> see here:
> 
> select count(*), mtfcc from pointlm_merge group by mtfcc;
>  count | mtfcc
> -------+-------
>      1 | C3071
>     68 | K3544
>      3 | K2181
>      3 | K2110
>     84 | K2165
>      1 | C3066
>      1 | K2182
>    214 | K1231
>    599 | K2543
>      8 | K1236
>     41 | K2561
>  11854 | C3061
>    152 | K2190
>     32 | K2582
>    100 | C3062
>     20 | K1225
>    257 | K2451
>      3 | K1237
> 
> 
> select count(*) from pointlm_merge;
>  count
> -------
>  13441
> 
> As you can see there are only 13441 points, and the vast
> majority of them are C3061, "Cul de Sac". Won't make for
> a very interesting map imho.
> 
> Compare with the gnis_map filtering over the data that
> is available in Texas:
> 
> select count(*) from gnis_names_pg where state = 'TX';
> count
> -------
>  95132
> 
> select count(*) as cnt, class from gnis_names_pg where state = 'TX' 
> group by class order by cnt;
> 
>   cnt  |         class
> -------+-----------------------
>      1 | Crater
>      1 | Bench
>      2 | Slope
>      2 | Tunnel
>      5 | Arch
>      5 | Rapids
>      6 | Plain
>      9 | Forest
>     11 | Reserve
>     17 | Woods
>     19 | Arroyo
>     19 | Harbor
>     21 | Pillar
>     27 | Beach
>     27 | Area
>     29 | Falls
>     55 | Bar
>     56 | Mine
>     57 | Post Office
>     61 | Crossing
>     63 | Range
>     65 | Basin
>     67 | Military (Historical)
>     87 | Levee
>    110 | Ridge
>    130 | Bridge
>    146 | Channel
>    157 | Gap
>    180 | Flat
>    185 | Cliff
>    231 | Swamp
>    259 | Gut
>    259 | Bend
>    261 | Island
>    262 | Civil
>    279 | Cape
>    283 | Bay
>    299 | Canal
>    504 | Trail
>    579 | Hospital
>    942 | Well
>   1052 | Tower
>   1243 | Spring
>   1294 | Oilfield
>   1757 | Airport
>   1780 | Lake
>   2117 | Summit
>   2844 | Valley
>   3795 | Building
>   4008 | Park
>   5947 | Dam
>   6016 | Cemetery
>   7980 | Locale
>   8511 | Populated Place
>   8542 | Reservoir
>   8756 | School
>  11640 | Stream
>  12072 | Church
> 
> I would say this is much more interesting, and the
> work to define a style for it has already been done.
> I suggest we ignore pointlm_merge and keep on using gnis_names
> instead.

Great comparison. Yes, I agree that we should ignore 'pointlm_merge.shp'.

> 
> If we really want to use contemporary data (the major reason why
> Jeff gathered the TIGER 2008 set, no?) we can have someone download and
> convert the current GNIS names for Texas, available here
> as a csv file: http://geonames.usgs.gov/domestic/download_data.htm
> It has a few more points (108k) but the classification appears to
> be the same.

Agreed, I've uploaded a processed file for 2009 
(/opt/data/GNIS-2009/gnis_names09.shp), and I've updated the wiki.

I've also started a file in SVN to record data sources 
(/benchmarking/docs/data-sources.txt).  Can someone who knows the 
sources of the raster data please update this file in SVN?  Thanks.

> 
> AREAWATER_MERGE
> ----------------
> 
> The file contains water polygons (lakes and such); it's quite sparse
> and has 368303 polygons over the state of Texas.
> It seems to make a nice replacement for the tiger_tracts dataset,
> which is nationwide but contains only 4388 polygons in Texas.
> 
> I suggest we use the areawater data set for the polygon test,
> using a uniform bluish fill color with no outline?

Agreed.

> 
> OTHER FILES
> -----------
> 
> arealm_merge.shp is another polygon file, but has few polygons inside.
> tl_2008_48_place.shp is a point file, not so big, and to my surprise
> it cannot be imported into PostGIS using shp2pgsql (charset issues).

It probably requires the "-W latin1" switch, which the geonames file 
also required to import into PostGIS (lesson learned here!). 
tl_2008_48_place.shp is really just an "urban areas" polygon file.
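
For anyone wondering why a charset issue can stop an import, here is a 
small Python sketch of the likely failure. The place name is a made-up 
value, not taken from the file, and the assumption is that the 
attribute data is latin1-encoded while the loader tries to read it as 
UTF-8.

```python
# Bytes as they might appear in a latin1-encoded attribute field (made-up value).
raw = b"Ca\xf1on City"


def decode_name(data, encoding):
    """Decode attribute bytes, returning None if they are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = decode_name(raw, "utf-8")      # 0xF1 is not valid here as UTF-8
as_latin1 = decode_name(raw, "latin-1")  # every byte maps to a character

print(as_utf8)
print(as_latin1)
```

Telling the loader the real source encoding lets it convert the text 
instead of rejecting it.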

> 
> I guess we can safely ignore these two?

Sure, sounds good.

-jeff