[Benchmarking] Using Tiger 2008 data
Jeff McKenna
jmckenna at gatewaygeomatics.com
Sun Sep 27 16:09:13 EDT 2009
Hi Andrea,
Thanks for your thorough review of the TIGER 2008 merged dataset. My
comments are inline below:
Andrea Aime wrote:
> Hi,
> I've been looking a bit at the Tiger 2008 data for
> Texas that Jeff provided; here are some findings and impressions
> on the merged set (the non merged set is only of interest to
> MapServer I think).
True, the non-merged set is only of interest to MapServer right now
(we'll test both for MapServer, in the hope that the numbers show
MapServer users something).
>
> EDGES_MERGE
> ----------------------------------------------------------------
>
> The edges_merge.shp file contains both roads and water lines.
> The classification we had for roads is still applicable with
> the small changes Jeff already suggested on the wiki page here:
> http://wiki.osgeo.org/wiki/Texas_roads_styled
> One major difference between this set and the old one is that
> it does not contain only roads, so the style will perform a
> filtering on top of the data and display only roads (and only
> certain road classes, not all of them), whilst the old
> data set only had roads and displayed them all.
Yes, this 'filtering' will be a great test of the response time from the
mapping servers. Usually I would pre-process the data so that each type
of road is its own shapefile (e.g. interstates.shp, major-roads.shp),
but I'm curious to see how we all do with having to filter these
on-the-fly.
> I guess this makes for an interesting comparison between
> spatial database and shapefile (and will also point out
> systems that are capable of indexing the attributes as well
> in a shapefile, provided there are any... maybe ArcGIS).
> I guess we'll want to index the mtfcc attribute in PostGIS
> to speed up searches.
>
For the MapServer case, OGR does support attribute indexing for
shapefiles ('ogrinfo -sql "CREATE INDEX ON edges_merge USING MTFCC"
edges_merge.shp'), but I believe that the attribute index would
only help if we were using MapServer to query that 'MTFCC' field (Daniel,
am I correct on this?)
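
As a rough illustration of what indexing the mtfcc attribute buys a
filtered query, here is a minimal Python sketch. It uses an in-memory
SQLite table purely as a stand-in for PostGIS (an assumption made so the
example is self-contained; the two planners differ), and the sample rows
and the index name idx_edges_mtfcc are made up. It says nothing about
whether MapServer's shapefile access would use an OGR attribute index.

```python
import sqlite3

# SQLite (in memory) is only a stand-in for PostGIS here; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges_merge (id INTEGER PRIMARY KEY, mtfcc TEXT)")
conn.executemany(
    "INSERT INTO edges_merge (mtfcc) VALUES (?)",
    [("S1100",), ("S1200",), ("S1400",), ("H3010",)] * 250,
)


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " / ".join(row[-1] for row in rows)


# The kind of attribute filter a road style would apply on top of the data.
road_filter = "SELECT id FROM edges_merge WHERE mtfcc = 'S1100'"

plan_before = plan(road_filter)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_edges_mtfcc ON edges_merge (mtfcc)")
plan_after = plan(road_filter)   # the planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

On a real PostGIS table the comparable check would be to run EXPLAIN on
the layer's filter query before and after creating the index.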
> POINTLM_MERGE
> ----------------------------------------------------------------
>
> The pointlm_merge file contains point landmarks, and could
> be used to replace the gnis_names layer.
> I've tried to port the styling made by ESRI over to the
> pointlm file, but doing so resulted in losing half
> of the categories.
Are the ESRI gnis_names stylings (class, icon) posted on the wiki somewhere?
> I've also made a comparison of the data amounts and distributions,
> see here:
>
> select count(*), mtfcc from pointlm_merge group by mtfcc;
> count | mtfcc
> -------+-------
> 1 | C3071
> 68 | K3544
> 3 | K2181
> 3 | K2110
> 84 | K2165
> 1 | C3066
> 1 | K2182
> 214 | K1231
> 599 | K2543
> 8 | K1236
> 41 | K2561
> 11854 | C3061
> 152 | K2190
> 32 | K2582
> 100 | C3062
> 20 | K1225
> 257 | K2451
> 3 | K1237
>
>
> select count(*) from pointlm_merge;
> count
> -------
> 13441
>
> As you can see there are only 13441 points, and the vast
> majority of them are C3061, "Cul de Sac". Won't make for
> a very interesting map imho.
>
> Compare with the gnis_map filtering over the data that
> is available in Texas:
>
> select count(*) from gnis_names_pg where state = 'TX';
> count
> -------
> 95132
>
> select count(*) as cnt, class from gnis_names_pg where state = 'TX'
> group by class order by cnt;
>
> cnt | class
> -------+-----------------------
> 1 | Crater
> 1 | Bench
> 2 | Slope
> 2 | Tunnel
> 5 | Arch
> 5 | Rapids
> 6 | Plain
> 9 | Forest
> 11 | Reserve
> 17 | Woods
> 19 | Arroyo
> 19 | Harbor
> 21 | Pillar
> 27 | Beach
> 27 | Area
> 29 | Falls
> 55 | Bar
> 56 | Mine
> 57 | Post Office
> 61 | Crossing
> 63 | Range
> 65 | Basin
> 67 | Military (Historical)
> 87 | Levee
> 110 | Ridge
> 130 | Bridge
> 146 | Channel
> 157 | Gap
> 180 | Flat
> 185 | Cliff
> 231 | Swamp
> 259 | Gut
> 259 | Bend
> 261 | Island
> 262 | Civil
> 279 | Cape
> 283 | Bay
> 299 | Canal
> 504 | Trail
> 579 | Hospital
> 942 | Well
> 1052 | Tower
> 1243 | Spring
> 1294 | Oilfield
> 1757 | Airport
> 1780 | Lake
> 2117 | Summit
> 2844 | Valley
> 3795 | Building
> 4008 | Park
> 5947 | Dam
> 6016 | Cemetery
> 7980 | Locale
> 8511 | Populated Place
> 8542 | Reservoir
> 8756 | School
> 11640 | Stream
> 12072 | Church
>
> I would say this is much more interesting, and the
> work to define a style for it has already been done.
> I suggest we ignore pointlm_merge and keep on using gnis_names
> instead.
Great comparison. Yes, I agree that we should ignore 'pointlm_merge.shp'.
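
To put a number on how lopsided pointlm_merge is, here is a small Python
sketch that recomputes the share of the largest category from the counts
in the query output quoted above. The dictionary is simply those counts
typed in by hand; nothing here touches the actual database.

```python
# Counts per MTFCC code, copied from the pointlm_merge query output quoted above.
pointlm_counts = {
    "C3071": 1, "K3544": 68, "K2181": 3, "K2110": 3, "K2165": 84,
    "C3066": 1, "K2182": 1, "K1231": 214, "K2543": 599, "K1236": 8,
    "K2561": 41, "C3061": 11854, "K2190": 152, "K2582": 32, "C3062": 100,
    "K1225": 20, "K2451": 257, "K1237": 3,
}

total = sum(pointlm_counts.values())

# Find the single largest category and its share of all points.
top_code, top_count = max(pointlm_counts.items(), key=lambda kv: kv[1])
share = top_count / total

print(f"{top_code}: {top_count} of {total} points ({share:.1%})")
# prints "C3061: 11854 of 13441 points (88.2%)"
```

For comparison, the largest gnis_names class in the output above (Church,
12072 of 95132 rows) accounts for roughly 13% of the points, which is why
that layer spreads across many more categories.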
>
> If we really want to use contemporary data (the major reason why
> Jeff gathered the TIGER 2008 set, no?) we can have someone download and
> convert the current GNIS names for Texas, available here
> as a csv file: http://geonames.usgs.gov/domestic/download_data.htm
> It has a few more points (108k) but the classification appears to
> be the same.
Agreed, I've uploaded a processed file for 2009
(/opt/data/GNIS-2009/gnis_names09.shp), and I've updated the wiki.
I've also started a file in SVN to record data sources
(/benchmarking/docs/data-sources.txt). Can someone who knows the
sources of the raster data please update this file in SVN? Thanks.
>
> AREAWATER_MERGE
> ----------------
>
> The file contains water polygons (lakes and such); it's quite sparse
> and has 368303 polygons over the state of Texas.
> It seems to make a nice replacement for the tiger_tracts dataset,
> which is nationwide but contains only 4388 polygons in Texas.
>
> I suggest we use the areawater data set for the polygon test,
> using a uniform bluish fill color with no outline?
Agreed.
>
> OTHER FILES
> -----------
>
> arealm_merge.shp is another polygon file, but has few polygons inside.
> tl_2008_48_place.shp is a point file, not so big, and to my surprise
> it cannot be imported into PostGIS using shp2pgsql (charset issues).
It probably requires the "-W latin1" switch, which the geonames file
also required to import into PostGIS (lesson learned here!). Note that
tl_2008_48_place.shp is really just an "urban areas" polygon file.
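
For anyone wondering why a charset switch matters on import, here is a
minimal Python sketch of the failure mode, assuming the loader tries to
read the attribute text as UTF-8 unless told otherwise. The attribute
value is a made-up example, not a record taken from the actual file.

```python
# Hypothetical attribute value, encoded the way a Latin-1 .dbf would store it.
raw = "Ca\u00f1on".encode("latin-1")  # the n-tilde becomes the single byte 0xF1

# Reading those bytes as UTF-8 fails, because 0xF1 followed by "o" is not a
# valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Declaring the real source encoding recovers the text intact.
decoded = raw.decode("latin-1")

print("decodes as UTF-8:", utf8_ok)
print("decoded as Latin-1:", decoded)
```

Declaring the source encoding up front is what the "-W latin1" switch
does for shp2pgsql.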
>
> I guess we can safely ignore these two?
Sure, sounds good.
-jeff