[Live-demo] "big" data sets on OSGeo Live
Hamish
hamish_b at yahoo.com
Mon Jan 28 13:06:43 PST 2013
Peter wrote:
> as space is running short on the
> ISO we need to reconsider sizings. In
> today's OSGeo Live chat I was
> tasked to initiate this discussion.
> Current disk footprint of each
> application is available from
> https://docs.google.com/spreadsheet/ccc?key=0Al9zh8DjmU_RdGIzd0VLLTBpQVJuNVlHMlBWSDhKLXc#gid=13
>
> Top riders currently:
> - mapguide: 550mb
(where are you getting your numbers from? we disabled that
when we ran out of disc space about 3-4 releases ago)
> - marble: 300 mb (includes disk caching!)
> - grass >250mb
(again, is that up to date? we've since dropped the large NC
sample dataset in favour of a stripped down one and the GeoTiffs
and SHP versions to be used by all)
> One way of saving space is to
> enlarge it (ie, go to 8 GB) which is
> unclear due to some unresolved
> questions; another one is to deflate
> unused datasets, a third one to
> share data among applications.
>
> Can I ask all those with 3-digit
> disk hunger to chime into this
> discussion.
Hi,
(fwiw & I'm not sure how the numbers put in the spreadsheet were calculated so I may speak out of turn)
In general the disc space numbers derived from the "before and after" free disk space in the (non-chroot) build logs are wildly inaccurate and misleading. What ultimately matters is the installed compressed size. Raster satellite images and video tutorials are unfortunately the worst here since they are the least compressible.
In the case of Marble, you can ignore disc caching as it is empty by default and isn't stored in the ISO. "apt-cache show" has the marble-data package at Installed-Size: 20736 (that field is in KiB, so about 20 MB). Looking at the file size of the .deb packages is probably a good indication of the on-disc compressed size. I suspect there are other KDE apps on the disc and Marble is just taking the rap as the first one installed.
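A rough sketch of that comparison, assuming the standard Debian control fields printed by "apt-cache show" (Installed-Size in KiB, Size as the .deb download size in bytes). The helper name deb_sizes is made up for illustration, and the .deb size is only a proxy for how well a package compresses inside the ISO's filesystem:

```shell
# Print "package<TAB>installed MB<TAB>.deb MB" for each control stanza on stdin.
# Installed-Size is in KiB and Size is in bytes (standard Debian control fields).
# deb_sizes is a hypothetical helper name, not part of the build scripts.
deb_sizes() {
    awk -F': ' '
        function emit() {
            printf "%s\t%.1f\t%.1f\n", pkg, inst / 1024, size / 1048576
            pkg = ""; inst = 0; size = 0
        }
        $1 == "Package"        { pkg = $2 }
        $1 == "Installed-Size" { inst = $2 }
        $1 == "Size"           { size = $2 }
        /^$/ { if (pkg != "") emit() }   # a blank line ends a stanza
        END  { if (pkg != "") emit() }   # flush the last stanza
    '
}

# Example use on the live system (left commented out, as it needs apt):
#   apt-cache show marble-data grass-doc | deb_sizes
```

A package whose second column is large but whose third column is small (docs and man pages, for example) costs far less on the compressed disc than its installed size suggests.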
GRASS is in a similar situation. By chance and history it gets installed one script ahead of QGIS, and so brings in all the many dependencies used by both. Switch the order and QGIS looks huge*. Also, GRASS's biggest package is perhaps its docs, man pages and html: huge in raw space, but they compress extremely well. We have 67mb total sample data (compressed), which I admit is a big bite; but I'd note that we got rid of the large 135mb North Carolina dataset some months ago after installing the system-wide geotiff and shapefile versions.
[*] probably in large part the dependency on gdal-dev, since it depends on lots and lots of other -dev packages. But that's one of the most important groups of packages on the disc and commonly used, so I'd be loath to think about cutting it.
Inline and at the end of main.sh (no longer used in the mainstream build) are a number of tests on the completed file system for disc space hogs. They listed free disk space before and after, and the top 75 biggest packages installed:
echo "Show top 75 packages hogging the most space on the disc:"
dpkg-query --show --showformat='${Package;-50}\t${Installed-Size}\t${Status}\n' \
| sort -k 2 -n | grep -v deinstall | tac | head -n 75 | \
awk '{printf "%.3f MB \t %s\n", $2/(1024), $1}'
But that is deceptive too: again it reports the "take a guess how well it compressed" uncompressed size, and it also misses all(?) of the Java apps, which are not in .deb pkgs and are often hundreds of MB each (see tarball downloads in the build log). But there too, they often share common files (e.g. each ships its own copy of tomcat and JAI), and as long as those are at the same version the `fslint` step in build_iso.sh (that's still part of the build process, right?) hardlinks all the duplicates together, making them much more efficient.
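To get a feel for how much duplication there is to hardlink, something along these lines could be run over a directory of unpacked Java apps. This is only a sketch using md5sum checksums, not what `fslint` itself does internally; the function name count_dup_files and the example path are hypothetical, and it assumes ordinary file names without newlines:

```shell
# Count redundant copies of identical files under a directory, i.e. the
# files a duplicate-hardlinking pass could collapse. Files that are already
# hardlinked to each other are counted too, since their checksums match.
# count_dup_files is a made-up name for illustration.
count_dup_files() {
    find "$1" -type f -exec md5sum {} + |
        sort |
        awk '{ if ($1 == prev) dups++; prev = $1 } END { print dups + 0 }'
}

# Example use (path is an assumption, adjust to where the apps are unpacked):
#   count_dup_files /usr/local/lib
```

Three identical copies of one jar count as two redundant files, since one copy always has to stay.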
A great tool for exploring disk use is `filelight`; it's already installed on the Live DVD.
So in summary, many grains of salt are needed. The perennial low-hanging fruit from my perspective was getting the Java apps to use shared system libs and tools, but unfortunately that's not really how Java apps expect to work.
We have the basics for supporting online datasets & tutorials from the main desktop icons; we could and should make better use of that.
As a target, I'd like to see >= 500mb space free on a 4.3gb DVD for installers, and/or out of the 3.8gb(?) available on a vFAT formatted live 4gb USB stick (which means we are about 200mb over the soft goal but still viable). The USB stick chews through free space very quickly, as every change to the base system gets stored as a binary diff applied at boot time; the original image remains compressed and untouched. An "apt-get upgrade" is likely to fill it all. Our best hope, I think, is to wait it out until 8gb sticks are the cheapest USB sticks that conferences can buy.
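The free-space target above is simple arithmetic that could be checked at the end of a build. A minimal sketch, where check_headroom is a made-up helper and all sizes are passed in as whole MB (treating "4.3gb" as 4300 MB is an assumption about the units meant):

```shell
# Report whether an image leaves the desired free space on the target media.
# Arguments: image size, media capacity, free-space target (all in whole MB).
# check_headroom is a hypothetical helper, not part of the build scripts.
check_headroom() {
    image_mb=$1
    capacity_mb=$2
    target_mb=$3
    free_mb=$((capacity_mb - image_mb))
    if [ "$free_mb" -ge "$target_mb" ]; then
        echo "ok: ${free_mb} MB free"
    else
        echo "short by $((target_mb - free_mb)) MB"
    fi
}

# Example: a 3900 MB image on a 4300 MB DVD with a 500 MB target.
#   check_headroom 3900 4300 500
```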
I had to hack the USB creation tool's code on the disc to allow us to write to a bootable 4gb usb stick; the default was to grey out the device if there would be less than a gig of free space on it.
hope it helps,
Hamish