[gdal-dev] VRT derived band pixel functions written in Python

Mon Sep 12 07:59:34 PDT 2016

Hi Even,

This is a really cool and really impressive feature!
Calling Python code from C++ without development packages as a dependency sounds like black magic to me. Obviously Python symbols have to be there at some point to execute Python code, so this is only usable from a binary that happens to load them already (CPython, QGIS, etc.), correct? In particular you say that it is "done at run-time by dynamically loading Python symbols". But I'm confused because later you mention incompatibilities issues between the CPython and Pypy API. GDAL's secret sauce, I guess...?
I'm also curious why it's possible to use numba but no pypy, which AFAIK are both python JITs? And finally did you consider using Cython (which claims pypy compatibility)?

Cheers,

Victor Poughon

________________________________________
De : gdal-dev [gdal-dev-bounces at lists.osgeo.org] de la part de Even Rouault [even.rouault at spatialys.com]
Envoyé : lundi 12 septembre 2016 14:31
À : gdal-dev at lists.osgeo.org
Objet : [gdal-dev] VRT derived band pixel functions written in Python

Hi,

I wanted to mention a new (and I think pretty cool ) feature I've added in
trunk: the possibility to define pixel functions in Python in a VRT derived
band.

Let's start with a simple example to multiply a raster by 1.5 :

<VRTDataset rasterXSize="20" rasterYSize="20">
  <SRS>EPSG:26711</SRS>
  <GeoTransform>440720,60,0,3751320,0,-60</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1" subClass="VRTDerivedRasterBand">
    <PixelFunctionType>multiply</PixelFunctionType>
    <PixelFunctionLanguage>Python</PixelFunctionLanguage>
    <PixelFunctionArguments factor="1.5"/>
    <PixelFunctionCode><![CDATA[
import numpy as np
def multiply(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize,
                   raster_ysize, buf_radius, gt, **kwargs):
    factor = float(kwargs['factor'])
    out_ar[:] = np.round_(np.clip(in_ar[0] * factor,0,255))
]]>
    </PixelFunctionCode>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">byte.tif</SourceFilename>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>

or to add 2 (or more) rasters:

<VRTDataset rasterXSize="20" rasterYSize="20">
  <SRS>EPSG:26711</SRS>
  <GeoTransform>440720,60,0,3751320,0,-60</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1" subClass="VRTDerivedRasterBand">
    <PixelFunctionType>add</PixelFunctionType>
    <PixelFunctionLanguage>Python</PixelFunctionLanguage>
    <PixelFunctionCode><![CDATA[
import numpy as np
def add(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize,
                   raster_ysize, buf_radius, gt, **kwargs):
    np.round_(np.clip(np.sum(in_ar, axis = 0, dtype = 'uint16'),0,255),
              out = out_ar)
]]>
    </PixelFunctionCode>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">byte.tif</SourceFilename>
    </SimpleSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">byte2.tif</SourceFilename>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>

You can put any python module code inside PixelFunctionCode, with at least one
function with the following arguments :
- in_ar: list of input numpy arrays (one numpy array for each source)
- out_ar: output numpy array to fill (initialized at the right dimensions and
with the VRTRasterBand.dataType)
- xoff: pixel offset to the top left corner of the accessed region of the band
- yoff: line offset to the top left corner of the accessed region of the band
- xsize: width of the region of the accessed region of the band
- ysize: height of the region of the accessed region of the band
- raster_xsize: total with of the raster band
- raster_ysize: total height of the raster band
- buf_radius: radius of the buffer (in pixels) added to the left, right, top
and bottom of in_ar / out_ar
- gt: geotransform
- kwargs: dictionnary with user arguments defined in PixelFunctionArguments

For basic operations, you just need to care about in_ar and out_ar.

With all that, you can do interesting stuff like hillshading (code ported from
gdaldem):

<VRTDataset rasterXSize="121" rasterYSize="121">
  <SRS>EPSG:4326</SRS>
  <GeoTransform>-80.004166666666663,0.008333333333333,0,
44.004166666666663,0,-0.008333333333333</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1" subClass="VRTDerivedRasterBand">
    <ColorInterp>Gray</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">n43.dt0</SourceFilename>
    </SimpleSource>
    <PixelFunctionLanguage>Python</PixelFunctionLanguage>
    <PixelFunctionType>hillshade</PixelFunctionType>
    <PixelFunctionArguments scale="111120" z_factor="30" />
    <PixelFunctionCode>
      <![CDATA[
# Licence: X/MIT
# Copyright 2016, Even Rouault
import math

def hillshade_int(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize,
                         raster_ysize, radius, gt, z, scale):
    ovr_scale_x = float(out_ar.shape[1] - 2 * radius) / xsize
    ovr_scale_y = float(out_ar.shape[0] - 2 * radius) / ysize
    ewres = gt[1] / ovr_scale_x
    nsres = gt[5] / ovr_scale_y
    inv_nsres = 1.0 / nsres
    inv_ewres = 1.0 / ewres

    az = 315
    alt = 45
    degreesToRadians = math.pi / 180

    sin_alt = math.sin(alt * degreesToRadians)
    azRadians = az * degreesToRadians
    z_scale_factor = z / (8 * scale)
    cos_alt_mul_z_scale_factor = \
              math.cos(alt * degreesToRadians) * z_scale_factor
    cos_az_mul_cos_alt_mul_z_scale_factor_mul_254 = \
                254 * math.cos(azRadians) * cos_alt_mul_z_scale_factor
    sin_az_mul_cos_alt_mul_z_scale_factor_mul_254 = \
                254 * math.sin(azRadians) * cos_alt_mul_z_scale_factor
    square_z_scale_factor = z_scale_factor * z_scale_factor
    sin_alt_mul_254 = 254.0 * sin_alt

    for j in range(radius, out_ar.shape[0]-radius):
        win_line = in_ar[0][j-radius:j+radius+1,:]
        for i in range(radius, out_ar.shape[1]-radius):
            win = win_line[:,i-radius:i+radius+1].tolist()
            x = inv_ewres * ((win[0][0] + win[1][0] + win[1][0] + win[2][0])-\
                             (win[0][2] + win[1][2] + win[1][2] + win[2][2]))
            y = inv_nsres * ((win[2][0] + win[2][1] + win[2][1] + win[2][2])-\
                             (win[0][0] + win[0][1] + win[0][1] + win[0][2]))
            xx_plus_yy = x * x + y * y
            cang_mul_254 = (sin_alt_mul_254 - \
                (y * cos_az_mul_cos_alt_mul_z_scale_factor_mul_254 - \
                    x * sin_az_mul_cos_alt_mul_z_scale_factor_mul_254)) / \
                math.sqrt(1 + square_z_scale_factor * xx_plus_yy)
            if cang_mul_254 < 0:
                out_ar[j,i] = 1
            else:
                out_ar[j,i] = 1 + round(cang_mul_254)

def hillshade(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize,
              raster_ysize, radius, gt, **kwargs):
    z = float(kwargs['z_factor'])
    scale= float(kwargs['scale'])
    hillshade_int(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize,
                  raster_ysize, radius, gt, z, scale)

]]>
    </PixelFunctionCode>
    <BufferRadius>1</BufferRadius>
    <SourceTransferType>Int16</SourceTransferType>
  </VRTRasterBand>
</VRTDataset>

You can completely offload the python code itself into a proper my_lib.py file
and just specify
    <PixelFunctionType>my_lib.hillshade</PixelFunctionType>

Technically, the interfacing with Python is done at run-time by dynamically
loading Python symbols with dlopen()/GetProcAddress(), when they are already
available in the process. For example if libgdal is loaded from a Python
interpreter, or from a binary like QGIS which has already loaded the Python
lib. Otherwise a few shared objects ("libpython2.7.so", "python27.dll",
"libpython3.4m.so", etc.) are tried, unless the PYTHONSO config option is
specified to point to a precise filename. The advantage of this approach is that
the same GDAL library binary is compatible of all Python 2.X (tested: 2.6,
2.7) and 3.X (tested: 3.1, 3.4) versions, and any numpy version that comes in
the Python environment used (at compilation time you don't need any
python/numpy development package). The numpy dependency is not a critical one:
one could imagine a fallback mode where Python arrays would be used instead,
but this has likely little interest.

Successfully tested on Linux, MacOSX, FreeBSD and Windows. Some extra tweaking
of the predefined set of shared object names - that are probed when no already
loaded Python environment is found - might be needed.
The implementation should be thread-safe regarding use of the Python Global
Interpreter Lock (GIL).

I've also tested with the numba Just-In-Time compiler
(http://numba.pydata.org/) and it provides major performance improvements for
highly computational code (example given below).
With numba enabled, I found that my above Python hillshading on a 10Kx10K
float32 raster was even faster than gdaldem (the reason is that the Python
version is a bit simplified as it doesn't take into account input nodata
values), whereas the non-jit'ed one is 100x slower. When removing the nodata
flag, gdaldem is only twice faster as the jit'ed python code. So this is a good
sign that such approach isn't only a toy or just for prototyping.
Speaking of JIT, there's no provision (yet?) for interfacing with PyPy. Would
require a new backend as PyPy C API has nothing to do with the CPython one.

There are obvious security concerns in allowing Python code to be run when
getting the content of a vrt file. The GDAL_VRT_ENABLE_PYTHON config option =
IF_SAFE / NO / YES can be set to control the behaviour. The default is IF_SAFE
(can be change at compilation time by defining
-DGDAL_VRT_ENABLE_PYTHON_DEFAULT="NO" e.g. And Python code execution can be
completely disabled with -DGDAL_VRT_DISABLE_PYTHON). Safe must be understood
as: the code will not read, write, delete... files, spawn external code, do
network activity, etc. Said otherwise, the code will only do "maths". But
infinite looping is something definitely possible in the safe mode. The
heuristics of the IF_SAFE mode is rather basic and I'd be grateful if people
could point ways of breaking it. If any of the following strings - "import"
(unless it is "import numpy" / "from numpy import ...", "import math" / "from
math import ..." or "from numba import jit"), "eval", "compile", "open",
"load", "file", "input", "save", "memmap", "DataSource", "genfromtxt",
"getattr", "ctypeslib", "testing", "dump", "fromregex" - is found anywhere in
the code, then the code is considered unsafe (there are interestingly a lot of
methods in numpy to do file I/O. Hopefully I've captured them with the previous
filters). Another 'unsafe' pattern is when the pixel function references an
external module like my above my_lib.hillshade example (who knows if there
will not be some day a shutil.reformat_your_hard_drive function with the right
prototype...)

This new capability isn't yet documented in the VRT doc, although this message
will be a start.

I'm interested in feedback you may have.

And to conclude with a fun example: a raster with a Mandelbrot fractal. Just a
grey-level version. Let to the reader as an exercice: add a color table. To be
opened for example in QGIS and enjoy the almost infinite zoom feeling. Make
sure to disable contrast enhancement. It uses numba when available, and when
this is the case, it's really fast when paning/zooming.

<VRTDataset rasterXSize="100000000" rasterYSize="100000000">
  <VRTRasterBand dataType="Byte" band="1" subClass="VRTDerivedRasterBand">
    <PixelFunctionLanguage>Python</PixelFunctionLanguage>
    <PixelFunctionType>mandelbrot</PixelFunctionType>
    <PixelFunctionCode>
      <![CDATA[
try:
    from numba import jit
    #print('Using numba')
    g_max_iterations = 100
except:
    class jit(object):
        def __init__(self, nopython = True, nogil = True):
            pass

        def __call__(self, f):
            return f

    #print('Using non-JIT version')
    g_max_iterations = 25

# Use a wrapper since the VRT code cannot access the jit decorated function
def mandelbrot(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize,
                        raster_ysize, r, gt, **kwargs):
    mandelbrot_jit(out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize,
g_max_iterations)

@jit(nopython=True, nogil=True)
def mandelbrot_jit(out_ar, xoff, yoff, xsize, ysize, raster_xsize,
                        raster_ysize, max_iterations):
    ovr_factor_y = float(out_ar.shape[0]) / ysize
    ovr_factor_x = float(out_ar.shape[1]) / xsize
    for j in range( out_ar.shape[0]):
        y0 = 2.0 * (yoff + j / ovr_factor_y) / raster_ysize - 1
        for i in range(out_ar.shape[1]):
            x0 = 3.5 * (xoff + i / ovr_factor_x) / raster_xsize - 2.5
            x = 0.0
            y = 0.0
            x2 = 0.0
            y2 = 0.0
            iteration = 0
            while x2 + y2 < 4 and iteration < max_iterations:
                y = 2*x*y + y0
                x = x2 - y2 + x0
                x2 = x * x
                y2 = y * y
                iteration += 1

            out_ar[j][i] = iteration * 255 / max_iterations
]]>
    </PixelFunctionCode>
    <Metadata>
      <MDI key="STATISTICS_MAXIMUM">255</MDI>
      <MDI key="STATISTICS_MEAN">127</MDI>
      <MDI key="STATISTICS_MINIMUM">0</MDI>
      <MDI key="STATISTICS_STDDEV">127</MDI>
    </Metadata>
    <ColorInterp>Gray</ColorInterp>
    <Histograms>
      <HistItem>
        <HistMin>-0.5</HistMin>
        <HistMax>255.5</HistMax>
        <BucketCount>256</BucketCount>
        <IncludeOutOfRange>0</IncludeOutOfRange>
        <Approximate>1</Approximate>
        <HistCounts>0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0</HistCounts>
      </HistItem>
    </Histograms>
  </VRTRasterBand>
</VRTDataset>

Statistics have been added just to make QGIS open the file a bit quicker.

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com
_______________________________________________
gdal-dev mailing list
gdal-dev at lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/gdal-dev