[gdal-dev] Parallelization slows down single gdal_calc process in python

Jerl Simpson jsimpson at wxtrends.com
Thu Mar 3 04:14:48 PST 2016


Hi Lorenzo:

This is more of a question for the python community.  However, there are a
couple of things I have noticed.  Pandas tends to be much slower than
working in numpy directly.
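For example (an illustrative sketch only, not from the original post): pulling the columns out of the DataFrame as numpy arrays once, instead of indexing the DataFrame row by row inside a loop, keeps the per-element work cheap.

```python
import numpy as np
import pandas as pd

# Toy coefficients standing in for the k1/k2 columns in the real input.csv.
df = pd.DataFrame({'k1': [0.1, 0.2, 0.3], 'k2': [0.4, 0.5, 0.6]})

# Per-row pandas indexing inside a loop is convenient but slow;
# .values converts each column to a plain numpy array up front.
k1 = df['k1'].values
k2 = df['k2'].values

# Vectorized arithmetic over whole arrays, no Python-level loop.
out = k1 * 2.0 + k2 * 3.0
print(out.tolist())
```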
I never saw an improvement in timings when using Pool().  What I do is use
Process() and Queue() or JoinableQueue() from the multiprocessing library.

You can set up a pool of workers that all read from the same input queue,
then feed your data into that queue so the workers can process the maps.





Jerl Simpson
Sr. Systems Engineer
Weather Trends International
http://www.weathertrends360.com/



On Wed, Mar 2, 2016 at 5:44 PM, Lorenzo Bottaccioli <
lorenzo.bottaccioli at gmail.com> wrote:

> Hi,
> I'm trying to parallelize a code for raster calculation with gdal_calc.py,
> but I am getting really bad results. I need to perform several raster
> operations like FILE_out = FILE_a*k1 + FILE_b*k2.
>
> This is the code I'm using:
>
> import os
> import time
> from multiprocessing import Pool
>
> import pandas as pd
>
> def mapcalc(df):
>     month={1:'17',2:'47',3:'75',4:'105',5:'135',6:'162',7:'198',8:'228',9:'258',10:'288',11:'318',12:'344'}
>     hour={4:'04',5:'05',6:'06',7:'07',8:'08',9:'09',10:'10',11:'11',12:'12',13:'13',14:'14',15:'15',16:'16',17:'17',18:'18',19:'19',20:'20',21:'21',22:'22'}
>     minute={0:'00',15:'15',30:'30',45:'45'}
>     directory='/home/user/Raster/'
>     tmp='/home/usr/tmp/'
>     for i in df.index:
>         if 4 <= i.hour < 22:
>             timeg = time.time()
>             os.system('gdal_calc.py -A ' + directory + 'filea_' + month[i.month] + '_' + hour[i.hour] + minute[i.minute]
>                       + ' -B ' + directory + 'fileb_' + month[i.month] + '_' + hour[i.hour] + minute[i.minute]
>                       + ' --outfile=' + tmp + str(i.date()) + '_' + str(i.time())
>                       + ' --calc=A*' + str(df.loc[i, 'k1']) + '+B*' + str(df.loc[i, 'k2']))
>             print(i, "--- %s seconds ---" % (time.time() - timeg))
>
> df = pd.read_csv('input.csv', sep=";", index_col='Date Time', decimal=',')
> df.index = pd.to_datetime(df.index, unit='s')
>
> start_time = time.time()
> pool = Pool(processes=8)
> pool.map(mapcalc, [df.iloc[i*20:(i+1)*20] for i in range(len(df.index)//20 + 1)])
> pool.close()
> pool.join()
> print("--- %s seconds ---" % (time.time() - start_time))
>
> If I run the code without parallelization it takes around 650s to
> complete the calculation, and each pass of the for loop executes in ~10s.
> If I run with parallelization it takes ~900s to complete, and each pass
> of the loop takes ~30s.
>
> Why is that? How can I fix this?
>
> Best L
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/gdal-dev
>

