[pycsw-devel] SOS Harvesting Error

Tom Kralidis tomkralidis at gmail.com
Sun Nov 2 04:44:18 PST 2014


Dan: thanks for the info.  This sounds like a plan.  Any idea of what
kind of changes this would involve?

Overall workflow to harvest SOS 1.0.0 at
http://sdf.ndbc.noaa.gov/sos/server.php:

- pycsw.server.harvest: validate request parameters
- pycsw.metadata.parse_record: detect SOS
- pycsw.metadata._parse_sos
 - parse SOS GetCapabilities
 - parse service metadata into record object (1 record)
 - parse each observation offering into record object (906 records)
 - return list of record objects (906 observation offerings + 1
   service = 907 records)
- pycsw.server.harvest: begin transaction; insert / update each record
  in the list into the repository; roll back on exception
- pycsw.server.harvest: return response
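
To make the parse step concrete, here is a rough sketch of how one
service record plus one record per offering adds up to 907. This is not
the actual pycsw code: the function names, the dict layout and the
station names are made up for illustration, and a plain dict stands in
for the parsed GetCapabilities response.

```python
def parse_sos(capabilities):
    """Hypothetical stand-in for the SOS parser: 1 service record + 1 per offering."""
    url = capabilities["url"]
    # One record describing the service itself.
    records = [{"identifier": url, "type": "service"}]
    # One record per observation offering, pointing back at the service.
    for name in capabilities["offerings"]:
        records.append({
            "identifier": "%s#%s" % (url, name),
            "type": "dataset",
            "parent": url,
        })
    return records


def parse_record(capabilities):
    """Hypothetical dispatcher: detect the resource type, then parse it."""
    if capabilities.get("service") == "SOS":
        return parse_sos(capabilities)
    raise ValueError("unsupported resource type")


# Illustrative input: 906 made-up offering names.
caps = {
    "service": "SOS",
    "url": "http://sdf.ndbc.noaa.gov/sos/server.php",
    "offerings": ["station-%d" % i for i in range(906)],
}
records = parse_record(caps)
print(len(records))  # 906 offerings + 1 service
```

The whole list is held in memory before anything is written, which is
where the memory pressure discussed in this thread comes from.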

It's important to note that harvesting the SOS must be all-or-nothing:
either all 907 records are written or, on any exception, none are.
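
As a sketch of what all-or-nothing means in practice (again, not the
pycsw implementation): sqlite3 from the standard library stands in for
the repository, and the table layout, function names and the "None
means a broken offering" convention are assumptions for illustration.
Records are produced one at a time by a generator but written inside a
single transaction, so a failure partway through leaves nothing behind.

```python
import sqlite3


def iter_records(offering_ids):
    """Yield (identifier, type) tuples one at a time (hypothetical parser)."""
    yield ("service", "service")
    for oid in offering_ids:
        if oid is None:
            # Simulate an offering that cannot be parsed.
            raise ValueError("unparseable offering")
        yield ("offering-%s" % oid, "dataset")


def harvest_atomic(conn, offering_ids):
    """Insert every record or none; return the number of records stored."""
    count = 0
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            for identifier, rtype in iter_records(offering_ids):
                conn.execute(
                    "INSERT INTO records (identifier, type) VALUES (?, ?)",
                    (identifier, rtype),
                )
                count += 1
    except (ValueError, sqlite3.Error):
        return 0
    return count


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (identifier TEXT PRIMARY KEY, type TEXT NOT NULL)")

failed = harvest_atomic(conn, ["a", None, "b"])  # fails on the second offering
rows_after_failure = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]

stored = harvest_atomic(conn, ["a", "b"])  # 2 offerings + 1 service
rows_after_success = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Note that this only bounds memory on the Python side; the database
still has to hold the open transaction until the final commit.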

On Wed, Oct 29, 2014 at 4:19 PM,  <dan at inlet.geol.sc.edu> wrote:
> Tom,
>
> Again, I have not followed the flow all the way through, but instead of
> building all the records at one time in _parse_sos, the problem could be
> alleviated greatly by batching them, or doing one at a time.
> I suspect this is a non starter since the other parse_xxxx methods are
> built around running an array of records.
>
> Dan
>
>> Dan: thanks for the report and issuing a ticket on GitHub.  This is
>> tough to deal with, given that it's a very specific case for failure
>> (PostgreSQL backend, CPU/VM configuration, big SOS), in terms of
>> stuffing such a big Capabilities response into a CSW backend.  Perhaps
>> we can lessen what is actually harvested (let's continue in the
>> ticket).
>>
>> https://github.com/geopython/pycsw/issues/279
>>
>> Thanks
>>
>> ..Tom
>>
>>
>>
>>
>> On Tue, Oct 28, 2014 at 3:36 PM,  <dan at inlet.geol.sc.edu> wrote:
>>> The virtual server I was running had only 1 GB of memory and it was
>>> running out. I bumped it up to 4 GB and the processing is now working
>>> much better.
>>>
>>> Since the SOS parsing grabs all the records at once, this could
>>> continue to be an issue. I don't know the entire data flow, but I was
>>> thinking a less memory-intensive approach would be to run through the
>>> offerings, fully processing one station and then the next, so the
>>> memory footprint would not keep growing with the station count.
>>>
>>>
>>> Dan
>>>> Some additional logging on line 1792 of server.py
>>>> turned up this traceback:
>>>> Traceback (most recent call last):
>>>>   File "/home/madrona/src/pycsw/pycsw/server.py", line 1792, in harvest
>>>>     pagesize=self.csw_harvest_pagesize)
>>>>   File "/home/madrona/src/pycsw/pycsw/metadata.py", line 91, in parse_record
>>>>     return _parse_sos(context, repos, record, identifier, '1.0.0')
>>>>   File "/home/madrona/src/pycsw/pycsw/metadata.py", line 700, in _parse_sos
>>>>     _set(context, recobj, 'pycsw:XML', etree.tostring(md._capabilities))
>>>>   File "lxml.etree.pyx", line 3157, in lxml.etree.tostring (src/lxml/lxml.etree.c:69517)
>>>>   File "serializer.pxi", line 143, in lxml.etree._tostring (src/lxml/lxml.etree.c:114600)
>>>> MemoryError
>>>>
>>>> Doing a down-and-dirty "top", I could see that the server was most
>>>> likely running out of memory. The NDBC station where it finally died
>>>> was station-42915. I am still harvesting against the NDBC SOS.
>>>>
>>>>
>>>> Dan
>>>>
>>>>> I've apparently taken a step further back: I can't make the parsing
>>>>> happen at all now.
>>>>> On the "client end", when I run the command
>>>>> python bin/pycsw-admin.py -c post_xml -u http://129.252.139.196:8080 -x Harvest-sos100.xml
>>>>>
>>>>> I get the error:
>>>>> Executing HTTP POST request Harvest-sos100.xml on server
>>>>> http://129.252.139.196:8080
>>>>> Traceback (most recent call last):
>>>>>   File "bin/pycsw-admin.py", line 246, in <module>
>>>>>     print admin.post_xml(CSW_URL, XML, TIMEOUT)
>>>>>   File "/usr/local/virtualenv/venv-2.7.8/lib/python2.7/site-packages/pycsw/admin.py", line 495, in post_xml
>>>>>     raise RuntimeError(err)
>>>>> RuntimeError: timed out
>>>>>
>>>>> On the local server I see:
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/local/src/python/lib/python2.7/wsgiref/handlers.py", line 86, in run
>>>>>     self.finish_response()
>>>>>   File "/usr/local/src/python/lib/python2.7/wsgiref/handlers.py", line 128, in finish_response
>>>>>     self.write(data)
>>>>>   File "/usr/local/src/python/lib/python2.7/wsgiref/handlers.py", line 212, in write
>>>>>     self.send_headers()
>>>>>   File "/usr/local/src/python/lib/python2.7/wsgiref/handlers.py", line 270, in send_headers
>>>>>     self.send_preamble()
>>>>>   File "/usr/local/src/python/lib/python2.7/wsgiref/handlers.py", line 194, in send_preamble
>>>>>     'Date: %s\r\n' % format_date_time(time.time())
>>>>>   File "/usr/local/src/python/lib/python2.7/socket.py", line 324, in write
>>>>>     self.flush()
>>>>>   File "/usr/local/src/python/lib/python2.7/socket.py", line 303, in flush
>>>>>     self._sock.sendall(view[write_offset:write_offset+buffer_size])
>>>>> error: [Errno 32] Broken pipe
>>>>> 129.252.139.68 - - [28/Oct/2014 08:38:15] "POST / HTTP/1.1" 500 59
>>>>> ----------------------------------------
>>>>> Exception happened during processing of request from ('129.252.139.68', 51289)
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/local/src/python/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
>>>>>     self.process_request(request, client_address)
>>>>>   File "/usr/local/src/python/lib/python2.7/SocketServer.py", line 321, in process_request
>>>>>     self.finish_request(request, client_address)
>>>>>   File "/usr/local/src/python/lib/python2.7/SocketServer.py", line 334, in finish_request
>>>>>     self.RequestHandlerClass(request, client_address, self)
>>>>>   File "/usr/local/src/python/lib/python2.7/SocketServer.py", line 653, in __init__
>>>>>     self.finish()
>>>>>   File "/usr/local/src/python/lib/python2.7/SocketServer.py", line 712, in finish
>>>>>     self.wfile.close()
>>>>>   File "/usr/local/src/python/lib/python2.7/socket.py", line 279, in close
>>>>>     self.flush()
>>>>>   File "/usr/local/src/python/lib/python2.7/socket.py", line 303, in flush
>>>>>     self._sock.sendall(view[write_offset:write_offset+buffer_size])
>>>>> error: [Errno 32] Broken pipe
>>>>> ----------------------------------------
>>>>>
>>>>> and finally in the log:
>>>>> file=/home/madrona/src/pycsw/pycsw/server.py line=2331 module=server
>>>>> function=_write_response Response:
>>>>> <ows:ExceptionReport xmlns:dc="http://purl.org/dc/elements/1.1/"
>>>>> xmlns:inspire_common="http://inspire.ec.europa.eu/schemas/common/1.0"
>>>>> xmlns:atom="http://www.w3.org/2005/Atom"
>>>>> xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>>>> xmlns:dct="http://purl.org/dc/terms/"
>>>>> xmlns:ows="http://www.opengis.net/ows"
>>>>> xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0"
>>>>> xmlns:gml="http://www.opengis.net/gml"
>>>>> xmlns:dif="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"
>>>>> xmlns:xlink="http://www.w3.org/1999/xlink"
>>>>> xmlns:gco="http://www.isotc211.org/2005/gco"
>>>>> xmlns:gmd="http://www.isotc211.org/2005/gmd"
>>>>> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>> xmlns:srv="http://www.isotc211.org/2005/srv"
>>>>> xmlns:ogc="http://www.opengis.net/ogc"
>>>>> xmlns:fgdc="http://www.opengis.net/cat/csw/csdgm"
>>>>> xmlns:inspire_ds="http://inspire.ec.europa.eu/schemas/inspire_ds/1.0"
>>>>> xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
>>>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>> xmlns:os="http://a9.com/-/spec/opensearch/1.1/"
>>>>> xmlns:soapenv="http://www.w3.org/2003/05/soap-envelope"
>>>>> xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9"
>>>>> language="en-US" version="1.2.0"
>>>>> xsi:schemaLocation="http://www.opengis.net/ows
>>>>> http://schemas.opengis.net/ows/1.0.0/owsExceptionReport.xsd">
>>>>>   <ows:Exception exceptionCode="NoApplicableCode" locator="source">
>>>>>     <ows:ExceptionText>Harvest failed: record parsing failed:
>>>>> </ows:ExceptionText>
>>>>>   </ows:Exception>
>>>>> </ows:ExceptionReport>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> pycsw-devel mailing list
>>>>> pycsw-devel at lists.osgeo.org
>>>>> http://lists.osgeo.org/mailman/listinfo/pycsw-devel

