Thanks for the reply and suggestions.<br><br>Well I have now run the same script but with just one thread/process. I would have expected this to have worked if it was a "bulk" out of memory problem (only one pgRouting process running). It failed. Also, with better diagnostics of my own, I tried to recreate the SQL statements on the command line - unfortunately these worked!<br>
<br>I have been using the System Monitor for a while. Previously this showed it hitting swap memory occasionally, so I've bumped the machine memory from 4GB to 8GB, and it hasn't done so since.<br>(Yes this is running 32 bit mainly because PostGIS is 32 bit, but I understand modern Linux has a way to handle more memory (but limited per process) - and it was using the full 4GB). I do note that it does not appear to have gone beyond a full 4GB (+ephemera) memory usage.<br>
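<br>Since the System Monitor only shows the current state, one option is to log memory and swap figures to a file and read them after a crash. This is only a rough sketch: it assumes a Linux machine where /proc/meminfo is readable, and the names parse_meminfo and log_memory, the log path, and the one-minute interval are all made up for illustration.<br>

```python
import time


def parse_meminfo(text):
    """Return {field: value} from the contents of /proc/meminfo (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def log_memory(log_path, interval_s=60, samples=None):
    """Append timestamped free-memory and swap-used figures to log_path.

    Runs until interrupted when samples is None (hypothetical defaults).
    """
    taken = 0
    while samples is None or taken != samples:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
        with open(log_path, "a") as out:
            out.write("%s MemFree=%d kB SwapUsed=%d kB\n" % (
                time.strftime("%Y-%m-%d %H:%M:%S"),
                info.get("MemFree", 0), swap_used))
        taken += 1
        time.sleep(interval_s)
```

Running log_memory("/tmp/mem.log") in a separate terminal alongside the main script would leave a timestamped record to compare against the time of the "signal 6" entry in the server log.<br>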
<br>I tried to adjust the shared memory parameter in PostGres but I think the default must be close to the maximum for standard Ubuntu (something about having to rebuild the kernel to change SHMEM). So the PostGres shared memory setting is back to its default (28MB). work_mem has been upped to 256MB. This change was after the first crash.<br>
<br>Otherwise it is difficult to watch with top or the system monitor because so far it has had to run a while (hours) before the crash occurs.<br><br><br>I guess as a kludgy workaround I could try trapping the client error, wait, and skip (or try again). This should work for a single thread, but might pose problems for my multi-threaded app. That's the problem when the server dies - all client threads have trouble until it restarts.<br>
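<br>A minimal sketch of that trap/wait/retry idea, for the single-threaded case only. It assumes a DB-API style connection (such as psycopg's); run_with_retry and its parameters are hypothetical names, and which exception class the client actually raises when the server dies is left for the caller to supply.<br>

```python
import time


def run_with_retry(connect, sql, params, retriable,
                   retries=3, wait_s=30.0, sleep=time.sleep):
    """Run one query on a fresh connection; retry if the server drops it.

    connect   -- zero-argument callable returning a DB-API connection
    retriable -- exception class (or tuple) treated as "server went away"
    Returns the fetched rows, or None if every attempt failed (skip).
    """
    for attempt in range(retries + 1):
        conn = None
        try:
            conn = connect()
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        except retriable:
            # Server crashed or is still restarting: wait, then try again.
            if attempt != retries:
                sleep(wait_s)
        finally:
            if conn is not None:
                try:
                    conn.close()
                except Exception:
                    pass  # the connection is already dead
    return None
```

With psycopg2 this would presumably be called with connect=lambda: psycopg2.connect(dsn) and retriable=psycopg2.OperationalError, where dsn is a placeholder for the real connection string. A None result means "skip this route and log the node pair". Opening a fresh connection per call is slow but keeps the recovery logic simple.<br>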
<br><br>Richard Marsden<br><br><br><br><br><div class="gmail_quote">On Mon, Feb 28, 2011 at 4:43 PM, Stephen Woodbridge <span dir="ltr"><<a href="mailto:woodbri@swoodbridge.com">woodbri@swoodbridge.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">The only thing(s) that I can think of are:<br>
<br>
1. it could be caused by a call to abort() or assert() in the C code, but:<br>
<br>
woodbri@mappy:~/work/pgrouting-git/pgrouting$ find * -type f -exec grep -l -i abort {} \;<br>
woodbri@mappy:~/work/pgrouting-git/pgrouting$ find * -type f -exec grep -l -i assert {} \;<br>
core/src/CMakeFiles/routing.dir/depend.make<br>
core/src/CMakeFiles/routing.dir/astar_boost_wrapper.o<br>
core/src/CMakeFiles/routing.dir/shooting_star_boost_wrapper.o<br>
core/src/CMakeFiles/routing.dir/depend.internal<br>
core/src/CMakeFiles/routing.dir/boost_wrapper.o<br>
core/src/CMakeFiles/routing.dir/CXX.includecache<br>
lib/librouting.so<br>
<br>
So it does not look like we have one in our source code, but there appear to be references in the .o files that might come from compiler-generated code or from includes outside our source tree, like boost or system libs.<br>
<br>
2. I suppose it is possible that the server is sending a SIGABRT to a child process that is doing something bad like taking too much memory. Or maybe there is an OOM (Out Of Memory) watchdog process killing it with a SIGABRT.<br>
<br>
Have you watched this with top? or some other process watcher?<br>
<br>
Hopefully, you can extract the SQL and run it from the command line so we can get a better handle on what is happening and what the query is.<br>
<br>
-Steve<div><div></div><div class="h5"><br>
<br>
<br>
On 2/28/2011 2:49 PM, Richard Marsden wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div><div></div><div class="h5">
Well I've moved forward and now have code in production calculating<br>
mileages from OpenStreetMap data: I've calculated mileage charts for<br>
Oceania and Africa.<br>
The secret to get that far was to move operating systems from Windows to<br>
Ubuntu and then upgrade to pgRouting 1.05. (PostGres 8.4, Ubuntu 10)<br>
The computations are being performed with dijkstra_sp_delta.<br>
<br>
However now I'm hitting another "server closed the connection<br>
unexpectedly" error.<br>
<br>
Looking in the server logs, I find the LOG message "server process (PID<br>
19133) was terminated by signal 6: Aborted"<br>
From what I can tell, Signal 6 on Ubuntu is indeed a SIGABRT. There<br>
are no other log messages to indicate why Postgres/pgRouting threw a<br>
SIGABRT.<br>
<br>
This is then followed by warnings and log messages saying other active<br>
server processes are being terminated, transactions rolled back, etc.<br>
<br>
This error occurs at a reproducible point in a fairly sophisticated<br>
(multi-processor, Python, psycopg) script. Although I'm pretty certain<br>
of the SQL that is causing the problem, at the moment I don't have the<br>
exact parameters (ie. graph nodes). I'm about to run the script<br>
single-threaded with diagnostics so I should be able to get a single SQL<br>
statement to reproduce the problem on a psql command line. In the worst<br>
case, this could take a couple of days.<br>
No other programs are running that are calling Postgres.<br>
<br>
My graph consists of the global OSM street data loaded into PostGIS with<br>
osm2po. I have checked for links of zero length. In fact all links <1m<br>
long have been taken out of the graph. I've just double checked costs<br>
and reverse_costs: all are positive (I've set these to the lengths)<br>
<br>
I've just checked for start & end nodes being the same (ie. resulting in<br>
dijkstra_sp_delta being called with the same node identifier for the<br>
start and end): Yes my data has a few of these, but I'm pretty certain<br>
the crash occurs before they appear. However, I'm going to add code to<br>
detect these - there's no point in executing an SQL statement for<br>
something that can be calculated in a trivial line of python.<br>
<br>
What else should I be looking for? Are there any known problems I should<br>
look for? Is there any way of finding out what is causing the Signal 6?<br>
<br>
Once I have the node identifiers that are causing the problem, I should<br>
be able to make an exportable extract of the graph to give a<br>
reproducible dataset and matching SQL statement. Would anyone be able to<br>
investigate this?<br>
<br>
Is there any way of making pgRouting / PostGres handle these situations<br>
more cleanly? At the moment, the crash is taking the server down with<br>
it. The crash is perhaps the first to occur after roughly 1 million<br>
route calculations: I can live with that failure rate - but only if my<br>
scripts can cleanly detect and recover from it. I guess ideally the<br>
server should stay up and a status value (or exception - but that<br>
probably wouldn't work across so many code boundaries) be returned.<br>
<br>
<br>
<br>
Best regards,<br>
<br>
<br>
Richard Marsden<br></div></div></blockquote></blockquote></div><br>