[pgrouting-dev] GSoC partitioning project

Fri Jul 5 07:19:59 PDT 2013

Mukul asks:
> can you help me coming up with basic class skeletons for retrieving
> partitions using an index key  and for keeping a track of loaded
> partitions as well .

The problem:

* we have an incomplete graph to solve
* we need to find a node in the graph but it is not in the graph
   * we call a method graph::LoadPartition(partition_id) to load the 
partition with that node
   * if this fails then the route fails and we clean up, and throw an error
* otherwise we continue solving the graph until we need to load another 
partition

This assumes it is cheap to find a node or to find that it is missing.
But the question is how do we define "missing".

A node has two pieces of information, the node_id and the partition_id. 
We can can make some assumptions here that if both ends of an edge are 
in the same partition then we will never find it in the graph unless the 
partition has been loaded. So edges that cross partition boundaries are 
the ones that concern us because the adjacent partition may or may not 
have been loaded.

So we might define an edge like:

eid   -- edge id
sid   -- source node id
spid  -- source partition id
tid   -- target node id
tpid  -- target partition id
cost
rcost
...

We could set both spid and tpid to zero if the edge is not a boundary 
edge. So a simple check if either of these is zero then we know both are 
loaded. If they are not zero then we need to check if the partition is 
loaded and if not call graph::LoadPartiton(partition_id).

I would recommend maintaining a bit array where the partition_id >0 is 
the index into the bit array.

How does graph::LoadPartiton(partition_id) work?

After we partition the data we have a tables like:

CREATE TABLE grid_vertices
(
   id integer NOT NULL,   -- node id
   pid integer,           -- partition id
   the_geom geometry,
   quadkey text,
   CONSTRAINT grid_vertices_pkey PRIMARY KEY (id)
);

The following is the edge table and it can have other columns as need or 
it can be a view with the source, target, sgid, and tgid columns joined 
from another table.

CREATE TABLE grid
(
   gid serial NOT NULL,
   source integer,         -- source node_id
   target integer,         -- target node_id
   cost double precision,
   sgid integer,           -- source grid_id (partition_id)
   tgid integer,           -- target grid_id (partition_id)
   CONSTRAINT grid_pkey PRIMARY KEY (gid)
);

So the LoadPartition() method will call a C function that will run a 
query using the postgresql SPI functions, build a struct array and 
return the array pointer to the method that will load the array into the 
graph and update the bit array that tracks the loaded partitions. Then 
free the struct array and return ok or fail.

The C code will execute a query to select all the edges that have at 
least one end in that partition:

    select * from grid where sgid=<pid>
    union
    select * from grid where tgid=<pid>

both sgid and tgid columns should be indexed on the table.

The struct will be something like:

typedef struct partition_t {
   integer gid;
   integer source;
   integer spart;
   integer target;
   integer tpart
   double precision cost;
   double precision rcost;
   ...
}

You will need to return the edge count for the partition_t[] array or 
add a dummy record with gid = -1 to signal the end of the array.

When the method loads the edges, if will zero the partition_ids if both 
are the same. It will need to check if they are different, and check if 
the other partition is already load if that edge as already been loaded 
into the graph to avoid duplication of the edges for these boundary edges.

Does this cover everything?
Do you need some help with the C code? and processing the SQL query?

-Steve