[pycsw-devel] help on finding a way to propose some changes to pycsw

Thu Jul 16 04:11:26 PDT 2015

Hi PSC, all

(Sorry for this very long post,

TL;DR: I've been making some heavy changes to pycsw's codebase and would
like to find a nice way for them to be reviewed and possibly mainlined

)

I have recently began working with pycsw and am keen to contribute to the
project.

My motivation:

I am working at the portuguese sea and atmosphere institute, for a project
of the Global Land component of the Copernicus program. We produce
earth observation products based on the processing of satellite imagery.
Our production center is working 24/7 and we have new products every hour.
Each of these products gets also an INSPIRE compliant XML metadata record
that is fed into our CSW catalogue and is available for searching to our
project's partners. Being a Python developer, I am really interested in
moving our current CSW solution to a Python codebase. PyCSW seems like the
perfect fit, with its outstanding standards compliancy record.

Recently I opened an issue on github[1] asking for a change in PyCSW's
database
schema. I wanted to switch over to using native datetime types in instead of
plain strings. Eventually, Angelos replied that such a change is not trivial
because it would mean breaking compatibility with older versions and
probably
some upgrade problems to current deployments (a very wise opinion indeed).
He
also reccommended that I'd bring the discussion over to the mailing list.

[1] - https://github.com/geopython/pycsw/issues/365

Before doing so, I decided to take a more thorough look into pycsw's
codebase
in order to get a better notion of the impact of the proposed change.

As I began my 'pycsw innards code tour', I started spotting some corners
where
there seemed to be oportunities to 'improve' the code (not that it is bad
per
se, I mean to make it more readable, easier to maintain and test, ect). I
could not help immediately starting to hack into it and refactor some of
this
stuff. In my frenzy, I ended up changing a lot of stuff, and it is not even
properly done yet. However, I feel like the changes are really beneficial
to
pycsw, so I'd like them to be merged (once they are ready).

I'd like to get some assistance from the PSC in order to figure out a way to
propose these changes, so that they could be properly reviewed and
accepted/rejected.

Currently, all these changes are piled up in the refactor-setupdb branch in
my
fork of pycsw at:

https://github.com/ricardogsilva/pycsw/tree/refactor-setupdb

Here is an overview on what I have been working on since last week:

* Switched to using sqlalchemy's declarative extension. This allows a
number
  of simplifications to be made in pycsw's codebase:

  * The definition of repository tables is now on a separate
    `pycsw.core.models` module, making it easier to find. Previously, the
    definition of the tables was done using sqlalchemy's classical mappings
    and was present in the `pycsw.core.admin` module

  * Having an explicitly defined class for metadata records allows us to
    reference it directly in the code whenever there is a need to query the
    database. This means we no longer need to use reflection of the database
    schema whenever we need to work with a record. We can call
    `from pycsw.core.models import Record` and then use the `Record` class

  * The declarative extension allows us to remap the names of the database
    table and columns on the fly without changing the names of the
attributes
    on the mapped class. This means that:

    * The user can still define custom database mappings in the same way as
      before

    * Internally, pycsw does not need to care about the actual names of the
      database columns because the names of the attributes on the `Record`
      class remain the same.

   The benefit of this change is that we no longer need to be converting
from
   `pycsw:Identifier` (and the other mappings) all the time, we can use
   `record.identifier`. Whenever the user changes the mappings, a quick
call to
   `context.update_mappings` takes care of the underlying changes in
   sqlalchemy's communication with the database.

   This change also opens up a number of interesting possibilities that we
   may evaluate in the future. One such possibility is adding support for
   changing the database schema using the concept of migrations. The
   sqlalchemy author also maintains the alembic project that deals with
this.
   If at some point pycsw chooses to alter the schema of the database, it
   should become easier to upgrade existing deployments by providing a clear
   migration path and some scripts to automate the process.

* Refactored the metadata parsing code to take advantage of the new
  models.Record class.

* Extended the repository concept to deal with everything related with the
  database.

  Currently there exists the `pycsw.core.repository` module that works with
  an already existing database and the `pycsw.core.admin` which features the
  `setup_db` function to take care of database initialization.

  The proposed solution refactors the code into the
`pycsw.core.repositories`
  package. There is now a base class
`pycsw.core.repositories.base.Repository`
  that performs both tasks of initializing the database and interacting
  with it. The base class is then extended into several subclasses that deal
  mostly with the initialization of each particular backend.

* Unified configuration settings and initialization tasks into a common
object

  The current approach is to use `pycsw.core.config.StaticContext` as an
  aggregator of several settings that are used by other parts of the
codebase.
  However, since this class is not used to store the user-defined settings,
  there seems to be an opportunity to take the aggregation concept a step
  further

  The proposed solution implements the `pycsw.core.configuration` package.
This
  package includes the `configuration.Context` class that is used to group
  together in a single entity:

  * pycsw's internal data model and settings
  * user defined settings
  * the current repository
  * the current database mappings

  The Context class becomes central to pycsw's workflow, since it stores
  references to all runtime and static resources.

  It can be initialized with a number of different resources:

  * A .cfg file that is read with python's SafeConfigParser, just as before
  * An in-memory SafeConfigParser object or dict object (that could have
    been passed by the cgi, wsgi or other wrapper), just as before
  * A .json file that is parsed into a dict with python's json module

  Another nice effect of this change is that the user settings are now
stored
  internally as class attributes and not as section/option pairs in a
  SafeconfigParser. This allows using python types such as booleans, lists,
  etc in the runtime settings, which simplifies the parsing of these values

  The context uses default values for all settings, which makes it possible
to
  only override some of them and still have working code. For example, one
  could setup a context for testing using an in-memory sqlite database by
  calling:

  .. code:: python

     from pycsw.core.configuration.configuration import Context
     from pycsw.core.models import Record
     from pycsw.core.etree import etree

     ctx = Context({"database": "sqlite:///:memory:"})
     ctx.repository.setup_db()
     exml = etree.parse("some_xml_record")
     ctx.repository.insert_record(exml)
     ctx.repository.query(Record).filter(Record.identifier==1234)

Now that I've came to my senses I realize that it was a really bad idea to
tackle so much stuff at once, since it makes it harder to stabilize the code
and also its harder for someone else to sift through changes. I apologize
for
that.

I will focus on finishing up these changes in order to get to a working
state
where all previous functionality is restored and will not change anything
else,
apart from the essential refactor to make the rest of the code work with
the
changes.

I guess I need help in order to come up with a way to propose these changes.
Should I prepare an RFC with all the changes together? Should I try to
isolate
them and propose each as an enhancement? Something else?

Thanks a lot and sorry for the trouble ;)

-- 
___________________________ ___ __
Ricardo Garcia Silva
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/pycsw-devel/attachments/20150716/db532985/attachment.html>