[pycsw-devel] help on finding a way to propose some changes to pycsw

Fri Jul 17 07:27:32 PDT 2015

Hi Ricardo: thanks very much for the deep deep investigation and analysis
to the codebase -- wow!

I can't speak for all pycsw devs but I will say there is certainly alot of room
for improvement in pycsw and contributions are always welcome.  For the most
part the (very nice!) enhancements / refactoring you outline make sense.

At this point I would suggest submitting the changes as a series of
granular, logical GitHub pull requests (as you have done so far).  This will
allow devs to do code review in a manageable fashion, and be able to track
down unit testing issues against travis-ci.

For deeper impact changes like using native datetime types I would recommend
working with a pycsw dev to draft an RFC.

Looking forward to these valuable contributions!

Thanks

..Tom

On Thu, 16 Jul 2015, Ricardo Filipe Soares Garcia da wrote:

> Date: Thu, 16 Jul 2015 12:11:26 +0100
> From: Ricardo Filipe Soares Garcia da <ricardo.garcia.silva at gmail.com>
> To: pycsw-devel at lists.osgeo.org
> Subject: [pycsw-devel] help on finding a way to propose some changes to pycsw
> 
> Hi PSC, all
>
> (Sorry for this very long post,
>
> TL;DR: I've been making some heavy changes to pycsw's codebase and would
> like to find a nice way for them to be reviewed and possibly mainlined
>
> )
>
>
> I have recently began working with pycsw and am keen to contribute to the
> project.
>
> My motivation:
>
> I am working at the portuguese sea and atmosphere institute, for a project
> of the Global Land component of the Copernicus program. We produce
> earth observation products based on the processing of satellite imagery.
> Our production center is working 24/7 and we have new products every hour.
> Each of these products gets also an INSPIRE compliant XML metadata record
> that is fed into our CSW catalogue and is available for searching to our
> project's partners. Being a Python developer, I am really interested in
> moving our current CSW solution to a Python codebase. PyCSW seems like the
> perfect fit, with its outstanding standards compliancy record.
>
> Recently I opened an issue on github[1] asking for a change in PyCSW's
> database
> schema. I wanted to switch over to using native datetime types in instead of
> plain strings. Eventually, Angelos replied that such a change is not trivial
> because it would mean breaking compatibility with older versions and
> probably
> some upgrade problems to current deployments (a very wise opinion indeed).
> He
> also reccommended that I'd bring the discussion over to the mailing list.
>
> [1] - https://github.com/geopython/pycsw/issues/365
>
> Before doing so, I decided to take a more thorough look into pycsw's
> codebase
> in order to get a better notion of the impact of the proposed change.
>
> As I began my 'pycsw innards code tour', I started spotting some corners
> where
> there seemed to be oportunities to 'improve' the code (not that it is bad
> per
> se, I mean to make it more readable, easier to maintain and test, ect). I
> could not help immediately starting to hack into it and refactor some of
> this
> stuff. In my frenzy, I ended up changing a lot of stuff, and it is not even
> properly done yet. However, I feel like the changes are really beneficial
> to
> pycsw, so I'd like them to be merged (once they are ready).
>
> I'd like to get some assistance from the PSC in order to figure out a way to
> propose these changes, so that they could be properly reviewed and
> accepted/rejected.
>
> Currently, all these changes are piled up in the refactor-setupdb branch in
> my
> fork of pycsw at:
>
> https://github.com/ricardogsilva/pycsw/tree/refactor-setupdb
>
>
> Here is an overview on what I have been working on since last week:
>
> * Switched to using sqlalchemy's declarative extension. This allows a
> number
>  of simplifications to be made in pycsw's codebase:
>
>  * The definition of repository tables is now on a separate
>    `pycsw.core.models` module, making it easier to find. Previously, the
>    definition of the tables was done using sqlalchemy's classical mappings
>    and was present in the `pycsw.core.admin` module
>
>  * Having an explicitly defined class for metadata records allows us to
>    reference it directly in the code whenever there is a need to query the
>    database. This means we no longer need to use reflection of the database
>    schema whenever we need to work with a record. We can call
>    `from pycsw.core.models import Record` and then use the `Record` class
>
>  * The declarative extension allows us to remap the names of the database
>    table and columns on the fly without changing the names of the
> attributes
>    on the mapped class. This means that:
>
>    * The user can still define custom database mappings in the same way as
>      before
>
>    * Internally, pycsw does not need to care about the actual names of the
>      database columns because the names of the attributes on the `Record`
>      class remain the same.
>
>   The benefit of this change is that we no longer need to be converting
> from
>   `pycsw:Identifier` (and the other mappings) all the time, we can use
>   `record.identifier`. Whenever the user changes the mappings, a quick
> call to
>   `context.update_mappings` takes care of the underlying changes in
>   sqlalchemy's communication with the database.
>
>   This change also opens up a number of interesting possibilities that we
>   may evaluate in the future. One such possibility is adding support for
>   changing the database schema using the concept of migrations. The
>   sqlalchemy author also maintains the alembic project that deals with
> this.
>   If at some point pycsw chooses to alter the schema of the database, it
>   should become easier to upgrade existing deployments by providing a clear
>   migration path and some scripts to automate the process.
>
> * Refactored the metadata parsing code to take advantage of the new
>  models.Record class.
>
> * Extended the repository concept to deal with everything related with the
>  database.
>
>  Currently there exists the `pycsw.core.repository` module that works with
>  an already existing database and the `pycsw.core.admin` which features the
>  `setup_db` function to take care of database initialization.
>
>  The proposed solution refactors the code into the
> `pycsw.core.repositories`
>  package. There is now a base class
> `pycsw.core.repositories.base.Repository`
>  that performs both tasks of initializing the database and interacting
>  with it. The base class is then extended into several subclasses that deal
>  mostly with the initialization of each particular backend.
>
> * Unified configuration settings and initialization tasks into a common
> object
>
>  The current approach is to use `pycsw.core.config.StaticContext` as an
>  aggregator of several settings that are used by other parts of the
> codebase.
>  However, since this class is not used to store the user-defined settings,
>  there seems to be an opportunity to take the aggregation concept a step
>  further
>
>  The proposed solution implements the `pycsw.core.configuration` package.
> This
>  package includes the `configuration.Context` class that is used to group
>  together in a single entity:
>
>  * pycsw's internal data model and settings
>  * user defined settings
>  * the current repository
>  * the current database mappings
>
>  The Context class becomes central to pycsw's workflow, since it stores
>  references to all runtime and static resources.
>
>  It can be initialized with a number of different resources:
>
>  * A .cfg file that is read with python's SafeConfigParser, just as before
>  * An in-memory SafeConfigParser object or dict object (that could have
>    been passed by the cgi, wsgi or other wrapper), just as before
>  * A .json file that is parsed into a dict with python's json module
>
>  Another nice effect of this change is that the user settings are now
> stored
>  internally as class attributes and not as section/option pairs in a
>  SafeconfigParser. This allows using python types such as booleans, lists,
>  etc in the runtime settings, which simplifies the parsing of these values
>
>  The context uses default values for all settings, which makes it possible
> to
>  only override some of them and still have working code. For example, one
>  could setup a context for testing using an in-memory sqlite database by
>  calling:
>
>  .. code:: python
>
>     from pycsw.core.configuration.configuration import Context
>     from pycsw.core.models import Record
>     from pycsw.core.etree import etree
>
>     ctx = Context({"database": "sqlite:///:memory:"})
>     ctx.repository.setup_db()
>     exml = etree.parse("some_xml_record")
>     ctx.repository.insert_record(exml)
>     ctx.repository.query(Record).filter(Record.identifier==1234)
>
> Now that I've came to my senses I realize that it was a really bad idea to
> tackle so much stuff at once, since it makes it harder to stabilize the code
> and also its harder for someone else to sift through changes. I apologize
> for
> that.
>
> I will focus on finishing up these changes in order to get to a working
> state
> where all previous functionality is restored and will not change anything
> else,
> apart from the essential refactor to make the rest of the code work with
> the
> changes.
>
> I guess I need help in order to come up with a way to propose these changes.
> Should I prepare an RFC with all the changes together? Should I try to
> isolate
> them and propose each as an enhancement? Something else?
>
> Thanks a lot and sorry for the trouble ;)
>
>
>
>
> -- 
> ___________________________ ___ __
> Ricardo Garcia Silva
>