[pycsw-devel] help on finding a way to propose some changes to pycsw

Angelos Tzotsos gcpp.kalxas at gmail.com
Sat Jul 18 03:22:20 PDT 2015

Hi Ricardo,

Indeed very impressive work here. Thank you.

I agree with Tom regarding the series of pull requests. We need to 
understand the changes in depth and make sure the code base remains 
stable in the process. We will help you writing the necessary RFCs, 
especially for the deep changes.

Thanks again,

On 07/17/2015 05:27 PM, Tom Kralidis wrote:
> Hi Ricardo: thanks very much for the deep deep investigation and analysis
> to the codebase -- wow!
> I can't speak for all pycsw devs but I will say there is certainly 
> alot of room
> for improvement in pycsw and contributions are always welcome. For the 
> most
> part the (very nice!) enhancements / refactoring you outline make sense.
> At this point I would suggest submitting the changes as a series of
> granular, logical GitHub pull requests (as you have done so far). This 
> will
> allow devs to do code review in a manageable fashion, and be able to 
> track
> down unit testing issues against travis-ci.
> For deeper impact changes like using native datetime types I would 
> recommend
> working with a pycsw dev to draft an RFC.
> Looking forward to these valuable contributions!
> Thanks
> ..Tom
> On Thu, 16 Jul 2015, Ricardo Filipe Soares Garcia da wrote:
>> Date: Thu, 16 Jul 2015 12:11:26 +0100
>> From: Ricardo Filipe Soares Garcia da <ricardo.garcia.silva at gmail.com>
>> To: pycsw-devel at lists.osgeo.org
>> Subject: [pycsw-devel] help on finding a way to propose some changes 
>> to pycsw
>> Hi PSC, all
>> (Sorry for this very long post,
>> TL;DR: I've been making some heavy changes to pycsw's codebase and would
>> like to find a nice way for them to be reviewed and possibly mainlined
>> )
>> I have recently began working with pycsw and am keen to contribute to 
>> the
>> project.
>> My motivation:
>> I am working at the portuguese sea and atmosphere institute, for a 
>> project
>> of the Global Land component of the Copernicus program. We produce
>> earth observation products based on the processing of satellite imagery.
>> Our production center is working 24/7 and we have new products every 
>> hour.
>> Each of these products gets also an INSPIRE compliant XML metadata 
>> record
>> that is fed into our CSW catalogue and is available for searching to our
>> project's partners. Being a Python developer, I am really interested in
>> moving our current CSW solution to a Python codebase. PyCSW seems 
>> like the
>> perfect fit, with its outstanding standards compliancy record.
>> Recently I opened an issue on github[1] asking for a change in PyCSW's
>> database
>> schema. I wanted to switch over to using native datetime types in 
>> instead of
>> plain strings. Eventually, Angelos replied that such a change is not 
>> trivial
>> because it would mean breaking compatibility with older versions and
>> probably
>> some upgrade problems to current deployments (a very wise opinion 
>> indeed).
>> He
>> also reccommended that I'd bring the discussion over to the mailing 
>> list.
>> [1] - https://github.com/geopython/pycsw/issues/365
>> Before doing so, I decided to take a more thorough look into pycsw's
>> codebase
>> in order to get a better notion of the impact of the proposed change.
>> As I began my 'pycsw innards code tour', I started spotting some corners
>> where
>> there seemed to be oportunities to 'improve' the code (not that it is 
>> bad
>> per
>> se, I mean to make it more readable, easier to maintain and test, 
>> ect). I
>> could not help immediately starting to hack into it and refactor some of
>> this
>> stuff. In my frenzy, I ended up changing a lot of stuff, and it is 
>> not even
>> properly done yet. However, I feel like the changes are really 
>> beneficial
>> to
>> pycsw, so I'd like them to be merged (once they are ready).
>> I'd like to get some assistance from the PSC in order to figure out a 
>> way to
>> propose these changes, so that they could be properly reviewed and
>> accepted/rejected.
>> Currently, all these changes are piled up in the refactor-setupdb 
>> branch in
>> my
>> fork of pycsw at:
>> https://github.com/ricardogsilva/pycsw/tree/refactor-setupdb
>> Here is an overview on what I have been working on since last week:
>> * Switched to using sqlalchemy's declarative extension. This allows a
>> number
>>  of simplifications to be made in pycsw's codebase:
>>  * The definition of repository tables is now on a separate
>>    `pycsw.core.models` module, making it easier to find. Previously, the
>>    definition of the tables was done using sqlalchemy's classical 
>> mappings
>>    and was present in the `pycsw.core.admin` module
>>  * Having an explicitly defined class for metadata records allows us to
>>    reference it directly in the code whenever there is a need to 
>> query the
>>    database. This means we no longer need to use reflection of the 
>> database
>>    schema whenever we need to work with a record. We can call
>>    `from pycsw.core.models import Record` and then use the `Record` 
>> class
>>  * The declarative extension allows us to remap the names of the 
>> database
>>    table and columns on the fly without changing the names of the
>> attributes
>>    on the mapped class. This means that:
>>    * The user can still define custom database mappings in the same 
>> way as
>>      before
>>    * Internally, pycsw does not need to care about the actual names 
>> of the
>>      database columns because the names of the attributes on the 
>> `Record`
>>      class remain the same.
>>   The benefit of this change is that we no longer need to be converting
>> from
>>   `pycsw:Identifier` (and the other mappings) all the time, we can use
>>   `record.identifier`. Whenever the user changes the mappings, a quick
>> call to
>>   `context.update_mappings` takes care of the underlying changes in
>>   sqlalchemy's communication with the database.
>>   This change also opens up a number of interesting possibilities 
>> that we
>>   may evaluate in the future. One such possibility is adding support for
>>   changing the database schema using the concept of migrations. The
>>   sqlalchemy author also maintains the alembic project that deals with
>> this.
>>   If at some point pycsw chooses to alter the schema of the database, it
>>   should become easier to upgrade existing deployments by providing a 
>> clear
>>   migration path and some scripts to automate the process.
>> * Refactored the metadata parsing code to take advantage of the new
>>  models.Record class.
>> * Extended the repository concept to deal with everything related 
>> with the
>>  database.
>>  Currently there exists the `pycsw.core.repository` module that works 
>> with
>>  an already existing database and the `pycsw.core.admin` which 
>> features the
>>  `setup_db` function to take care of database initialization.
>>  The proposed solution refactors the code into the
>> `pycsw.core.repositories`
>>  package. There is now a base class
>> `pycsw.core.repositories.base.Repository`
>>  that performs both tasks of initializing the database and interacting
>>  with it. The base class is then extended into several subclasses 
>> that deal
>>  mostly with the initialization of each particular backend.
>> * Unified configuration settings and initialization tasks into a common
>> object
>>  The current approach is to use `pycsw.core.config.StaticContext` as an
>>  aggregator of several settings that are used by other parts of the
>> codebase.
>>  However, since this class is not used to store the user-defined 
>> settings,
>>  there seems to be an opportunity to take the aggregation concept a step
>>  further
>>  The proposed solution implements the `pycsw.core.configuration` 
>> package.
>> This
>>  package includes the `configuration.Context` class that is used to 
>> group
>>  together in a single entity:
>>  * pycsw's internal data model and settings
>>  * user defined settings
>>  * the current repository
>>  * the current database mappings
>>  The Context class becomes central to pycsw's workflow, since it stores
>>  references to all runtime and static resources.
>>  It can be initialized with a number of different resources:
>>  * A .cfg file that is read with python's SafeConfigParser, just as 
>> before
>>  * An in-memory SafeConfigParser object or dict object (that could have
>>    been passed by the cgi, wsgi or other wrapper), just as before
>>  * A .json file that is parsed into a dict with python's json module
>>  Another nice effect of this change is that the user settings are now
>> stored
>>  internally as class attributes and not as section/option pairs in a
>>  SafeconfigParser. This allows using python types such as booleans, 
>> lists,
>>  etc in the runtime settings, which simplifies the parsing of these 
>> values
>>  The context uses default values for all settings, which makes it 
>> possible
>> to
>>  only override some of them and still have working code. For example, 
>> one
>>  could setup a context for testing using an in-memory sqlite database by
>>  calling:
>>  .. code:: python
>>     from pycsw.core.configuration.configuration import Context
>>     from pycsw.core.models import Record
>>     from pycsw.core.etree import etree
>>     ctx = Context({"database": "sqlite:///:memory:"})
>>     ctx.repository.setup_db()
>>     exml = etree.parse("some_xml_record")
>>     ctx.repository.insert_record(exml)
>>     ctx.repository.query(Record).filter(Record.identifier==1234)
>> Now that I've came to my senses I realize that it was a really bad 
>> idea to
>> tackle so much stuff at once, since it makes it harder to stabilize 
>> the code
>> and also its harder for someone else to sift through changes. I 
>> apologize
>> for
>> that.
>> I will focus on finishing up these changes in order to get to a working
>> state
>> where all previous functionality is restored and will not change 
>> anything
>> else,
>> apart from the essential refactor to make the rest of the code work with
>> the
>> changes.
>> I guess I need help in order to come up with a way to propose these 
>> changes.
>> Should I prepare an RFC with all the changes together? Should I try to
>> isolate
>> them and propose each as an enhancement? Something else?
>> Thanks a lot and sorry for the trouble ;)
>> -- 
>> ___________________________ ___ __
>> Ricardo Garcia Silva
> _______________________________________________
> pycsw-devel mailing list
> pycsw-devel at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/pycsw-devel

Angelos Tzotsos, PhD
OSGeo Charter Member

More information about the pycsw-devel mailing list