[pycsw-devel] help on finding a way to propose some changes to pycsw
Tom Kralidis
tomkralidis at gmail.com
Fri Jul 17 07:27:32 PDT 2015
Hi Ricardo: thanks very much for the deep deep investigation and analysis
to the codebase -- wow!
I can't speak for all pycsw devs but I will say there is certainly alot of room
for improvement in pycsw and contributions are always welcome. For the most
part the (very nice!) enhancements / refactoring you outline make sense.
At this point I would suggest submitting the changes as a series of
granular, logical GitHub pull requests (as you have done so far). This will
allow devs to do code review in a manageable fashion, and be able to track
down unit testing issues against travis-ci.
For deeper impact changes like using native datetime types I would recommend
working with a pycsw dev to draft an RFC.
Looking forward to these valuable contributions!
Thanks
..Tom
On Thu, 16 Jul 2015, Ricardo Filipe Soares Garcia da wrote:
> Date: Thu, 16 Jul 2015 12:11:26 +0100
> From: Ricardo Filipe Soares Garcia da <ricardo.garcia.silva at gmail.com>
> To: pycsw-devel at lists.osgeo.org
> Subject: [pycsw-devel] help on finding a way to propose some changes to pycsw
>
> Hi PSC, all
>
> (Sorry for this very long post,
>
> TL;DR: I've been making some heavy changes to pycsw's codebase and would
> like to find a nice way for them to be reviewed and possibly mainlined
>
> )
>
>
> I have recently began working with pycsw and am keen to contribute to the
> project.
>
> My motivation:
>
> I am working at the portuguese sea and atmosphere institute, for a project
> of the Global Land component of the Copernicus program. We produce
> earth observation products based on the processing of satellite imagery.
> Our production center is working 24/7 and we have new products every hour.
> Each of these products gets also an INSPIRE compliant XML metadata record
> that is fed into our CSW catalogue and is available for searching to our
> project's partners. Being a Python developer, I am really interested in
> moving our current CSW solution to a Python codebase. PyCSW seems like the
> perfect fit, with its outstanding standards compliancy record.
>
> Recently I opened an issue on github[1] asking for a change in PyCSW's
> database
> schema. I wanted to switch over to using native datetime types in instead of
> plain strings. Eventually, Angelos replied that such a change is not trivial
> because it would mean breaking compatibility with older versions and
> probably
> some upgrade problems to current deployments (a very wise opinion indeed).
> He
> also reccommended that I'd bring the discussion over to the mailing list.
>
> [1] - https://github.com/geopython/pycsw/issues/365
>
> Before doing so, I decided to take a more thorough look into pycsw's
> codebase
> in order to get a better notion of the impact of the proposed change.
>
> As I began my 'pycsw innards code tour', I started spotting some corners
> where
> there seemed to be oportunities to 'improve' the code (not that it is bad
> per
> se, I mean to make it more readable, easier to maintain and test, ect). I
> could not help immediately starting to hack into it and refactor some of
> this
> stuff. In my frenzy, I ended up changing a lot of stuff, and it is not even
> properly done yet. However, I feel like the changes are really beneficial
> to
> pycsw, so I'd like them to be merged (once they are ready).
>
> I'd like to get some assistance from the PSC in order to figure out a way to
> propose these changes, so that they could be properly reviewed and
> accepted/rejected.
>
> Currently, all these changes are piled up in the refactor-setupdb branch in
> my
> fork of pycsw at:
>
> https://github.com/ricardogsilva/pycsw/tree/refactor-setupdb
>
>
> Here is an overview on what I have been working on since last week:
>
> * Switched to using sqlalchemy's declarative extension. This allows a
> number
> of simplifications to be made in pycsw's codebase:
>
> * The definition of repository tables is now on a separate
> `pycsw.core.models` module, making it easier to find. Previously, the
> definition of the tables was done using sqlalchemy's classical mappings
> and was present in the `pycsw.core.admin` module
>
> * Having an explicitly defined class for metadata records allows us to
> reference it directly in the code whenever there is a need to query the
> database. This means we no longer need to use reflection of the database
> schema whenever we need to work with a record. We can call
> `from pycsw.core.models import Record` and then use the `Record` class
>
> * The declarative extension allows us to remap the names of the database
> table and columns on the fly without changing the names of the
> attributes
> on the mapped class. This means that:
>
> * The user can still define custom database mappings in the same way as
> before
>
> * Internally, pycsw does not need to care about the actual names of the
> database columns because the names of the attributes on the `Record`
> class remain the same.
>
> The benefit of this change is that we no longer need to be converting
> from
> `pycsw:Identifier` (and the other mappings) all the time, we can use
> `record.identifier`. Whenever the user changes the mappings, a quick
> call to
> `context.update_mappings` takes care of the underlying changes in
> sqlalchemy's communication with the database.
>
> This change also opens up a number of interesting possibilities that we
> may evaluate in the future. One such possibility is adding support for
> changing the database schema using the concept of migrations. The
> sqlalchemy author also maintains the alembic project that deals with
> this.
> If at some point pycsw chooses to alter the schema of the database, it
> should become easier to upgrade existing deployments by providing a clear
> migration path and some scripts to automate the process.
>
> * Refactored the metadata parsing code to take advantage of the new
> models.Record class.
>
> * Extended the repository concept to deal with everything related with the
> database.
>
> Currently there exists the `pycsw.core.repository` module that works with
> an already existing database and the `pycsw.core.admin` which features the
> `setup_db` function to take care of database initialization.
>
> The proposed solution refactors the code into the
> `pycsw.core.repositories`
> package. There is now a base class
> `pycsw.core.repositories.base.Repository`
> that performs both tasks of initializing the database and interacting
> with it. The base class is then extended into several subclasses that deal
> mostly with the initialization of each particular backend.
>
> * Unified configuration settings and initialization tasks into a common
> object
>
> The current approach is to use `pycsw.core.config.StaticContext` as an
> aggregator of several settings that are used by other parts of the
> codebase.
> However, since this class is not used to store the user-defined settings,
> there seems to be an opportunity to take the aggregation concept a step
> further
>
> The proposed solution implements the `pycsw.core.configuration` package.
> This
> package includes the `configuration.Context` class that is used to group
> together in a single entity:
>
> * pycsw's internal data model and settings
> * user defined settings
> * the current repository
> * the current database mappings
>
> The Context class becomes central to pycsw's workflow, since it stores
> references to all runtime and static resources.
>
> It can be initialized with a number of different resources:
>
> * A .cfg file that is read with python's SafeConfigParser, just as before
> * An in-memory SafeConfigParser object or dict object (that could have
> been passed by the cgi, wsgi or other wrapper), just as before
> * A .json file that is parsed into a dict with python's json module
>
> Another nice effect of this change is that the user settings are now
> stored
> internally as class attributes and not as section/option pairs in a
> SafeconfigParser. This allows using python types such as booleans, lists,
> etc in the runtime settings, which simplifies the parsing of these values
>
> The context uses default values for all settings, which makes it possible
> to
> only override some of them and still have working code. For example, one
> could setup a context for testing using an in-memory sqlite database by
> calling:
>
> .. code:: python
>
> from pycsw.core.configuration.configuration import Context
> from pycsw.core.models import Record
> from pycsw.core.etree import etree
>
> ctx = Context({"database": "sqlite:///:memory:"})
> ctx.repository.setup_db()
> exml = etree.parse("some_xml_record")
> ctx.repository.insert_record(exml)
> ctx.repository.query(Record).filter(Record.identifier==1234)
>
> Now that I've came to my senses I realize that it was a really bad idea to
> tackle so much stuff at once, since it makes it harder to stabilize the code
> and also its harder for someone else to sift through changes. I apologize
> for
> that.
>
> I will focus on finishing up these changes in order to get to a working
> state
> where all previous functionality is restored and will not change anything
> else,
> apart from the essential refactor to make the rest of the code work with
> the
> changes.
>
> I guess I need help in order to come up with a way to propose these changes.
> Should I prepare an RFC with all the changes together? Should I try to
> isolate
> them and propose each as an enhancement? Something else?
>
> Thanks a lot and sorry for the trouble ;)
>
>
>
>
> --
> ___________________________ ___ __
> Ricardo Garcia Silva
>
More information about the pycsw-devel
mailing list