<div dir="ltr"><div><div>Hi PSC, all</div><div><br></div><div>(Sorry for this very long post,</div><div><br></div><div>TL;DR: I've been making some heavy changes to pycsw's codebase and would </div><div>like to find a nice way for them to be reviewed and possibly mainlined</div><div><br></div><div>)</div><div><br></div><div><br></div><div>I have recently began working with pycsw and am keen to contribute to the</div><div>project.</div><div><br></div><div>My motivation:</div><div><br></div><div>I am working at the portuguese sea and atmosphere institute, for a project</div><div>of the Global Land component of the Copernicus program. We produce</div><div>earth observation products based on the processing of satellite imagery.</div><div>Our production center is working 24/7 and we have new products every hour.</div><div>Each of these products gets also an INSPIRE compliant XML metadata record</div><div>that is fed into our CSW catalogue and is available for searching to our</div><div>project's partners. Being a Python developer, I am really interested in </div><div>moving our current CSW solution to a Python codebase. PyCSW seems like the </div><div>perfect fit, with its outstanding standards compliancy record.</div><div><br></div><div>Recently I opened an issue on github[1] asking for a change in PyCSW's database</div><div>schema. I wanted to switch over to using native datetime types in instead of</div><div>plain strings. Eventually, Angelos replied that such a change is not trivial</div><div>because it would mean breaking compatibility with older versions and probably</div><div>some upgrade problems to current deployments (a very wise opinion indeed). He</div><div>also reccommended that I'd bring the discussion over to the mailing list.</div><div><br></div><div>[1] - <a href="https://github.com/geopython/pycsw/issues/365">https://github.com/geopython/pycsw/issues/365</a></div><div><br></div><div>Before doing so, I decided to take a more thorough look into pycsw's codebase</div><div>in order to get a better notion of the impact of the proposed change.</div><div><br></div><div>As I began my 'pycsw innards code tour', I started spotting some corners where</div><div>there seemed to be oportunities to 'improve' the code (not that it is bad per </div><div>se, I mean to make it more readable, easier to maintain and test, ect). I </div><div>could not help immediately starting to hack into it and refactor some of this </div><div>stuff. In my frenzy, I ended up changing a lot of stuff, and it is not even </div><div>properly done yet. However, I feel like the changes are really beneficial to </div><div>pycsw, so I'd like them to be merged (once they are ready).</div><div><br></div><div>I'd like to get some assistance from the PSC in order to figure out a way to</div><div>propose these changes, so that they could be properly reviewed and </div><div>accepted/rejected.</div><div><br></div><div>Currently, all these changes are piled up in the refactor-setupdb branch in my</div><div>fork of pycsw at:</div><div><br></div><div><a href="https://github.com/ricardogsilva/pycsw/tree/refactor-setupdb">https://github.com/ricardogsilva/pycsw/tree/refactor-setupdb</a></div><div><br></div><div><br></div><div>Here is an overview on what I have been working on since last week:</div><div><br></div><div>* Switched to using sqlalchemy's declarative extension. This allows a number </div><div> of simplifications to be made in pycsw's codebase:</div><div><br></div><div> * The definition of repository tables is now on a separate </div><div> `pycsw.core.models` module, making it easier to find. Previously, the </div><div> definition of the tables was done using sqlalchemy's classical mappings </div><div> and was present in the `pycsw.core.admin` module</div><div><br></div><div> * Having an explicitly defined class for metadata records allows us to</div><div> reference it directly in the code whenever there is a need to query the</div><div> database. This means we no longer need to use reflection of the database</div><div> schema whenever we need to work with a record. We can call</div><div> `from pycsw.core.models import Record` and then use the `Record` class</div><div><br></div><div> * The declarative extension allows us to remap the names of the database </div><div> table and columns on the fly without changing the names of the attributes</div><div> on the mapped class. This means that:</div><div><br></div><div> * The user can still define custom database mappings in the same way as </div><div> before</div><div><br></div><div> * Internally, pycsw does not need to care about the actual names of the </div><div> database columns because the names of the attributes on the `Record` </div><div> class remain the same.</div><div><br></div><div> The benefit of this change is that we no longer need to be converting from</div><div> `pycsw:Identifier` (and the other mappings) all the time, we can use</div><div> `record.identifier`. Whenever the user changes the mappings, a quick call to</div><div> `context.update_mappings` takes care of the underlying changes in </div><div> sqlalchemy's communication with the database.</div><div><br></div><div> This change also opens up a number of interesting possibilities that we </div><div> may evaluate in the future. One such possibility is adding support for</div><div> changing the database schema using the concept of migrations. The </div><div> sqlalchemy author also maintains the alembic project that deals with this.</div><div> If at some point pycsw chooses to alter the schema of the database, it </div><div> should become easier to upgrade existing deployments by providing a clear</div><div> migration path and some scripts to automate the process.</div><div><br></div><div>* Refactored the metadata parsing code to take advantage of the new</div><div> models.Record class.</div><div><br></div><div>* Extended the repository concept to deal with everything related with the</div><div> database.</div><div><br></div><div> Currently there exists the `pycsw.core.repository` module that works with</div><div> an already existing database and the `pycsw.core.admin` which features the</div><div> `setup_db` function to take care of database initialization.</div><div><br></div><div> The proposed solution refactors the code into the `pycsw.core.repositories`</div><div> package. There is now a base class `pycsw.core.repositories.base.Repository`</div><div> that performs both tasks of initializing the database and interacting</div><div> with it. The base class is then extended into several subclasses that deal</div><div> mostly with the initialization of each particular backend.</div><div><br></div><div>* Unified configuration settings and initialization tasks into a common object</div><div><br></div><div> The current approach is to use `pycsw.core.config.StaticContext` as an</div><div> aggregator of several settings that are used by other parts of the codebase.</div><div> However, since this class is not used to store the user-defined settings,</div><div> there seems to be an opportunity to take the aggregation concept a step</div><div> further</div><div><br></div><div> The proposed solution implements the `pycsw.core.configuration` package. This</div><div> package includes the `configuration.Context` class that is used to group</div><div> together in a single entity:</div><div><br></div><div> * pycsw's internal data model and settings</div><div> * user defined settings</div><div> * the current repository</div><div> * the current database mappings</div><div><br></div><div> The Context class becomes central to pycsw's workflow, since it stores</div><div> references to all runtime and static resources.</div><div><br></div><div> It can be initialized with a number of different resources:</div><div><br></div><div> * A .cfg file that is read with python's SafeConfigParser, just as before</div><div> * An in-memory SafeConfigParser object or dict object (that could have </div><div> been passed by the cgi, wsgi or other wrapper), just as before</div><div> * A .json file that is parsed into a dict with python's json module</div><div><br></div><div> Another nice effect of this change is that the user settings are now stored</div><div> internally as class attributes and not as section/option pairs in a </div><div> SafeconfigParser. This allows using python types such as booleans, lists, </div><div> etc in the runtime settings, which simplifies the parsing of these values</div><div><br></div><div> The context uses default values for all settings, which makes it possible to</div><div> only override some of them and still have working code. For example, one </div><div> could setup a context for testing using an in-memory sqlite database by </div><div> calling:</div><div><br></div><div> .. code:: python</div><div><br></div><div> from pycsw.core.configuration.configuration import Context</div><div> from pycsw.core.models import Record</div><div> from pycsw.core.etree import etree</div><div><br></div><div> ctx = Context({"database": "sqlite:///:memory:"})</div><div> ctx.repository.setup_db()</div><div> exml = etree.parse("some_xml_record")</div><div> ctx.repository.insert_record(exml)</div><div> ctx.repository.query(Record).filter(Record.identifier==1234)</div><div><br></div><div>Now that I've came to my senses I realize that it was a really bad idea to </div><div>tackle so much stuff at once, since it makes it harder to stabilize the code</div><div>and also its harder for someone else to sift through changes. I apologize for</div><div>that.</div><div><br></div><div>I will focus on finishing up these changes in order to get to a working state </div><div>where all previous functionality is restored and will not change anything else,</div><div>apart from the essential refactor to make the rest of the code work with the </div><div>changes.</div><div><br></div><div>I guess I need help in order to come up with a way to propose these changes.</div><div>Should I prepare an RFC with all the changes together? Should I try to isolate</div><div>them and propose each as an enhancement? Something else?</div><div><br></div><div>Thanks a lot and sorry for the trouble ;)</div></div><div><br></div><div><br></div><br clear="all"><div><br></div>-- <br><div class="gmail_signature">___________________________ ___ __<br>Ricardo Garcia Silva</div>
</div>