How it works¶

This page covers more advanced topics to understand in depth how data models are created and how data is loaded to the database. This can help with troubleshooting or for advanced use cases.

Building a data model¶

A XML document is a tree-like structure where every element is either a simple type (i.e. a scalar value) or a complex type which has children which can be simple types or complex types. Elements can also have attributes, which are also scalar values, which xml2db will handle just as simple type children. In this page we will call "properties" the simple type children of an element or its attributes.

The general idea of xml2db is to convert complex types to a database tables, and simple types or XML attributes into columns in these tables. When a complex type has complex type children, they are themselves stored in other tables, and related to their parents with a foreign key constraint, or a relationship table which holds foreign key constraints to both related tables.

The data model created by xml2db is mostly bijective with the original XML document, which can in many cases be extracted and converted back to a XML document as it was loaded into the database.

Note

XSD specification allows multiple-root schemas. This means that XML documents conforming to this schema can have different root elements (XML files will always have just one root element, but they can be different between different XML files conforming to the same schema).

xml2db handles this by creating a "virtual" root table, named after the short_name provided to the DataModel constructor, which will act as a root node in all XML files loaded in this data model.

Hash-based deduplication¶

By default, xml2db tries to reduce storage footprint in the database by storing only once every subtree from the original XML document, if it is already present in the database. This deduplication process takes into account the whole subtree starting from a given element, and not only the direct children of an element.

Taking advantage of the initial tree structure, after parsing the XML document into a python dict, we compute a hash for each node, which includes all its properties and all its children hash, recursively. Two nodes with the same hash are thus identical, so only one of them needs to be stored, even if they appear under different parent nodes.

Hash are stored in the database, with a unique constraint (as a column with binary type named xml2db_record_hash). The primary key of all databases is an auto-incremented integer column, always named pk_table_name, table_name being the name of the table.

In some cases, especially when only few duplicates are expected, it may be more efficient to allow duplicated nodes in order to avoid extra tables to store relationships. This can be configured for each table of the data model.

Modeling relationships¶

Within a tree, parent-child relationship can have 1-1 or 1-n cardinality. As we want to reuse child nodes, we convert these relationships to n-1 or n-n, respectively. Besides, some children are optional. This does not affect the representation of relationships in any way.

A same child node (same hash) which is used under different parents will have several parents after the "recycling" process, while it had only one parent (because of the tree structure) in the initial dataset. This example illustrates a 1-1 relationship converted to n-1 after reusing nodes:

erDiagram
          TRADE ||--|| CONTRACT : concerns
          UNIQUE_TRADE }|--|| UNIQUE_CONTRACT : concerns

In that case, UNIQUE_TRADE holds a foreign key relationship with UNIQUE_CONTRACT, i.e. it has a column named fk_UNIQUE_CONTRACT which contains the primary key values of related UNIQUE_CONTRACT table (named pk_UNIQUE_CONTRACT in the UNIQUE_CONTRACT table).

Note

The form of each end of the links in the entity relationships diagram represent the cardinality of the relationship, which can be exactly 1, 0 or 1, 0 or many, 1 or many. You can find the symbols here.

For 1-n relationships, the same idea applies: a child node can have multiple parents if it was used under different parent nodes.

erDiagram
          CONTRACT ||--|{ DELIVERY_PROFILE : delivers
          UNIQUE_CONTRACT }|--|{ UNIQUE_DELIVERY_PROFILE : delivers

In a SQL relational data model, 1-n relationships can be easily represented with foreign keys. n-n relationships require however an additional table holding the relationship, which gives, for the last example with contracts and delivery profiles:

erDiagram
          CONTRACT ||--|{ CONTRACT_DELIVERY_PROFILE : is_in
          CONTRACT_DELIVERY_PROFILE }|--|| DELIVERY_PROFILE : involves

Note

By default, these n-n relationships which involve a third table to hold the relationship are not explicitly shown in the data model visualisation generated by DataModel.get_entity_rel_diagram, to favor readability. However, the relationships links on the diagram show an asterisk (*) when a relationship table is involved. These relationship tables are named using the name of the two related tables separated with an underscore.

Duplicated elements¶

As explained previously, deduplication can be opted out to avoid the complexity of an intermediary table holding a n-n relationship. In that case, the relationship will stay a 1-n relationship, which will be modeled with the child element holding a foreign key relationship to its parent.

In that case, no hash is stored for the children as there are effectively duplicated elements. The column fk_parent_tablename in the child table holds the primary keys of associated parent rows in the parent table tablename.

This choice is made for each table individually, and children of a duplicated element can effectively be themselves stored as deduplicated elements.

This means that as the end, there can be both 1-n and n-1 relationships involved to represent a tree structure, which means that the order of dependencies of the resulting tables will not be the same as the original tree structure. For instance, when we want to make sure to process all dependent tables before processing a table, we won't likely start with the root table of the tree, as it would be expected without the de-duplication process.

Caveats¶

xml2db handles a variety of data models, but does not cover all possible schemas allowed by the XML schema documents specification.

Some known cases which are not supported by xml2db are described below. Other cases can fail and may require some adjustments to work. We recommend thorough testing for more "exotic" schemas; for instance it is possible to implement "round-trip" tests from sample XML files to database and back to XML, and compare the resulting XML file with the original one.

Recursive XSD¶

Recursive XML schemas are not fully supported, because they result in cycles in tables dependencies, which would make the process much more complex. Whenever a field which would introduce a dependency cycle is detected in the XSD, it is discarded with a warning, which means that the corresponding data in XML files will not be imported. The rest of the data should be processed correctly.

Mixed content elements¶

XML elements with mixed content can contain both text and children elements (tags). xml2db offers partial support for this: the text value will be stored in a specific column named value, but it will not record the specific sequence of text and children elements within a node. When using Document.to_xml, text value will always appear before children elements.

Loading process¶

This section gives more detailed explanations on how parsing and loading data work. The integration of an XML file can be decomposed in lower level steps described below.

Parsing a XML document¶

First, we load all the XML document in memory using lxml and extract the data as a nested dict, where each node keeps a reference of its type, and store its data content. This task is achieved by the function XMLConverter.parse_xml.

This limits the size of files that can be loaded, due to memory limitations. The merging database transaction also limits the size of the files that can be loaded, depending on the server performance. On the other hand, handling data in memory makes the processing way simpler and faster. We handle files with a size around 500 MB without any issue.

Computing hashes¶

We compute tree hashes recursively by adding to each node's hash the hashes of its children element, be it simple types, attributes or complex types. Children are processed in the specific order they appeared in the XSD schema, so that hashing is really deterministic.

Right after this step, a hook function is called if provided in the configuration (top-level document_tree_hook option in the configuration dict), which gives direct access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used, for instance, to prune or modify some parts of the document tree.

Extracting data¶

We extract data from the tree structure based on the type (table name) they belong to, walking the data tree down. Doing so, we create temporary incremental primary keys for each row and perform deduplication when it is needed, using the hashes previously computed. We store relationship as they will be stored in the database, using the temporary primary key of the other table related rows in a foreign key column.

At this stage, we have a data representation that matches the one we will find in the database, except that it contains only the data from the XML file we just parsed, and the final primary keys and foreign keys won't be the same.

Loading the data¶

The data we converted is then loaded to the database in a separate set of tables, which have the same names that the target tables, but prefixed with temp_XXX (with XXX being a random 8 characters uuid string, by default).

We keep the primary keys from the flat data model created at the previous stage, as temporary keys.

Tip

The temp_prefix argument can be passed to the constructor of DataModel to use a specific prefix instead of the random one, which can be useful if you want to decompose the process of loading data and merging it with the target tables later, for instance to gain a finer control over concurrency.

Merging the data¶

The last step is to merge the temporary tables data into the target tables, while enforcing deduplication, keeping relationships, etc.

This is done by issuing a sequence of update and insert SQL statements using sqlalchemy, in a single transaction (default) or in multiple transactions.

The process boils down to:

inserting missing records into the target tables,
getting back the auto-incremented primary keys into the temporary tables,
updating relationship to use target primary keys instead of temporary primary keys,
continue with the next table.

Summing up¶

The whole loading process can be achieved by the high level functions DataModel.parse_xml and Document.insert_into_target_tables. However, the later can be decomposed in lower level function calls, for instance if you want to separate the loading to the temporary tables and the merge operation into the target tables. You can have a look at Document.insert_into_target_tables source code to see how the lower level steps are stitched together.

Extracting the data back to XML¶

Extracting the data from the database and converting it back to XML follow similar steps, in reverse order.

Info

Extracting the data from the database is not very optimized and is actually currently quite slow, mostly due to complex join queries to retrieve data based on a filter only on the top node. This feature is currently useful for roundtrip test, but has limited value otherwise, because of its poor performance compared to loading.

Querying data from the database¶

Walking through the data model tree, we query all tables using a chain of joins to the root table, on which we apply the where clause provided to the function DataModel.extract_from_database. Results are stored in a Document instance in flat tables, with their primary keys and foreign keys taken from the database.

Converting flat data to document tree¶

Starting from the flat data representation, we build a document tree recursively. This is done by the function Document.flat_data_to_doc_tree, which performs the opposite conversion as Document.doc_tree_to_flat_data.

Building an XML file¶

From the document tree, we build an XML file using XMLConverter.to_xml. This conversion is reversible with some caveat, regarding mostly the formatting of numbers and dates.