Skip to content

Configuring your data model

The data model in the database is derived automatically from a XML schema definition file (XSD) you provide. It is a set of tables linked by foreign keys relationships. Basically, each complexType of the XML schema definition corresponds to a table in the target database data model. Each table is named after the first element name of this type, with de-duplication if needed. Columns in a table corresponds to simpleType elements within a complex type and its attributes. Columns are named after the names of children XML elements or attributes.

xml2db applies a few simplifications to the original data model by default, but they can also be opted-out or forced through the configuration dict provided to the DataModel constructor.

The column types can also be configured to override the default type mapping, using sqlalchemy types.

Tip

We recommend that you first build the data model without any configuration, visualize it as a text tree or ER diagram (see the Getting started page for directions on how to visualize data models) and then adapt the configuration if need be.

Configuration options are described below. Some options can be set at the model level, others at the table level and others at the field level. The general structure of the configuration dict is the following:

Model config general structure
{
    "document_tree_hook": None,
    "document_tree_node_hook": None,
    "row_numbers": False,
    "as_columnstore": False,
    "metadata_columns": None,
    "tables": {
        "table1": {
            "reuse": True,
            "choice_transform": False,
            "as_columnstore": False,
            "fields": {
                "my_column": {
                    "type": None #default type
                } 
            },
            "extra_args": [],
        }
    }
}

Model configuration

The following options can be passed as a top-level keys of the model configuration dict:

  • document_tree_hook (Callable): sets a hook function which can modify the data extracted from the XML. It gives direct access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used, for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree should of course stay compatible with the data model.
  • document_tree_node_hook (Callable): sets a hook function which can modify the data extracted from the XML. It is similar with document_tree_hook, but it is call as soon as a node is completed, not waiting for the entire parsing to finish. It is especially useful if you intend to filter out some nodes and reduce memory footprint while parsing.
  • row_numbers (bool): adds xml2db_row_number columns either to n-n relationships tables, or directly to data tables when deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The default value is False (disabled).
  • as_columnstore (bool): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at the table level for each table. However, for n-n relationships tables, this option is the only way to configure the clustered columnstore indexes. The default value is False (disabled).
  • metadata_columns (list): a list of extra columns that you want to add to the root table of your model. This is useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified as dicts, the only required keys are name and type (a SQLAlchemy type object); other keys will be passed directly as keyword arguments to sqlalchemy.Column. Actual values need to be passed to DataModel.parse_xml for each parsed documents, as a dict, using the metadata argument.
  • record_hash_column_name: the column name to use to store records hash data (defaults to xml2db_record_hash).
  • record_hash_constructor: a function used to build a hash, with a signature similar to hashlib constructor functions (defaults to hashlib.sha1).
  • record_hash_size: the byte size of the record hash (defaults to 20, which is the size of a sha-1 hash).

Fields configuration

These configuration options are defined for a specific field of a specific table. A "field" refers to a column in the table, or a child table.

Data types

By default, the data type defined in the database table for each column is based on a mapping between the data type indicated in the XSD and a corresponding sqlalchemy type implemented in the following three functions:

Default: types_mapping_default

Defines the sqlalchemy type to use for given column properties in target tables

Parameters:

Name Type Description Default
temp bool

are we targeting the temporary tables schema or the final tables?

required
col DataModelColumn

an object representing a column of a table for which we are determining the SQL type to define

required

Returns:

Type Description
Any

a sqlalchemy class representing the data type to be used

Source code in xml2db/table/column.py
def types_mapping_default(temp: bool, col: "DataModelColumn") -> Any:
    """Defines the sqlalchemy type to use for given column properties in target tables

    Args:
        temp: are we targeting the temporary tables schema or the final tables?
        col: an object representing a column of a table for which we are determining the SQL type to define

    Returns:
        a sqlalchemy class representing the data type to be used
    """
    if col.occurs[1] != 1:
        return String(8000)
    if col.data_type in ["decimal", "float"]:
        return Double
    if col.data_type == "dateTime":
        return DateTime(timezone=True)
    if col.data_type == "integer" or col.data_type == "int":
        return Integer
    if col.data_type == "boolean":
        return Boolean
    if col.data_type == "byte":
        return SmallInteger
    if col.data_type == "long":
        return BigInteger
    if col.data_type == "date":
        return String(16)
    if col.data_type == "time":
        return String(18)
    if col.data_type in ["string", "NMTOKEN", "duration", "token"]:
        if col.max_length is None:
            return String(1000)
        min_length = 0 if col.min_length is None else col.min_length
        if min_length >= col.max_length - 1 and not col.allow_empty:
            return String(col.max_length)
        return String(col.max_length)
    if col.data_type == "binary":
        return LargeBinary(col.max_length)
    else:
        logger.warning(
            f"unknown type '{col.data_type}' for column '{col.name}', defaulting to VARCHAR(1000) "
            f"(this can be overridden by providing a field type in the configuration)"
        )
        return String(1000)
MySQL: types_mapping_mysql

Defines the MySQL/sqlalchemy type to use for given column properties in target tables

Parameters:

Name Type Description Default
temp bool

are we targeting the temporary tables schema or the final tables?

required
col DataModelColumn

an object representing a column of a table for which we are determining the SQL type to define

required

Returns:

Type Description
Any

a sqlalchemy class representing the data type to be used

Source code in xml2db/table/column.py
def types_mapping_mysql(temp: bool, col: "DataModelColumn") -> Any:
    """Defines the MySQL/sqlalchemy type to use for given column properties in target tables

    Args:
        temp: are we targeting the temporary tables schema or the final tables?
        col: an object representing a column of a table for which we are determining the SQL type to define

    Returns:
        a sqlalchemy class representing the data type to be used
    """
    if col.occurs[1] != 1:
        return String(4000)
    if col.data_type in ["string", "NMTOKEN", "duration", "token"]:
        if col.max_length is None:
            return String(255)
    if col.data_type == "binary":
        if col.max_length == col.min_length:
            return mysql.BINARY(col.max_length)
        return mysql.VARBINARY(col.max_length)
    return types_mapping_default(temp, col)
MSSQL: types_mapping_mssql

Defines the MSSQL type to use for given column properties in target tables

Parameters:

Name Type Description Default
temp bool

are we targeting the temporary tables schema or the final tables?

required
col DataModelColumn

an object representing a column of a table for which we are determining the SQL type to define

required

Returns:

Type Description
Any

a sqlalchemy class representing the data type to be used

Source code in xml2db/table/column.py
def types_mapping_mssql(temp: bool, col: "DataModelColumn") -> Any:
    """Defines the MSSQL type to use for given column properties in target tables

    Args:
        temp: are we targeting the temporary tables schema or the final tables?
        col: an object representing a column of a table for which we are determining the SQL type to define

    Returns:
        a sqlalchemy class representing the data type to be used
    """
    if col.occurs[1] != 1:
        return mssql.VARCHAR(8000)
    if col.data_type in ["decimal", "float"]:
        return Double
    if col.data_type == "dateTime":
        # using the DATETIMEOFFSET directly in the temporary table caused issues when inserting data in the target
        # table with INSERT INTO SELECT converts datetime VARCHAR to DATETIMEOFFSET without errors
        return mssql.VARCHAR(100) if temp else mssql.DATETIMEOFFSET
    if col.data_type == "integer" or col.data_type == "int":
        return Integer
    if col.data_type == "boolean":
        return Boolean
    if col.data_type == "byte":
        return SmallInteger
    if col.data_type == "long":
        return BigInteger
    if col.data_type == "date":
        return mssql.VARCHAR(16)
    if col.data_type == "time":
        return mssql.VARCHAR(18)
    if col.data_type in ["string", "NMTOKEN", "duration", "token"]:
        if col.max_length is None:
            return mssql.VARCHAR(1000)
        min_length = 0 if col.min_length is None else col.min_length
        if min_length >= col.max_length - 1 and not col.allow_empty:
            return mssql.CHAR(col.max_length)
        return mssql.VARCHAR(col.max_length)
    if col.data_type == "binary":
        if col.max_length == col.min_length:
            return mssql.BINARY(col.max_length)
        return mssql.VARBINARY(col.max_length)
    else:
        logger.warning(
            f"unknown type '{col.data_type}' for column '{col.name}', defaulting to VARCHAR(1000) "
            f"(this can be overridden by providing a field type in the configuration)"
        )
        return mssql.VARCHAR(1000)

You may override this mapping by specifying a column type for any field in the model config. Custom column types are defined as sqlalchemy types and will be passed to the sqlalchemy.Column constructor as is.

Example

If the XSD mentions the integer type for column my_column in table my_table, by default, xml2dbwill map to sqlalchemy.Integer. For instance, if you want it to map to mssql.BIGINT instead, you can provide this config:

import xml2db
from sqlalchemy.dialects import mssql

model_config = {
    "tables": {
        "my_table": {
            "fields": {
                "my_column": {
                    "type": mssql.BIGINT
                }
            }
        },
    },
}

data_model = xml2db.DataModel(
    xsd_file="path/to/file.xsd", db_schema="my_schema", model_config=model_config
)

You can infer my_table and my_column when visualizing the data model.

Joining values for simple types

By default, XML simple type elements with types in ["string", "date", "dateTime", "NMTOKEN", "time"] and max occurrences >= 1 are joined in one column as comma separated values and optionally wrapped in double quotes if they contain commas (an Excel-like csv format, which can be queried with LIKE statements in SQL).

Configuration: "transform": "join" (default). It is not currently possible to use False to opt-out of an automatically applied join, as it would require a complex process of adding a new table.

Example

This config option is currently not very useful as it cannot be opted out.

model_config = {
   "tables": {
       "my_table_name": {
           "fields": {
               "my_field_name": {
                   "transform": "join"
               }
           }
       }
   }
}

Elevate children to upper level

If a complex child element has a minimum and maximum occurrences number of 1 and 1 respectively, it can be "pulled" up to its parent element. This behaviour will always be applied by default.

If a complex child element has a minimum and maximum occurrences number of 0 and 1 respectively, it can also be "pulled" up to its parent element fields. This is applied by default if the child has less than 5 fields, because otherwise it could clutter the parent element with many columns that will often be all NULL.

This simplification can be opted out using a configuration option, and forced in the case of a child with more than 5 fields, using the following option:

"transform": "elevate" (default) or "elevate_wo_prefix" or False (disable).

By default, the elevated field name is prefixed with the name of the complex child so its origin is clear and to prevent duplicated names, but this prefixing can be avoided with the value "elevate_wo_prefix".

For example, complex child timeInterval with 2 fields of max occurrence 1, before elevation...

# Child table
timeInterval[1, 1]:
    start[1, 1]: string
    end[1, 1]: string

... and after elevation (with prefix):

# Parent fields
timeInterval_start[1, 1]: string
timeInterval_end[1, 1]: string

Example

Force "elevation" of a complex type to its parent:

model_config = {
    "tables": {
        "contract": {
            "fields": {
                "docStatus": {
                    "transform": "elevate"
                }
            }
        }
    }
}

Tables configuration

Simplify "choice groups"

In XML schemas, choice groups are quite frequent. It means that only one of its possible children types should be present.

Here we consider only choice groups of simple elements (not complex types). The naive way to convert this to a table is to create one column for each possible choice, of which only one will have a non NULL value for each record.

If there are more than 2 possible choice options and the simple elements are of the same type, they can be transformed into two columns:

  • type with the name of the element
  • value with its value

Example of choice child in a table, before...

idOfMarketParticipant[1, 1] (choice):
   lei[1, 1]: string
   bic[1, 1]: string
   eic[1, 1]: string
   gln[1, 1]: string

... and after choice transformation:

idOfMarketParticipant[1, 1] (choice):
   type[1, 1]: string  # with possible values: ["lei", "bic", "eic", "gln"]
   value[1, 1]: string

This simplification is applied by default when there are more than 2 options of the same data type, but it can be opted in or out otherwise, with the following option:

"choice_transform": True (default) or False (disable)

Example

Disable choice group simplification for a choice group:

model_config = {
    "tables": {
        "my_table_name": {
            "choice_transform": False
        }
    }
}

Deduplication

By default, xml2db will try to deduplicate elements (store identical element only once in the database) in order to reduce storage footprint, which is particularly relevant for "feature" fields in XML schemas, meaning when a XML element specify a feature as a child element, which is shared with many other elements.

This is done using a hash of each node in the XML file, which includes recursively all its children. The detailed process is described in the how it works page.

The implication is that relationships with 1-1 or 1-n cardinality in the XML schema are converted by default into n-1 and n-n relationships in the database. For n-n, relationships, it means that there is an additional relationship table which has foreign keys relations to both tables in the relationship.

This behaviour can be opted-out, for instance if you know that there will be mostly unique elements and you prefer not having the additional relationship table. The 1-n relationship will be modelled using only a foreign key to the parent, without an intermediate table holding the relationship, which makes the data model simpler, and maybe some queries faster, but stores more records in case of duplicated records.

Configuration: "reuse": True (default) or False (disable)

Example

Disabling deduplication for a given table:

model_config = {
    "tables": {
        "my_table": {"reuse": False}
    }
}

Columnstore Clustered Index

With MS SQL Server database backend, xml2db can create Clustered Columnstore indexes on tables. However, for n-n relationships tables, this option needs to be set globally (see below). The default value is False (disabled).

Extra arguments

Extra arguments can be passed to sqlalchemy.Table constructors, for instance if you want to customize indexes. These can be passed in an iterable (e.g. tuple or list) which will be simply unpacked into the sqlalchemy.Table constructor when building the table.

Configuration: "extra_args": [] (default)

Example

Adding an index on a specific column:

model_config = {
    "tables": {
        "my_table": {
            "extra_args": sqlalchemy.Index("my_index", "my_column1", "my_column2"),
        }
    }
}