DataModel

A class to manage a data model based on an XML schema and its database equivalent.

It is the main entry point for xml2db.

This class allows parsing an XSD file to build a representation of the XML schema, simplify it and convert it into a set of database tables. It also allows parsing XML documents that fit this XML schema and importing their content into a database.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `xsd_file` | `str` | A path to an XSD file | *required* |
| `short_name` | `str` | A short name for the schema | `None` |
| `long_name` | `str` | A longer name for the schema | `None` |
| `base_url` | `str` | The root folder used to resolve other dependent XSD files (by default, the location of the provided XSD file) | `None` |
| `model_config` | `dict` | A config dict providing options for building the model (full options available here: Configuring your data model) | `None` |
| `connection_string` | `str` | A database connection string (optional if you will not be loading data) | `None` |
| `db_engine` | `Engine` | A `sqlalchemy.Engine` to use to connect to the database (it takes precedence over `connection_string` and is optional if you will not be loading data) | `None` |
| `db_type` | `str` | The target database backend (postgresql, mssql, mysql, etc.). Ignored if `connection_string` or `db_engine` is provided, in which case it is inferred from them | `None` |
| `db_schema` | `str` | A schema name to use in the database | `None` |
| `temp_prefix` | `str` | A prefix to use for temporary tables (if `None`, one is generated randomly) | `None` |

Attributes:

| Name | Description |
|------|-------------|
| `xml_schema` | The `xmlschema.XMLSchema` object associated with this data model |
| `lxml_schema` | The `lxml.etree.XMLSchema` object associated with this data model |
| `data_flow_name` | A short identifier used for the data model (`short_name` argument value) |
| `data_flow_long_name` | A longer name for the data model (`long_name` argument value) |
| `db_schema` | A database schema name to store the database tables |
| `source_tree` | A text representation of the source data model tree |
| `target_tree` | A text representation of the simplified data model tree which will be used to create the target tables |

Examples:

Create a DataModel like this:

>>> data_model = DataModel(
...     xsd_file="path/to/file.xsd",
...     connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
... )
Source code in xml2db/model.py
def __init__(
    self,
    xsd_file: str,
    short_name: str = None,
    long_name: str = None,
    base_url: str = None,
    model_config: dict = None,
    connection_string: str = None,
    db_engine: sqlalchemy.Engine = None,
    db_type: str = None,
    db_schema: str = None,
    temp_prefix: str = None,
):
    self.model_config = self._validate_config(model_config)
    self.tables_config = model_config.get("tables", {}) if model_config else {}

    xsd_file_name = xsd_file
    if base_url is None:
        base_url = os.path.normpath(os.path.dirname(xsd_file))
        xsd_file_name = os.path.basename(xsd_file)

    self.xml_schema = xmlschema.XMLSchema(xsd_file_name, base_url=base_url)
    self.lxml_schema = etree.XMLSchema(etree.parse(xsd_file))

    self.xml_converter = XMLConverter(data_model=self)
    self.data_flow_name = short_name
    self.data_flow_long_name = long_name

    if connection_string is None and db_engine is None:
        logger.warning(
            "DataModel created without connection string cannot do actual imports"
        )
        self.engine = None
        self.db_type = db_type
    else:
        if db_engine:
            self.engine = db_engine
        else:
            engine_options = {}
            if "mssql" in connection_string:
                engine_options = {
                    "fast_executemany": True,
                    "isolation_level": "SERIALIZABLE",
                }
            self.engine = create_engine(
                connection_string,
                **engine_options,
            )
        self.db_type = self.engine.dialect.name

    self.db_schema = db_schema
    self.temp_prefix = str(uuid4())[:8] if temp_prefix is None else temp_prefix

    self.tables = {}
    self.names_types_map = {}
    self.root_table = None

    self.types_transforms = {}
    self.fields_transforms = {}
    self.ordered_tables_keys = []
    self.transaction_groups = []
    self.source_tree = ""
    self.target_tree = ""
    self.metadata = MetaData()
    self.processed_at = datetime.now()

    self._build_model()
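
If you already manage your own SQLAlchemy engine, it can be passed through the db_engine argument instead of connection_string. A minimal sketch (the connection URL, schema name and file path are placeholders; the import assumes DataModel is exposed at the package root):

import sqlalchemy
from xml2db import DataModel

# a pre-built engine takes precedence over connection_string
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://testuser:testuser@localhost:5432/testdb"
)

data_model = DataModel(
    xsd_file="path/to/file.xsd",
    db_engine=engine,
    db_schema="my_schema",
)

# print the simplified data model tree used to build the target tables
print(data_model.target_tree)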

fk_ordered_tables: Iterable[Union[DataModelTableDuplicated, DataModelTableReused]] property

Yields tables in create/insert order (tables referenced in foreign keys first)

fk_ordered_tables_reversed: Iterable[Union[DataModelTableDuplicated, DataModelTableReused]] property

Yields tables in drop/delete order (tables referencing foreign keys first)
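
As an illustration, these properties can be used to walk the model in dependency order (a sketch, assuming data_model was created as in the example above; the print calls simply show each table object's default representation):

# tables referenced by foreign keys come first: safe order for CREATE / INSERT
for table in data_model.fk_ordered_tables:
    print(table)

# reverse order: safe order for DROP / DELETE
for table in data_model.fk_ordered_tables_reversed:
    print(table)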

create_all_tables(temp=False)

Create tables for the data model, either target tables or temp tables used to import data.

You do not have to call this method explicitly when using Document.insert_into_target_tables(), which will create tables if they do not exist.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `temp` | `bool` | If `False`, create target tables (unprefixed). If `True`, create temporary (prefixed) tables. | `False` |

Source code in xml2db/model.py
def create_all_tables(self, temp: bool = False) -> None:
    """Create tables for the data model, either target tables or temp tables used to import data.

    You do not have to call this method explicitly when using
        [`Document.insert_into_target_tables()`](document.md#xml2db.document.Document.insert_into_target_tables),
        which will create tables if they do not exist.

    Args:
        temp: If `False`, create target tables (unprefixed). If `True`, create temporary (prefixed) tables.
    """
    for tb in self.fk_ordered_tables:
        tb.create_tables(self.engine, temp)
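
Typical usage, assuming data_model was created with a connection string or engine as shown above:

# create the target (unprefixed) tables
data_model.create_all_tables()

# or create the temporary (prefixed) tables used while importing data
data_model.create_all_tables(temp=True)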

create_db_schema()

Create database schema if it does not already exist.

You do not have to call this method explicitly when using Document.insert_into_target_tables().

Source code in xml2db/model.py
def create_db_schema(self) -> None:
    """Create database schema if it does not already exist.

    You do not have to call this method explicitly when using
        [`Document.insert_into_target_tables()`](document.md#xml2db.document.Document.insert_into_target_tables).
    """

    def do_create_schema():
        with self.engine.connect() as conn:
            conn.execute(sqlalchemy.schema.CreateSchema(self.db_schema))
            conn.commit()

    if self.db_schema is not None:
        if self.db_type == "duckdb":
            try:
                do_create_schema()
            except ProgrammingError:
                pass
        else:
            inspector = inspect(self.engine)
            if self.db_schema not in inspector.get_schema_names():
                do_create_schema()

        logger.info(f"Created schema: {self.db_schema}")
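
For example, when a db_schema was passed to the constructor, the schema can be created explicitly before the tables (a sketch, assuming data_model was built with db_schema="my_schema"):

data_model.create_db_schema()
data_model.create_all_tables()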

drop_all_tables()

Drop the data model target (unprefixed) tables.

Danger

BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!

Source code in xml2db/model.py
def drop_all_tables(self):
    """Drop the data model target (unprefixed) tables.

    Danger:
        BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!
    """
    for tb in self.fk_ordered_tables_reversed:
        tb.drop_tables(self.engine)

drop_all_temp_tables()

Drop the data model temporary (prefixed) tables.

Danger

BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!

Source code in xml2db/model.py
def drop_all_temp_tables(self):
    """Drop the data model temporary (prefixed) tables.

    Danger:
        BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!
    """
    for tb in self.fk_ordered_tables_reversed:
        tb.drop_temp_tables(self.engine)
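
For example, temporary tables can be cleaned up after a failed import (a sketch; temporary tables are prefixed with this instance's temp_prefix, so you may need to pass the same temp_prefix to the constructor to target tables created by a previous run):

# drop the temporary (prefixed) tables of this DataModel instance
data_model.drop_all_temp_tables()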

extract_from_database(root_select_where, force_tz=None)

Extract a document from the database, based on a where clause applied to the root table. For instance, you can use the column xml2db_input_file_path to filter the data loaded from a specific file.

It will query all the data in the database corresponding to the rows that you select from the root table of your data model. Typically, a single XML file will correspond to a single row in the root table. This function will query the data tree below this record.

This method was not optimized for performance and can be quite slow. It is used in integration tests to check the output against the data inserted into the database.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `root_select_where` | `str` | A where clause to filter the root table of the model, as a string | *required* |
| `force_tz` | `Union[str, None]` | Apply this timezone if the database returns timezone-naïve datetimes | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Document` | A `Document` object containing extracted data |

Examples:

Source code in xml2db/model.py
def extract_from_database(
    self,
    root_select_where: str,
    force_tz: Union[str, None] = None,
) -> Document:
    """Extract a document from the database, based on a where clause applied to the root table. For instance, you
        can use the column `xml2db_input_file_path` to filter the data loaded from a specific file.

    It will query all the data in the database corresponding to the rows that you select from the root table of your
        data model. Typically, a single XML file will correspond to a single row in the root table. This function
        will query the data tree below this record.

    This method was not optimized for performance and can be quite slow. It is used in integration tests to check
        the output against the data inserted into the database.

    Args:
        root_select_where: A where clause to filter the root table of the model, as a string
        force_tz: Apply this timezone if the database returns timezone-naïve datetimes

    Returns:
        A [`Document`](document.md) object containing extracted data

    Examples:

    """
    doc = Document(self)
    doc.extract_from_database(self.root_table, root_select_where, force_tz=force_tz)
    return doc
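
For example, to re-extract the data that was loaded from a given file (a sketch; the xml2db_input_file_path column is the one described above, and the file path is a placeholder):

doc = data_model.extract_from_database(
    "xml2db_input_file_path = 'path/to/file.xml'"
)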

get_all_create_index_statements()

Yield create index statements for all tables

Source code in xml2db/model.py
def get_all_create_index_statements(self) -> Iterable[CreateIndex]:
    """Yield create index statements for all tables"""
    for tb in self.fk_ordered_tables:
        yield from tb.get_create_index_statements()

get_all_create_table_statements(temp=False)

Yield sqlalchemy create table statements for all tables

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `temp` | `bool` | If `False`, yield create table statements for target tables (unprefixed). If `True`, yield create table statements for temporary tables (prefixed). | `False` |

Source code in xml2db/model.py
def get_all_create_table_statements(
    self, temp: bool = False
) -> Iterable[CreateTable]:
    """Yield sqlalchemy `create table` statements for all tables

    Args:
        temp: If `False`, yield create table statements for target tables (unprefixed). If True, yield create
            table statements for temporary tables (prefixed).
    """
    for tb in self.fk_ordered_tables:
        yield from tb.get_create_table_statements(temp)
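
Together with get_all_create_index_statements(), this can be used to print the DDL without touching the database (a sketch; printing a statement compiles it with a default dialect, so the exact SQL may differ slightly from what your backend would receive):

# print the CREATE TABLE statements for the target (unprefixed) tables
for statement in data_model.get_all_create_table_statements():
    print(statement)

# print the CREATE INDEX statements
for statement in data_model.get_all_create_index_statements():
    print(statement)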

get_entity_rel_diagram(text_context=True)

Build an entity relationship diagram for the data model

The diagram uses the mermaid.js ERD syntax, which is rendered visually by the PyCharm IDE or by GitHub in markdown files, among others

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_context` | `bool` | Should we add a title, a text explanation, etc. or just the ERD? | `True` |

Returns:

| Type | Description |
|------|-------------|
| `str` | A string representation of the ERD |

Source code in xml2db/model.py
def get_entity_rel_diagram(self, text_context: bool = True) -> str:
    """Build an entity relationship diagram for the data model

    The diagram uses the mermaid.js ERD syntax, which is rendered visually by the PyCharm IDE or by GitHub
    in markdown files, among others

    Args:
        text_context: Should we add a title, a text explanation, etc. or just the ERD?

    Returns:
        A string representation of the ERD
    """
    out = ["erDiagram"]
    for tb in self.fk_ordered_tables_reversed:
        out += tb.get_entity_rel_diagram()

    if text_context:
        out = (
            [
                f"# {self.data_flow_long_name}\n",
                f"### Data model name: `{self.data_flow_name}`\n",
                (
                    "The following *Entity Relationships Diagram* represents the target data model, after the "
                    "simplification of the source data model, but before the transformations performed to optimize "
                    "data storage (transformation of `1-1` and `1-n` relationships into `n-1` and `n-n` "
                    "relationships, respectively.\n"
                ),
                (
                    "As a consequence, not all tables of the actual data model used in the database are shown. "
                    "Specifically, `1-n` relationships presented may be stored in the database using an additional "
                    "relationship table (noted with an asterisk in the relationship name).\n"
                ),
                "```mermaid",
            ]
            + out
            + [
                "```",
                (
                    "`-N` suffix in field type indicates that the field can have multiple values, which will be "
                    "stored as comma separated values."
                ),
            ]
        )
    return "\n".join(out)

parse_xml(xml_file, metadata=None, skip_validation=True, iterparse=True, recover=False, flat_data=None)

Parse an XML document based on this data model

This method is just a wrapper around the parse_xml method of the Document class.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `xml_file` | `Union[str, BytesIO]` | The path or the file object of an XML file to parse | *required* |
| `metadata` | `dict` | A dict of metadata values to add to the root table (a value for each key defined in `metadata_columns` passed to model config) | `None` |
| `skip_validation` | `bool` | Skip validating the document against the XML schema before parsing it? | `True` |
| `iterparse` | `bool` | Parse XML using iterative parsing, which is a bit slower but uses less memory | `True` |
| `recover` | `bool` | Should we try to parse incorrect XML? (argument passed to the lxml parser) | `False` |
| `flat_data` | `dict` | A dict containing flat data if we want to add data to another dataset instead of creating a new one | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Document` | A parsed `Document` object |

Source code in xml2db/model.py
def parse_xml(
    self,
    xml_file: Union[str, BytesIO],
    metadata: dict = None,
    skip_validation: bool = True,
    iterparse: bool = True,
    recover: bool = False,
    flat_data: dict = None,
) -> Document:
    """Parse an XML document based on this data model

    This method is just a wrapper around the parse_xml method of the Document class.

    Args:
        xml_file: The path or the file object of an XML file to parse
        metadata: A dict of metadata values to add to the root table (a value for each key defined in
            `metadata_columns` passed to model config)
        skip_validation: Skip validating the document against the XML schema before parsing it?
        iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
        recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
        flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
            a new one

    Returns:
        A parsed [`Document`](document.md) object
    """
    doc = Document(self)
    doc.parse_xml(
        xml_file=xml_file,
        metadata=metadata,
        skip_validation=skip_validation,
        iterparse=iterparse,
        recover=recover,
        flat_data=flat_data,
    )
    return doc
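
A typical end-to-end usage, parsing a file and loading its content into the database (a sketch; insert_into_target_tables() is the Document method referenced above, and the file path is a placeholder):

doc = data_model.parse_xml("path/to/file.xml")
doc.insert_into_target_tables()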