DataModel

A class to manage a data model based on an XML schema and its database equivalent.

It is the main entry point for xml2db.

This class allows parsing an XSD file to build a representation of the XML schema, simplify it and convert it into a set of database tables. It also allows parsing XML documents that fit this XML schema and importing their content into a database.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `xsd_file` | `str` | A path to an XSD file | *required* |
| `short_name` | `str` | A short name for the schema | `None` |
| `long_name` | `str` | A longer name for the schema | `None` |
| `base_url` | `str` | The root folder used to resolve other dependent XSD files (by default, the location of the provided XSD file) | `None` |
| `model_config` | `dict` | A config dict providing options for building the model (full options available here: Configuring your data model) | `None` |
| `connection_string` | `str` | A database connection string (optional if you will not be loading data) | `None` |
| `db_engine` | `Engine` | A `sqlalchemy.Engine` to use to connect to the database (it takes precedence over `connection_string` and is optional if you will not be loading data) | `None` |
| `db_type` | `str` | The target database backend (postgresql, mssql, mysql, etc.). Ignored if `connection_string` or `db_engine` is provided, in which case it is inferred from them | `None` |
| `db_schema` | `str` | A schema name to use in the database | `None` |
| `temp_prefix` | `str` | A prefix to use for temporary tables (if `None`, one is generated randomly) | `None` |

Attributes:

| Name | Description |
|------|-------------|
| `xml_schema` | The `xmlschema.XMLSchema` object associated with this data model |
| `lxml_schema` | The `lxml.etree.XMLSchema` object associated with this data model |
| `data_flow_name` | A short identifier used for the data model (`short_name` argument value) |
| `data_flow_long_name` | A longer name for the data model (`long_name` argument value) |
| `db_schema` | A database schema name to store the database tables |
| `source_tree` | A text representation of the source data model tree |
| `target_tree` | A text representation of the simplified data model tree which will be used to create the target tables |

Examples:

Create a DataModel like this:

>>> data_model = DataModel(
...     xsd_file="path/to/file.xsd",
...     connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
... )
Source code in xml2db/model.py
def __init__(
    self,
    xsd_file: str,
    short_name: str = None,
    long_name: str = None,
    base_url: str = None,
    model_config: dict = None,
    connection_string: str = None,
    db_engine: sqlalchemy.Engine = None,
    db_type: str = None,
    db_schema: str = None,
    temp_prefix: str = None,
):
    self.model_config = self._validate_config(model_config)
    self.tables_config = model_config.get("tables", {}) if model_config else {}

    xsd_file_name = xsd_file
    if base_url is None:
        base_url = os.path.normpath(os.path.dirname(xsd_file))
        xsd_file_name = os.path.basename(xsd_file)

    self.xml_schema = xmlschema.XMLSchema(xsd_file_name, base_url=base_url)
    self.lxml_schema = etree.XMLSchema(etree.parse(xsd_file))

    self.xml_converter = XMLConverter(data_model=self)
    self.data_flow_name = short_name
    self.data_flow_long_name = long_name

    if connection_string is None and db_engine is None:
        logger.warning(
            "DataModel created without connection string cannot do actual imports"
        )
        self.engine = None
        self.db_type = db_type
    else:
        if db_engine:
            self.engine = db_engine
        else:
            engine_options = {}
            if "mssql" in connection_string:
                engine_options = {
                    "fast_executemany": True,
                    "isolation_level": "SERIALIZABLE",
                }
            self.engine = create_engine(
                connection_string,
                **engine_options,
            )
        self.db_type = self.engine.dialect.name

    self.db_schema = db_schema
    self.temp_prefix = str(uuid4())[:8] if temp_prefix is None else temp_prefix

    self.tables = {}
    self.names_types_map = {}
    self.root_table = None

    self.types_transforms = {}
    self.fields_transforms = {}
    self.ordered_tables_keys = []
    self.transaction_groups = []
    self.source_tree = ""
    self.target_tree = ""
    self.metadata = MetaData()
    self.processed_at = datetime.now()

    self._build_model()
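
If you already manage your own SQLAlchemy engine, it can be passed through the db_engine argument instead of connection_string. A minimal sketch (the connection URL, schema name and file path are placeholders; the import assumes DataModel is exposed at the package root):

import sqlalchemy
from xml2db import DataModel

# a pre-built engine takes precedence over connection_string
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://testuser:testuser@localhost:5432/testdb"
)

data_model = DataModel(
    xsd_file="path/to/file.xsd",
    db_engine=engine,
    db_schema="my_schema",
)

# print the simplified data model tree used to build the target tables
print(data_model.target_tree)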

fk_ordered_tables: Iterable[Union[DataModelTableDuplicated, DataModelTableReused]] property

Yields tables in create/insert order (tables referenced in foreign keys first)

fk_ordered_tables_reversed: Iterable[Union[DataModelTableDuplicated, DataModelTableReused]] property

Yields tables in drop/delete order (tables referencing foreign keys first)
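
As an illustration, these properties can be used to walk the model in dependency order (a sketch, assuming data_model was created as in the example above; the print calls simply show each table object's default representation):

# tables referenced by foreign keys come first: safe order for CREATE / INSERT
for table in data_model.fk_ordered_tables:
    print(table)

# reverse order: safe order for DROP / DELETE
for table in data_model.fk_ordered_tables_reversed:
    print(table)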

create_all_tables(temp=False)

Create tables for the data model, either target tables or temp tables used to import data.

You do not have to call this method explicitly when using Document.insert_into_target_tables(), which will create tables if they do not exist.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `temp` | `bool` | If `False`, create target tables (unprefixed). If `True`, create temporary (prefixed) tables. | `False` |

Source code in xml2db/model.py
def create_all_tables(self, temp: bool = False) -> None:
    """Create tables for the data model, either target tables or temp tables used to import data.

    You do not have to call this method explicitly when using
        [`Document.insert_into_target_tables()`](document.md#xml2db.document.Document.insert_into_target_tables),
        which will create tables if they do not exist.

    Args:
        temp: If `False`, create target tables (unprefixed). If `True`, create temporary (prefixed) tables.
    """
    for tb in self.fk_ordered_tables:
        tb.create_tables(self.engine, temp)
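
Typical usage, assuming data_model was created with a connection string or engine as shown above:

# create the target (unprefixed) tables
data_model.create_all_tables()

# or create the temporary (prefixed) tables used while importing data
data_model.create_all_tables(temp=True)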

create_db_schema()

Create database schema if it does not already exist.

You do not have to call this method explicitly when using Document.insert_into_target_tables().

Source code in xml2db/model.py
def create_db_schema(self) -> None:
    """Create database schema if it does not already exist.

    You do not have to call this method explicitly when using
        [`Document.insert_into_target_tables()`](document.md#xml2db.document.Document.insert_into_target_tables).
    """

    def do_create_schema():
        with self.engine.connect() as conn:
            conn.execute(sqlalchemy.schema.CreateSchema(self.db_schema))
            conn.commit()

    if self.db_schema is not None:
        if self.db_type == "duckdb":
            try:
                do_create_schema()
            except ProgrammingError:
                pass
        else:
            inspector = inspect(self.engine)
            if self.db_schema not in inspector.get_schema_names():
                do_create_schema()

        logger.info(f"Created schema: {self.db_schema}")
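
For example, when a db_schema was passed to the constructor, the schema can be created explicitly before the tables (a sketch, assuming data_model was built with db_schema="my_schema"):

data_model.create_db_schema()
data_model.create_all_tables()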

drop_all_tables()

Drop the data model target (unprefixed) tables.

Danger

BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!

Source code in xml2db/model.py
def drop_all_tables(self):
    """Drop the data model target (unprefixed) tables.

    Danger:
        BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!
    """
    for tb in self.fk_ordered_tables_reversed:
        tb.drop_tables(self.engine)

drop_all_temp_tables()

Drop the data model temporary (prefixed) tables.

Danger

BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!

Source code in xml2db/model.py
def drop_all_temp_tables(self):
    """Drop the data model temporary (prefixed) tables.

    Danger:
        BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE!
    """
    for tb in self.fk_ordered_tables_reversed:
        tb.drop_temp_tables(self.engine)
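
For example, temporary tables can be cleaned up after a failed import (a sketch; temporary tables are prefixed with this instance's temp_prefix, so you may need to pass the same temp_prefix to the constructor to target tables created by a previous run):

# drop the temporary (prefixed) tables of this DataModel instance
data_model.drop_all_temp_tables()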

extract_from_database(root_select_where, force_tz=None)

Extract a document from the database, based on a where clause applied to the root table. For instance, you can use the column xml2db_input_file_path to filter the data loaded from a specific file.

It will query all the data in the database corresponding to the rows that you select from the root table of your data model. Typically, a single XML file will correspond to a single row in the root table. This function will query the data tree below this record.

This method was not optimized for performance and can be quite slow. It is used in integration tests to check the output against the data inserted into the database.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `root_select_where` | `str` | A where clause to filter the root table of the model, as a string | *required* |
| `force_tz` | `Union[str, None]` | Apply this timezone if the database returns timezone-naïve datetimes | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Document` | A `Document` object containing extracted data |

Examples:

Source code in xml2db/model.py
def extract_from_database(
    self,
    root_select_where: str,
    force_tz: Union[str, None] = None,
) -> Document:
    """Extract a document from the database, based on a where clause applied to the root table. For instance, you
        can use the column `xml2db_input_file_path` to filter the data loaded from a specific file.

    It will query all the data in the database corresponding to the rows that you select from the root table of your
        data model. Typically, a single XML file will correspond to a single row in the root table. This function
        will query the data tree below this record.

    This method was not optimized for performance and can be quite slow. It is used in integration tests to check
        the output against the data inserted into the database.

    Args:
        root_select_where: A where clause to filter the root table of the model, as a string
        force_tz: Apply this timezone if the database returns timezone-naïve datetimes

    Returns:
        A [`Document`](document.md) object containing extracted data

    Examples:

    """
    doc = Document(self)
    doc.extract_from_database(self.root_table, root_select_where, force_tz=force_tz)
    return doc
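
For example, to re-extract the data that was loaded from a given file (a sketch; the xml2db_input_file_path column is the one described above, and the file path is a placeholder):

doc = data_model.extract_from_database(
    "xml2db_input_file_path = 'path/to/file.xml'"
)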

get_all_create_index_statements()

Yield create index statements for all tables

Source code in xml2db/model.py
def get_all_create_index_statements(self) -> Iterable[CreateIndex]:
    """Yield create index statements for all tables"""
    for tb in self.fk_ordered_tables:
        yield from tb.get_create_index_statements()

get_all_create_table_statements(temp=False)

Yield sqlalchemy create table statements for all tables

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `temp` | `bool` | If `False`, yield create table statements for target tables (unprefixed). If `True`, yield create table statements for temporary tables (prefixed). | `False` |

Source code in xml2db/model.py
def get_all_create_table_statements(
    self, temp: bool = False
) -> Iterable[CreateTable]:
    """Yield sqlalchemy `create table` statements for all tables

    Args:
        temp: If `False`, yield create table statements for target tables (unprefixed). If True, yield create
            table statements for temporary tables (prefixed).
    """
    for tb in self.fk_ordered_tables:
        yield from tb.get_create_table_statements(temp)
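
Together with get_all_create_index_statements(), this can be used to print the DDL without touching the database (a sketch; printing a statement compiles it with a default dialect, so the exact SQL may differ slightly from what your backend would receive):

# print the CREATE TABLE statements for the target (unprefixed) tables
for statement in data_model.get_all_create_table_statements():
    print(statement)

# print the CREATE INDEX statements
for statement in data_model.get_all_create_index_statements():
    print(statement)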

get_entity_rel_diagram(text_context=True)

Build an entity relationship diagram for the data model

The diagram uses the mermaid.js ERD syntax, which is rendered visually by the PyCharm IDE or by GitHub in markdown files, among others

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_context` | `bool` | Should we add a title, a text explanation, etc. or just the ERD? | `True` |

Returns:

| Type | Description |
|------|-------------|
| `str` | A string representation of the ERD |

Source code in xml2db/model.py
def get_entity_rel_diagram(self, text_context: bool = True) -> str:
    """Build an entity relationship diagram for the data model

    The diagram uses the mermaid.js ERD syntax, which is rendered visually by the PyCharm IDE or by GitHub
    in markdown files, among others

    Args:
        text_context: Should we add a title, a text explanation, etc. or just the ERD?

    Returns:
        A string representation of the ERD
    """
    out = ["erDiagram"]
    for tb in self.fk_ordered_tables_reversed:
        out += tb.get_entity_rel_diagram()

    if text_context:
        out = (
            [
                f"# {self.data_flow_long_name}\n",
                f"### Data model name: `{self.data_flow_name}`\n",
                (
                    "The following *Entity Relationships Diagram* represents the target data model, after the "
                    "simplification of the source data model, but before the transformations performed to optimize "
                    "data storage (transformation of `1-1` and `1-n` relationships into `n-1` and `n-n` "
                    "relationships, respectively.\n"
                ),
                (
                    "As a consequence, not all tables of the actual data model used in the database are shown. "
                    "Specifically, `1-n` relationships presented may be stored in the database using an additional "
                    "relationship table (noted with an asterisk in the relationship name).\n"
                ),
                "```mermaid",
            ]
            + out
            + [
                "```",
                (
                    "`-N` suffix in field type indicates that the field can have multiple values, which will be "
                    "stored as comma separated values."
                ),
            ]
        )
    return "\n".join(out)

parse_xml(xml_file, metadata=None, skip_validation=True, iterparse=True, recover=False, flat_data=None)

Parse an XML document based on this data model

This method is just a wrapper around the parse_xml method of the Document class.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `xml_file` | `Union[str, BytesIO]` | The path or the file object of an XML file to parse | *required* |
| `metadata` | `dict` | A dict of metadata values to add to the root table (a value for each key defined in `metadata_columns` passed to model config) | `None` |
| `skip_validation` | `bool` | Skip validating the document against the XML schema before parsing it? | `True` |
| `iterparse` | `bool` | Parse XML using iterative parsing, which is a bit slower but uses less memory | `True` |
| `recover` | `bool` | Should we try to parse incorrect XML? (argument passed to the lxml parser) | `False` |
| `flat_data` | `dict` | A dict containing flat data if we want to add data to another dataset instead of creating a new one | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Document` | A parsed `Document` object |

Source code in xml2db/model.py
def parse_xml(
    self,
    xml_file: Union[str, BytesIO],
    metadata: dict = None,
    skip_validation: bool = True,
    iterparse: bool = True,
    recover: bool = False,
    flat_data: dict = None,
) -> Document:
    """Parse an XML document based on this data model

    This method is just a wrapper around the parse_xml method of the Document class.

    Args:
        xml_file: The path or the file object of an XML file to parse
        metadata: A dict of metadata values to add to the root table (a value for each key defined in
            `metadata_columns` passed to model config)
        skip_validation: Skip validating the document against the XML schema before parsing it?
        iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
        recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
        flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
            a new one

    Returns:
        A parsed [`Document`](document.md) object
    """
    doc = Document(self)
    doc.parse_xml(
        xml_file=xml_file,
        metadata=metadata,
        skip_validation=skip_validation,
        iterparse=iterparse,
        recover=recover,
        flat_data=flat_data,
    )
    return doc
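
A typical end-to-end usage, parsing a file and loading its content into the database (a sketch; insert_into_target_tables() is the Document method referenced above, and the file path is a placeholder):

doc = data_model.parse_xml("path/to/file.xml")
doc.insert_into_target_tables()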