Handle missing children nodes when parsing XML into a dictionary

Question

I work on a codebase that uses XML to set up problems and specify model parameters. I've created a script that I run in tandem with our code. It stores important model information parsed from the most recent XML file, and that information eventually ends up in a LaTeX document. The script helps me keep track of the model parameters I've tried and aids reproducibility.

One problem I've come across is that, as I change model parameters, certain nodes are deleted from the XML file, which causes my script to crash. To work around this, I've created a solution that attempts to parse what I want and, if it doesn't find it, just returns an empty dictionary.

This leads me to merging a bunch of dictionaries and I'm not quite sure this is the most idiomatic/efficient way. For this code-review I would like any feedback on how to approach this problem better, plus any styling or formatting suggestions.

Here is a sample xml file ./low_tax/il_train.xml:

<Simulation>
  <Models>
    <ROM name="arma" subType="ARMA">
      <P>2</P>
      <Q>2</Q>
      <Fourier>8760, 2190, 168, 24, 12, 8, 6, 3</Fourier>
      <Segment grouping='interpolate'>
        <subspace pivotLength='168' shift='zero'>HOUR</subspace>
      </Segment>
    </ROM>
    <PostProcessor>
      <KDD>
        <Features>TOTALLOAD</Features>
        <SKLtype>cluster|KMeans</SKLtype>
        <n_clusters>12</n_clusters>
      </KDD>
    </PostProcessor>
  </Models>
  <Samplers>
    <MonteCarlo>
      <samplerInit>
        <limit>8</limit>
        <initialSeed>42</initialSeed>
      </samplerInit>
    </MonteCarlo>
  </Samplers>
</Simulation>

Here is the python code:

#!/usr/bin/env python3
from pathlib import Path
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from datetime import datetime


def search_node(root: ET.Element, node: str, children: list) -> dict:
    """
    Return dictionary containing information requested from node children.

    @In: root, ET.Element, root node of xml tree.
    @In: node, str, a string containing xpath to parent node of interest.
    @In: children, list, a list of expected children nodes.
    @Out: dict, a dictionary containing retrieved information for node.
    """
    node_str = node + "/{child}"
    values = {
        # This information will be placed in LaTeX table;
        # Therefore, we need to preemptively escape underscores.
        k.replace("_", "\\_"): root.findtext(node_str.format(child=k)) for k in children
    }
    return values


def parse_xml(xml_file: Path) -> dict:
    """
    Parse model information from xml file.

    @In: xml_file, Path, path to current specified xml_file.
    @Out: dict, a dictionary of information parsed from xml.
    """
    root = ET.parse(xml_file).getroot()
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-7]

    # Information parsed from xml file.
    case_info = {
        "state": xml_file.name.split("_")[0].upper(),
        "strategy": xml_file.resolve().parent.name,
    }
    model_info = search_node(root, "Models/ROM", ["P", "Q", "Fourier"])
    model_info = {**model_info, **root.find("Models/ROM/Segment/subspace").attrib}
    pp_info = search_node(
        root,
        "Models/PostProcessor/KDD",
        ["SKLtype", "n_clusters", "tol", "random_state"],
    )
    samp_info = search_node(root, "Samplers/MonteCarlo/samplerInit", ["limit"])
    misc_info = {"created": now}

    # Merge all dictionaries
    # This should allow us to not fail on missing nodes
    info_dict = {**case_info, **model_info, **pp_info, **samp_info, **misc_info}

    # Drop any keys with None values to filter the table
    filtered = {k: v for k, v in info_dict.items() if v is not None}
    info_dict.clear()
    info_dict.update(filtered)
    return info_dict


if __name__ == "__main__":
    xml_file = Path("./low_tax/il_train.xml").resolve()
    model_info = parse_xml(xml_file)
    print(model_info)

Which docs parser do you use for the @In/@Out type parameter description?
– hjpotter92, Commented Oct 14, 2020 at 20:37
I'm pretty sure it's supposed to be Doxygen, but our documentation tools have fallen severely behind. I think it broke a while back and we haven't had the funding to fix it. A mess, I know :/
– dylanjm, Commented Oct 14, 2020 at 20:44

l0b0 · Accepted Answer · 2020-10-14 22:49:05Z

Nice, type hints! One thing I might start with, after working with a strict mypy configuration, is to make the type declarations stricter, and then enforce that strictness. For example, a type of list is equivalent to list[Any]. In the case of the children parameter you know more than that: it's list[str]. mypy's --disallow-any-generics option reports bare generics like this.
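As a minimal sketch of what the stricter annotations could look like (assuming Python 3.9+ so that the built-in list and dict can be parameterised directly; the tiny inline XML document is made up for illustration):

```python
import xml.etree.ElementTree as ET
from typing import Optional


def search_node(
    root: ET.Element, node: str, children: list[str]
) -> dict[str, Optional[str]]:
    """Same lookup as in the question, with fully parameterised types."""
    return {
        # findtext returns None for a missing child, hence Optional[str].
        child.replace("_", "\\_"): root.findtext(f"{node}/{child}")
        for child in children
    }


# Illustrative input only: a cut-down version of the sample file.
root = ET.fromstring(
    "<Simulation><Models><ROM><P>2</P><Q>2</Q></ROM></Models></Simulation>"
)
model_info = search_node(root, "Models/ROM", ["P", "Q", "Fourier"])
print(model_info)
```

The Optional[str] in the return type makes the "missing node" case visible to the type checker instead of leaving it implicit.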

In the same vein you can use TypedDict to specify the types of the contents of your dicts. You can specify total=False if some of the entries in the dict are optional.
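A rough sketch of such a TypedDict follows. The class name KddInfo is hypothetical, and the sketch assumes the keys are stored unescaped (the question's LaTeX-escaped keys such as "n\_clusters" are not valid identifiers, so they would need the functional TypedDict(...) syntax or escaping at render time instead):

```python
from typing import TypedDict


class KddInfo(TypedDict, total=False):
    """Values that may be parsed from Models/PostProcessor/KDD.

    total=False marks every key as optional, which matches nodes
    that can disappear from the XML file.
    """

    SKLtype: str
    n_clusters: str
    tol: str
    random_state: str


# Only the keys that were actually found need to be present.
pp_info: KddInfo = {"SKLtype": "cluster|KMeans", "n_clusters": "12"}
print(pp_info)
```

At runtime this is still a plain dict; the benefit comes from mypy flagging misspelled or unexpected keys.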

f-strings are the recommended way to create strings mixing literals and variable values. For example, node_str = node + "/{child}" would be written node_str = f"{node}/{{child}}".
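To illustrate the doubled braces, here is a small sketch using the path from the question. The doubled {{ }} survive as a literal {child} placeholder that str.format can fill in later:

```python
node = "Models/ROM"

# {node} is substituted immediately; {{child}} becomes a literal "{child}".
node_str = f"{node}/{{child}}"
print(node_str)

# The placeholder can then be filled per child, as in the original code.
child_xpath = node_str.format(child="P")
print(child_xpath)
```

If the template is only used inside the comprehension, writing f"{node}/{child}" directly there may be simpler still.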

Single letter variables should be avoided in general. k is used in different places with different meanings; in the first place it should probably be child.

strftime("%Y-%m-%d %H:%M:%S.%f")[:-7] can be simplified to strftime("%Y-%m-%d %H:%M:%S").
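A quick check of that equivalence, using an arbitrary fixed timestamp chosen for illustration: %f always renders six digits, so slicing off seven characters removes exactly the dot and the microseconds.

```python
from datetime import datetime

# Arbitrary fixed moment so the result is reproducible.
moment = datetime(2020, 10, 14, 22, 49, 5, 123456)

sliced = moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-7]
direct = moment.strftime("%Y-%m-%d %H:%M:%S")

print(sliced)
print(direct)
```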

The if v is not None filter should probably be in the search_node function. That way you won't have to do the whole filling a variable, copying out the non-None values and replacing the variable rigmarole.

I would probably replace node_str with something like child_xpath, and children with something like child_element_names, for clarity.
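Putting the last few suggestions together, search_node might look roughly like the sketch below. It assumes Python 3.9+ for the built-in generics, and the inline XML is a cut-down stand-in for the sample file:

```python
import xml.etree.ElementTree as ET


def search_node(
    root: ET.Element, node: str, child_element_names: list[str]
) -> dict[str, str]:
    """Return the text of each child that exists; skip missing children."""
    values: dict[str, str] = {}
    for child in child_element_names:
        child_xpath = f"{node}/{child}"
        text = root.findtext(child_xpath)
        if text is not None:
            # Escape underscores because the keys end up in a LaTeX table.
            values[child.replace("_", "\\_")] = text
    return values


root = ET.fromstring(
    "<Simulation><Models><PostProcessor><KDD>"
    "<SKLtype>cluster|KMeans</SKLtype><n_clusters>12</n_clusters>"
    "</KDD></PostProcessor></Models></Simulation>"
)
pp_info = search_node(
    root,
    "Models/PostProcessor/KDD",
    ["SKLtype", "n_clusters", "tol", "random_state"],
)
print(pp_info)
```

With the filter inside the function, the return type no longer needs Optional, and parse_xml can merge the results without a second pass.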

RootTwo · Accepted Answer · 2020-10-15 20:14:54Z

The effect of merging the dicts, without actually creating a new dict, can be achieved with collections.ChainMap.

Rather than clearing info_dict, updating it from filtered and returning it, just return filtered.

from collections import ChainMap

# Merge all dictionaries
# a key gets its value from the earliest dict that contains it
info_dict = ChainMap(misc_info, samp_info, pp_info, model_info, case_info)

# Drop any keys with None values to filter the table
filtered = {k: v for k, v in info_dict.items() if v is not None}
return filtered
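The argument order matters here: ChainMap searches its mappings from left to right, so the dicts are listed in the reverse of the {**a, **b} order to get the same result. A small self-contained sketch with made-up values (the duplicate "created" key is there only to show precedence):

```python
from collections import ChainMap

# Illustrative data; "created" appears twice to demonstrate precedence.
case_info = {"state": "IL", "created": "stale value"}
samp_info = {"limit": "8", "initialSeed": None}
misc_info = {"created": "2020-10-15 20:14:54"}

# Earlier mappings win, so misc_info overrides case_info.
merged = ChainMap(misc_info, samp_info, case_info)

# Drop keys whose value is None, as in the original code.
filtered = {key: value for key, value in merged.items() if value is not None}
print(filtered)
```

The ChainMap itself is only a view over the underlying dicts; the comprehension is what produces the new, independent dict that gets returned.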
