I work on a codebase that uses XML to set up problems and specify model parameters. I've written a script that I run in tandem with our code. It parses the important model information from the most recent XML file and stores it so that it eventually ends up in a LaTeX document. This helps me keep track of the model parameters I've tried and aids reproducibility.
One problem I've come across is that, as I change model parameters, certain nodes get deleted from the XML file, which causes my script to crash. To work around this, I've created a solution that attempts to parse what I want and, if it doesn't find it, just returns an empty dictionary.
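To make the failure mode concrete, here is a minimal sketch. The cut-down XML string is made up for illustration and is not our real input. It shows that `findtext` quietly returns `None` for a missing path, whereas `find` also returns `None`, so reading `.attrib` from it directly would raise an `AttributeError`:

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down input: the <Segment> node has been deleted.
root = ET.fromstring(
    "<Simulation><Models><ROM><P>2</P></ROM></Models></Simulation>"
)

# findtext returns the element text, or None when the path does not match.
p_value = root.findtext("Models/ROM/P")
missing_text = root.findtext("Models/ROM/Q")

# find returns None for a missing node, so guard before touching .attrib
# and fall back to an empty dictionary.
segment = root.find("Models/ROM/Segment/subspace")
attributes = dict(segment.attrib) if segment is not None else {}
```

The explicit `is not None` check matters here because an element with no children is falsy, so a plain `if segment:` could skip a node that does exist.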
This leads me to merging a bunch of dictionaries, and I'm not sure this is the most idiomatic or efficient approach. For this review I would like any feedback on how to approach the problem better, plus any styling or formatting suggestions.
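For reference, the merge-then-filter pattern I'm asking about boils down to something like the sketch below. The helper name `merge_non_null` is hypothetical and does not appear in my script; it only isolates the idea:

```python
def merge_non_null(*dicts: dict) -> dict:
    """Merge dictionaries left to right, then drop keys whose value is None."""
    merged = {}
    for d in dicts:
        # Later dictionaries overwrite earlier ones on key collisions.
        merged.update(d)
    return {k: v for k, v in merged.items() if v is not None}


# Missing nodes show up as None values (or as empty dictionaries).
example = merge_non_null({"P": "2", "tol": None}, {}, {"limit": "8"})
```

Here `example` keeps only the `P` and `limit` entries; the `tol` key is dropped because its value is `None`.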
Here is a sample XML file, ./low_tax/il_train.xml:
<Simulation>
  <Models>
    <ROM name="arma" subType="ARMA">
      <P>2</P>
      <Q>2</Q>
      <Fourier>8760, 2190, 168, 24, 12, 8, 6, 3</Fourier>
      <Segment grouping='interpolate'>
        <subspace pivotLength='168' shift='zero'>HOUR</subspace>
      </Segment>
    </ROM>
    <PostProcessor>
      <KDD>
        <Features>TOTALLOAD</Features>
        <SKLtype>cluster|KMeans</SKLtype>
        <n_clusters>12</n_clusters>
      </KDD>
    </PostProcessor>
  </Models>
  <Samplers>
    <MonteCarlo>
      <samplerInit>
        <limit>8</limit>
        <initialSeed>42</initialSeed>
      </samplerInit>
    </MonteCarlo>
  </Samplers>
</Simulation>
Here is the Python code:
#!/usr/bin/env python3
from pathlib import Path
import xml.etree.ElementTree as ET
from datetime import datetime
def search_node(root: ET.Element, node: str, children: list) -> dict:
    """
    Return dictionary containing information requested from node children.
    @In: root, ET.Element, root node of xml tree.
    @In: node, str, a string containing xpath to parent node of interest.
    @In: children, list, a list of expected children nodes.
    @Out: dict, a dictionary containing retrieved information for node.
    """
    node_str = node + "/{child}"
    values = {
        # This information will be placed in LaTeX table;
        # Therefore, we need to preemptively escape underscores.
        k.replace("_", "\\_"): root.findtext(node_str.format(child=k)) for k in children
    }
    return values
def parse_xml(xml_file: Path) -> dict:
    """
    Parse model information from xml file.
    @In: xml_file, Path, path to current specified xml_file.
    @Out: dict, a dictionary of information parsed from xml.
    """
    root = ET.parse(xml_file).getroot()
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Information parsed from xml file.
    case_info = {
        "state": xml_file.name.split("_")[0].upper(),
        "strategy": xml_file.resolve().parent.name,
    }
    model_info = search_node(root, "Models/ROM", ["P", "Q", "Fourier"])
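    # The subspace node may be missing; only merge its attributes if present.
    subspace = root.find("Models/ROM/Segment/subspace")
    if subspace is not None:
        model_info = {**model_info, **subspace.attrib}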
    pp_info = search_node(
        root,
        "Models/PostProcessor/KDD",
        ["SKLtype", "n_clusters", "tol", "random_state"],
    )
    samp_info = search_node(root, "Samplers/MonteCarlo/samplerInit", ["limit"])
    misc_info = {"created": now}
    # Merge all dictionaries
    # This should allow us to not fail on missing nodes
    info_dict = {**case_info, **model_info, **pp_info, **samp_info, **misc_info}
    # Drop any keys with None values to filter the table
    return {k: v for k, v in info_dict.items() if v is not None}
if __name__ == "__main__":
    xml_file = Path("./low_tax/il_train.xml").resolve()
    model_info = parse_xml(xml_file)
    print(model_info)
    