Translating dictionary keys in complex nested Python structures

Question

This is an actual work problem we had to solve. Put simply: given a structure (e.g. nested dictionaries) and a mapping of old dictionary keys to new ones, produce a new structure that is anatomically identical to the original, uses the new dictionary keys, and preserves every other value.

How to encode this mapping?
How to go about the translation?

Context

We receive these dictionaries in the form of JSON files through an API and, because of extraneous constraints, the sender doesn't have access to our internal nomenclature system. So we need to convert the names ourselves.

Assembling the mappings is quite a laborious manual effort, as it involves figuring out semantics and talking to people. We are obviously working on a better solution, but these constraints will hold us for a while longer.

Details

Suppose a system receives JSON messages such as:

msg = {
    "id": 1,
    "summary": {
        "origin": {
            "url": "url",
            "slug": "slug"
        },
        "tags": ["a", "b"]
    },
    "items": [
        {
            "id": "abc",
            "price": 50
        },
        {
            "id": "def",
            "price": 110,
            "discount": 50
        }   
    ]
}

But in order to move the data forward, the names of the dictionary keys must follow a specific nomenclature. So they must be translated, like so:

translated_msg = {
    "IDENTIF": 1,
    "SUMM": {
        "ORIG": {
            "WEBADDRESS": "url",
            "LOCATOR": "slug"
        },
        "TAGS": ["a", "b"]
    },
    "PURCHASEDGOODS": [
        {
            "GOODSID": "abc",
            "GOODSPRICE": 50
        },
        {
            "GOODSID": "def",
            "GOODSPRICE": 110,
            "GIVENDISCOUNT": 50
        }   
    ]
}

The new terminology comes from a translation dictionary that has to be manually built by someone who is familiar with the data and the nomenclature to be followed. This field map must also encode the anatomy of the original structure because there may be multiple fields with the same name but at different depths. Notice the two id fields above.

Solution

With all this in mind, here is a field map structure which fits the criteria. Its syntax is part of the solution I came up with and can be modified.

field_map = {
    "/id": "IDENTIF",
    "/summary": "SUMM",
    "/summary/origin": "ORIG",
    "/summary/origin/url": "WEBADDRESS",
    "/summary/origin/slug": "LOCATOR",
    "/summary/tags": "TAGS",
    "/items": "PURCHASEDGOODS",
    "/items//id": "GOODSID",
    "/items//price": "GOODSPRICE",
    "/items//discount": "GIVENDISCOUNT",
}

Notice that /items//discount has two consecutive slashes. Each slash descends one level in the structure, and a list level has no key of its own, so it contributes an empty segment between two slashes.
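To make the path convention concrete, here is a minimal sketch that lists the field-map paths present in a message. The helper name iter_paths is hypothetical and not part of the solution below; it only mirrors the prefix rules described above (one "/" per dictionary level, an empty segment per list level), and could serve as a starting point when assembling a field map by hand.

```python
def iter_paths(structure, prefix=""):
    """Yield the field-map path of every dictionary key in a nested structure."""
    if isinstance(structure, dict):
        for key, value in structure.items():
            path = f"{prefix}/{key}"
            yield path
            yield from iter_paths(value, path)
    elif isinstance(structure, (list, tuple)):
        for item in structure:
            # A list level has no key, so it adds an empty segment: "//".
            yield from iter_paths(item, f"{prefix}/")


msg = {
    "id": 1,
    "summary": {"tags": ["a", "b"]},
    "items": [{"id": "abc", "price": 50}, {"id": "def", "discount": 50}],
}

# Items of the same list share paths, so duplicates are collapsed with set().
paths = sorted(set(iter_paths(msg)))
for path in paths:
    print(path)
```

Keys that appear in only some list items (such as discount here) still show up once, since every item is visited.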

Inspired by https://stackoverflow.com/a/40857703/10504841, here is a recursive solution that, given a structure and a field map, walks through the entire structure and builds a translated copy:

from typing import Iterable, Union

def is_valid_iterable(struct):
    return isinstance(struct, Iterable) and not isinstance(
        struct, (str, bytes)
    )

def is_key_in_dict(key, dict_):
    try:
        _ = dict_[key]
        return True
    except KeyError:
        return False

def translate_nested_structure(
    structure: Union[dict, list, tuple], trans_dict: dict, prefix: str = ""
) -> Union[dict, list, tuple]:
    """
    Translate dictionary keys in a nested structure using a translation
    dictionary. Maintains the same structure and primitive values.
    Useful for translating JSON and Avro messages

    If a key is present in the structure but not in the translation dictionary,
    it is understood as undesired and removed from the output structure

    If a (sub)structure is made of only lists or tuples, the output
    is simply a copy of the given (sub)structure

    Supported types and content limitation for dictionary (sub)structures
    ------------------------------------------------------
    Key fields can be of any primitive type or None.
    Tuple keys are somewhat supported, but not fully tested and not documented.
    "/" are not allowed inside string keys, see translation dictionary syntax

    Value field can be lists, tuples, dicts, any primitive or None

    Translation dictionary syntax
    ------------------------------
    The translation dictionary must capture the anatomy of the nested
    structure, as different nested keys may share the same name.

    The syntax for the translation dictionary keys is made of
    "/"s and `orig_key`s.

    "/" are used to indicate going deeper within the structure,
    so "/" may not be present inside string keys in the structure.
    Also, the number of preceding "/" should match the nesting level
    of the (sub)structure

    `orig_key` are pieces of string which contain
    the name of the specified original key in the structure.

    The syntax for the keys is easier to understand if thought of backwards:
    every key must end with an `orig_key`, since those are
    what need to be translated. A single preceding "/"
    indicates `orig_key` is a key inside another dictionary
    (e.g. "/start/in_a_dict"). In this case,
    unless `orig_key` is the first key (e.g. "/test"), the "/"
    must be preceded by another `orig_key` (e.g. "/start/test").
    Multiple preceding "/" indicate `orig_key` is in a
    list or tuple (e.g. "/start//in_a_list", "//start").

    Since the translation dictionary values contain the desired
    new translated (sub)structure keys, the syntax and supported types are
    the same as the original structure syntax for keys. See above

    Parameters
    ----------
    structure: [dict | list | tuple]
        Nested dict, list or tuple.
    trans_dict: dict
        Translation dictionary, see example below.
    prefix: str
        Prefix used to find keys in the translation dictionary, leave blank

    Returns
    -------
    translated_structure: [dict, list, tuple]
        Same structure, but with translated dictionary keys

    Examples
    --------
    >>> sample_msg = {
    ...     "a": {
    ...         "b": ["c", "d"],
    ...         "e": [
    ...             {
    ...                 "f": {"g": "h"},
    ...             },
    ...             {
    ...                 "f": {"g": "h", "g2": "h2"},
    ...             },
    ...         ],
    ...         "i": None,
    ...         "j": [],
    ...     },
    ... }

    >>> sample_translated_msg = {
    ...     "aaaa": {
    ...         "bbbb": ["c", "d"],
    ...         "eeee": [
    ...             {
    ...                 "ffff": {"gggg": "h"},
    ...             },
    ...             {
    ...                 "ffff": {"gggg": "h", "gggg2222": "h2"},
    ...             },
    ...         ],
    ...         "iiii": None,
    ...         "jjjj": [],
    ...     },
    ... }

    >>> sample_field_map = {
    ...     "/a": "aaaa",
    ...     "/a/b": "bbbb",
    ...     "/a/e": "eeee",
    ...     "/a/e//f": "ffff",
    ...     "/a/e//f/g": "gggg",
    ...     "/a/e//f/g2": "gggg2222",
    ...     "/a/i": "iiii",
    ...     "/a/j": "jjjj",
    ... }

    >>> translated_msg = translate_nested_structure(
    ...         sample_msg, sample_field_map
    ...     )
    >>> translated_msg == sample_translated_msg
    True

    TODO
    ----
    - Improve the trans dict syntax?

    """

    def translate_dict(dict_struct, trans_dict, prefix=""):
        if not isinstance(dict_struct, dict):
            raise TypeError(f"Expected dict, received {type(dict_struct)}")

        new_dict = dict()
        for key, value in dict_struct.items():
            new_prefix = "/".join([prefix, str(key)])
            if not is_key_in_dict(new_prefix, trans_dict):
                continue

            new_key = trans_dict[new_prefix]
            if is_valid_iterable(value):
                new_value = translate_nested_structure(
                    value, trans_dict, new_prefix
                )
            else:
                new_value = value
            new_dict[new_key] = new_value
        return new_dict

    def translate_simple_struct(simple_struct, trans_dict, prefix=""):
        if not isinstance(simple_struct, (list, tuple)):
            raise TypeError(
                f"Expected list or tuple, received {type(simple_struct)}"
            )

        cls_ = type(simple_struct)
        new_simple_struct = cls_([])
        for item in simple_struct:
            new_prefix = "/".join([prefix, ""])
            if is_valid_iterable(item):
                new_item = translate_nested_structure(
                    item, trans_dict, new_prefix
                )
            else:
                new_item = item
            new_simple_struct += cls_([new_item])
        return new_simple_struct

    if isinstance(structure, dict):
        return translate_dict(structure, trans_dict, prefix)
    else:
        return translate_simple_struct(structure, trans_dict, prefix)

About tuples as dictionary keys: I tested a bit and it is possible to encode tuples in the current version of the field map encoding, but the syntax can become quite complicated, so I decided to leave them out for now. The encoding should be as human friendly as possible.

What are your thoughts on the code itself?
Do you have any suggestions on how to improve the encoding syntax?
What about increasing the level of abstraction and supporting more structures, such as sets, classes or custom Iterables?
I'd also like to hear if other people face similar problems. How often, if at all, do people need to translate dictionary keys like this?

Few things seem to be missing, at least translate_nested_structure and is_valid_iterable.
– vnp, Commented Jun 17, 2022 at 19:58
@Reinderien This is an actual work problem we are facing, I just tried to summarize and I guess it got confusing. We receive this data in JSON files through an API and, because of extraneous constraints, the sender doesn't have access to our internal nomenclature, so we need to convert the names ourselves. Assembling the mappings is quite an effort, as it involves figuring out semantics and talking to people. We are obviously working on a better solution, but this solution will have to do for some time.
– pbsb, Commented Jun 17, 2022 at 21:54
Yes, I came up with that syntax and it can be modified. I edited the question to include the clarifications brought up so far.
– pbsb, Commented Jun 17, 2022 at 22:09

Reinderien · Accepted Answer · 2022-06-18 01:26:27Z

How to encode this mapping?

Not the way you've done it, I think. Zen says explicit is better than implicit, and your current mapping is highly implicit. You have a magic double-slash to indicate a list level, and you have an O(n²) problem with your key expressions. These are avoidable problems: don't think of your mapping as being flat, over-the-wire JSON data; think of it as well-typed, well-structured in-memory data. There's no reason for you to write a parsing layer if you don't need it.

Aside: translating from one dict-lasagna domain to another is evidence of a broader, more severe problem with lack of good models (or perhaps no models at all), but you have not shown enough other code for this to be talked about meaningfully.

If what you say is true and these data come directly from JSON, then you need to drop the code that cares about tuples because these will never happen.

Picking up on a few granular review issues (though perhaps these are moot since I'm suggesting that you throw all of the existing code away):

is_valid_iterable should only check isinstance(struct, (dict, list))
is_key_in_dict needs to die, and the call needs to be replaced with key in some_dict
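A short sketch of the points above, using only the standard library. The helper name is_container is hypothetical; the round-trip through json simply demonstrates that decoded JSON arrays always arrive as lists, which is why tuple handling is unnecessary for data that really comes from JSON.

```python
import json

# Tuples are serialized as JSON arrays and always decode back to lists.
raw = json.dumps({"tags": ("a", "b"), "origin": {"url": "url"}})
decoded = json.loads(raw)
print(type(decoded["tags"]).__name__)


def is_container(value):
    """True only for the two container types that json.loads can return."""
    return isinstance(value, (dict, list))


field_map = {"/id": "IDENTIF"}

# A plain membership test replaces the try/except helper.
has_id = "/id" in field_map
has_other = "/missing" in field_map
print(has_id, has_other)
```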

Suggested

A re-thought mapping could make use of simple polymorphism, with nary an isinstance in sight:

from dataclasses import dataclass, field
from typing import Any, Union, Optional

Payload = Union[dict[str, Any], list[Any]]


@dataclass
class Node:
    replacement: Optional[str] = None

    def translate(self, structure: Payload) -> Payload:
        return structure


@dataclass
class DictNode(Node):
    nodes: dict[str, 'Node'] = field(default_factory=dict)

    def translate(self, structure: Payload) -> Payload:
        translated = {}
        for key, value in structure.items():
            translator = self.nodes.get(key)
            if translator:
                key = translator.replacement or key
                value = translator.translate(value)
            translated[key] = value
        return translated


class ListNode(DictNode):
    def translate(self, structure: Payload) -> Payload:
        return [
            super(ListNode, self).translate(item)
            for item in structure
        ]


def test() -> None:
    from pprint import pprint

    msg = {
        'id': 1,
        'items': [{'id': 'abc', 'price': 50},
                  {'discount': 50, 'id': 'def', 'price': 110}],
        'summary': {'origin': {'slug': 'slug', 'url': 'url'}, 'tags': ['a', 'b']}
    }

    field_map = DictNode(nodes={
        'id': Node('IDENTIF'),
        'summary': DictNode('SUMM', {
            'origin': DictNode('ORIG', {
                'url': Node('WEBADDRESS'),
                'slug': Node('LOCATOR'),
            }),
            'tags': Node('TAGS'),
        }),
        'items': ListNode('PURCHASEDGOODS', {
            'id': Node('GOODSID'),
            'price': Node('GOODSPRICE'),
            'discount': Node('GIVENDISCOUNT'),
        }),
    })

    pprint(field_map.translate(msg))


if __name__ == '__main__':
    test()

Output

{'IDENTIF': 1,
 'PURCHASEDGOODS': [{'GOODSID': 'abc', 'GOODSPRICE': 50},
                    {'GIVENDISCOUNT': 50, 'GOODSID': 'def', 'GOODSPRICE': 110}],
 'SUMM': {'ORIG': {'LOCATOR': 'slug', 'WEBADDRESS': 'url'}, 'TAGS': ['a', 'b']}}
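One behavioural difference worth noting: the DictNode above copies keys that have no entry in the map, whereas the docstring in the question says unmapped keys are dropped. If dropping is the desired behaviour, a variant along these lines might work. StrictDictNode is a hypothetical name, and this is a sketch under that assumption, not part of the reviewed code.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    replacement: Optional[str] = None

    def translate(self, structure: Any) -> Any:
        return structure


@dataclass
class StrictDictNode(Node):
    # Hypothetical variant of DictNode: unmapped keys are dropped, not copied.
    nodes: dict[str, Node] = field(default_factory=dict)

    def translate(self, structure: Any) -> Any:
        translated = {}
        for key, value in structure.items():
            translator = self.nodes.get(key)
            if translator is None:
                continue  # no entry in the map: treat the key as unwanted
            new_key = translator.replacement or key
            translated[new_key] = translator.translate(value)
        return translated


field_map = StrictDictNode(nodes={
    "id": Node("IDENTIF"),
    "summary": StrictDictNode("SUMM", {"tags": Node("TAGS")}),
})

msg = {"id": 1, "summary": {"tags": ["a"], "origin": {"url": "url"}}, "extra": True}
result = field_map.translate(msg)
print(result)  # {'IDENTIF': 1, 'SUMM': {'TAGS': ['a']}}
```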

lukstru · Accepted Answer · 2022-06-17 20:41:33Z


I'd also hear if other people face problems similar to these. How often do people need to translate dictionary keys like this?

I'd say it's very unusual. In my experience, such dicts are either constructed from JSON or similar formats to give users / admins a friendly way to script without having any programming knowledge - and internally changing the keys makes no sense, except to increase complexity.

The other way dicts are used is in a programming context where associated data must be stored together. In this context, mostly constants are used as the keys, or input that stays constant. Again, internally changing the keys makes no sense.

The one purpose for which I could see dicts used in this way is when the dict is a control mechanism, similar to a script engine but completely defined and used by developers. It can make some actual code look extremely neat and clean; however, in my opinion it goes against the principle of making code explicit - and therefore decreases readability and understandability.


I see your point but I am not familiar with script engines. Can you point me in the right direction? This is what comes up when I google it: stackoverflow.com/q/1691201/10504841 and docs.oracle.com/javase/7/docs/api/javax/script/…
– pbsb, Commented Jun 17, 2022 at 22:19
I think I also failed to contextualize properly. This is an actual work problem we had to solve. I added some more details to the question
– pbsb, Commented Jun 17, 2022 at 22:20
@pbsb I don't know the actual terminology, but I had a project once where I noticed that the behaviour of my program was highly dependent and similar to the data I input and my configs. So much that I had lambdas in dicts and could 'script' entire behaviours with JSON files only. The code mostly worked with the dicts to transition between states and execute calls to other programs that were defined in said JSON. I named it script engine since it took in some 'script' - JSON - and executed code dependent on it. Not entirely like an interpreter, but a sized down, specialized version.
– lukstru, Commented Jun 18, 2022 at 12:39
TLDR: A sized down, specialized version of an interpreter.
– lukstru, Commented Jun 18, 2022 at 12:40
And to point you in the right direction, I'd suggest learning more about compilers (and interpreters, but they're included in compilers). We had very good courses in university, but I don't know how to get that good information outside university. The course was called introduction to compiler construction and conveyed the basics very well. EDIT: don't know how much they fit your problem though, I don't think it's what you're searching for. Doesn't hurt though, it was fun getting to know the magic behind compilers!
– lukstru, Commented Jun 18, 2022 at 12:43

FMc · Accepted Answer · 2022-06-18 18:35:11Z

How certain are you that you will only need key renaming? If this is a real project, your current needs are likely a simplification of your eventual needs. That's just how living software projects behave: you need something, you build something, and the experience with that built thing causes you to need other or different things. Currently, you seem to be performing a simple task: preserving the structural characteristics of the data while renaming the keys. What is the probability that you will need other things in the future: for example, value conversion (eg, int to float) or full-blown data restructuring?

Your need is not novel: do more research to learn how others have dealt with the problem. The Python ecosystem has libraries to perform different kinds of data remappings: here is one called jsonbender. I've never used it and cannot comment on its quality, but a quick scan through the README points to some issues you might want to consider -- notably, dealing with lists, configuring optionality, and building in support for callables to handle computation needs that cannot be easily expressed via a simple configuration syntax (in my own professional experience, the latter has been especially powerful on projects having some overlap with your needs).

Your implementation seems backwards and is thus too limiting. Like one of the reviews, your remapping (in field_map) strikes me as backwards: it maps old paths/keys to new paths/keys. But that is limiting because it provides no mechanism for controlling the output structure. It also seems less intuitive than the alternative -- namely, declaring the structure you want and then, at the leaf nodes, defining how/where to retrieve values from the source. I would encourage you to define the remapping from the perspective of the desired data. For example, if we focus just on the IDENTIF and SUMM keys (plus a FOO key added for illustration), one could define a remapping as follows. Each leaf value can be obtained by diving down through the hierarchy based on the keys declared in each tuple. Even though this example handles only the easy situations in your current problem, it does illustrate -- at least to my eye -- the intuitiveness of defining the remapping from the perspective of the desired output, as well as its greater flexibility in terms of data restructuring, should that need ever arise.

remapping = {
    # Simple dict-to-dict key renaming via data-diving tuples.
    "IDENTIF": ('id',),
    "SUMM": {
        "ORIG": {
            "WEBADDRESS": ('summary', 'origin', 'url'),
            "LOCATOR": ('summary', 'origin', 'slug'),
        },
        "TAGS": ('summary', 'tags'),
    },
    # Restructuring and even reuse of source nodes is possible.
    "FOO": {
        "BAR": ('id',),
    },
}
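The answer does not show how such a remapping would be applied. A minimal sketch, assuming the convention just described (a dict in the configuration produces a dict in the output, a tuple is a path of keys into the source), could look like this. The function names dive and remap are hypothetical.

```python
def dive(source, path):
    """Follow a tuple of keys down into nested dicts and return the value."""
    for key in path:
        source = source[key]
    return source


def remap(source, spec):
    """Build output shaped like spec; tuples are resolved against source."""
    if isinstance(spec, tuple):
        return dive(source, spec)
    return {new_key: remap(source, sub) for new_key, sub in spec.items()}


msg = {
    "id": 1,
    "summary": {"origin": {"url": "url", "slug": "slug"}, "tags": ["a", "b"]},
}

remapping = {
    "IDENTIF": ("id",),
    "SUMM": {
        "ORIG": {
            "WEBADDRESS": ("summary", "origin", "url"),
            "LOCATOR": ("summary", "origin", "slug"),
        },
        "TAGS": ("summary", "tags"),
    },
    "FOO": {"BAR": ("id",)},
}

result = remap(msg, remapping)
print(result)
```

Note that every path is resolved from the top of the source, which is what allows the same source node to be reused in several places of the output.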

Dealing with pesky lists. That simple plan falters when it comes to lists. Your workaround was a double-slash convention and one reviewer suggests using explicit types like Node, DictNode, and ListNode to configure the needed remappings. A middle-ground is to continue with the simplicity of your convention-based approach but to make it a bit more rigorous. The illustration above relies on the convention that a dict in the remapping configuration produces a dict in the output data. We could do the same with lists. The example below would be interpreted as follows: PURCHASEDGOODS will hold a list; we obtain the source data for that list from the key(s) declared inside the list; and the final element of the configuration-list will contain the specification for how to build individual values composing the list. I'm not necessarily advocating this approach, but it does illustrate a low-tech, convention-based approach with greater intuitiveness and flexibility than your current idea.

remapping = {
    ...
    "PURCHASEDGOODS": ['items',
        {
            "GOODSID": ('id',),
            "GOODSPRICE": ('price',),
            "GIVENDISCOUNT": ('discount',),
        }
    ],
}
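Read literally, the list convention could be interpreted as in the sketch below. This is one possible reading, not code from the answer: the leading elements of a configuration list are taken as the path to the source list, the last element as the per-item specification, and paths inside that specification are resolved relative to each item. The discount key is left out of the demo because this sketch has no notion of optional keys and the first item lacks it.

```python
def remap(source, spec):
    """Interpret dicts, tuples and lists in spec by convention."""
    if isinstance(spec, tuple):
        # Tuple: a path of keys into the current source.
        for key in spec:
            source = source[key]
        return source
    if isinstance(spec, list):
        # List: leading elements locate the source list,
        # the final element describes how to build each output item.
        *path, item_spec = spec
        items = remap(source, tuple(path))
        return [remap(item, item_spec) for item in items]
    # Dict: produces a dict with the same shape in the output.
    return {new_key: remap(source, sub) for new_key, sub in spec.items()}


msg = {
    "id": 1,
    "items": [
        {"id": "abc", "price": 50},
        {"id": "def", "price": 110, "discount": 50},
    ],
}

remapping = {
    "IDENTIF": ("id",),
    "PURCHASEDGOODS": ["items", {"GOODSID": ("id",), "GOODSPRICE": ("price",)}],
}

result = remap(msg, remapping)
print(result)
```

Adding "GIVENDISCOUNT": ("discount",) to the item specification would raise KeyError on the first item, which is the optionality problem mentioned earlier.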

Making that approach a bit more formal via explicit types. Another middle-ground is something like the following. It still relies on some conventional behavior relating to dicts, but it does have explicit types to distinguish the two primary ways to retrieve data from the source: (1) simple data-diving via a tuple of keys or (2) data-diving over a source list to produce an output list. One benefit of at least adding two types like these is that they provide a mechanism to configure optionality: for example, Diver('discount', default = 0). It would also provide a way to pass in callables to handle more complex needs or even simple value-conversion behavior you might want in the future: for example, Diver('discount', default = 0, convert = float).

remapping = {
    "IDENTIF": Diver('id'),
    "SUMM": {
        "ORIG": {
            "WEBADDRESS": Diver('summary', 'origin', 'url'),
            "LOCATOR": Diver('summary', 'origin', 'slug'),
        },
        "TAGS": Diver('summary', 'tags'),
    },
    "PURCHASEDGOODS": ListDiver('items',
        {
            "GOODSID": Diver('id'),
            "GOODSPRICE": Diver('price'),
            "GIVENDISCOUNT": Diver('discount'),
        },
    ),
}
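Diver and ListDiver are named in the answer but never defined. The sketch below is one hypothetical way to implement them, assuming the default and convert keyword arguments behave as the preceding paragraph suggests; whether convert should also apply to the default value is a design choice made here, not something the answer specifies.

```python
_MISSING = object()  # sentinel: distinguishes "no default" from default=None


class Diver:
    """Fetch one value by following a path of keys into the source."""

    def __init__(self, *path, default=_MISSING, convert=None):
        self.path = path
        self.default = default
        self.convert = convert

    def extract(self, source):
        try:
            for key in self.path:
                source = source[key]
        except KeyError:
            if self.default is _MISSING:
                raise
            source = self.default
        # Assumption: convert is applied to defaults as well.
        return self.convert(source) if self.convert else source


class ListDiver:
    """Dive to a source list and remap each of its items."""

    def __init__(self, *args):
        self.path = args[:-1]      # keys leading to the source list
        self.item_spec = args[-1]  # specification for each output item

    def extract(self, source):
        for key in self.path:
            source = source[key]
        return [remap(item, self.item_spec) for item in source]


def remap(source, spec):
    if isinstance(spec, (Diver, ListDiver)):
        return spec.extract(source)
    return {new_key: remap(source, sub) for new_key, sub in spec.items()}


msg = {
    "id": 1,
    "items": [
        {"id": "abc", "price": 50},
        {"id": "def", "price": 110, "discount": 50},
    ],
}

remapping = {
    "IDENTIF": Diver("id"),
    "PURCHASEDGOODS": ListDiver("items", {
        "GOODSID": Diver("id"),
        "GOODSPRICE": Diver("price", convert=float),
        "GIVENDISCOUNT": Diver("discount", default=0),
    }),
}

result = remap(msg, remapping)
print(result)
```

With default=0 the missing discount in the first item no longer raises, and convert=float shows the kind of value conversion the first paragraph of this answer anticipates.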

Other possibilities. The next obvious extension is to formalize the dict-related configuration more explicitly (eg DictDiver). Whether that's worth the trouble depends on your expectations for the future of the project. To my mind, that step seems the least compelling: at a certain point, every project must adopt a variety of conventions and it's no crime to embrace them if they are intuitive and reasonable. If you were to take that step, you would end up with an approach similar to the substantive review you already have, but with the reversed orientation discussed above. Finally, I'll re-emphasize the recommendation to research other libraries that perform this kind of data conversion. Even if you end up adopting a low-tech, convention-based solution, your decision-making should be guided by how others have thought about this topic. And you might get lucky and find a library that already does exactly what you need.
