Serializing (nested) data structures in a human-readable format

Question

I am reinventing the wheel by writing a function that serializes a (nested) data structure in a human-readable way. The default output is deliberately similar to that of json.dumps(var, indent=4), and I did my best to mimic the format of json's output.

But the output is fundamentally different from json. More specifically, the data types of the dictionary keys are preserved and all data types that can be valid dictionary keys are unchanged. For instance, int keys won't become str and tuple keys are allowed in this format whereas in json tuple keys are impossible. And dictionary keys aren't indented, because I don't use nested data structures as keys.

And all the values and/or elements retain their data types, for instance True, False, None won't become true, false, null.

I built my function using repr, but this is not repr. repr supports all data types defined in one section, its output is not indented, and the data types of the containers themselves are unchanged. This function, by contrast, doesn't support all of those data types; it supports only the built-in container data types (dict, frozenset, list, set, tuple) and their subclasses. The containers are "promoted" to the supported data type from which they inherit. Only the data types that aren't considered containers (the data types that can't be nested) are retained. And the output is far more human-readable than repr's.

I wrote this function to serialize nested dictionaries with int and tuple keys human readably, to store the data on the hard drive, so that I can load them later using ast.literal_eval. I want to allow int keys and I have to store dicts with tuple keys.
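To make the intended round trip concrete, here is a minimal sketch of loading such text back with ast.literal_eval. The sample text is made up for illustration. Note that literal_eval only accepts literals, so a call such as frozenset({...}) in the serialized text is rejected.

```python
import ast

# Made-up sample in the style the serializer aims to produce:
# int and tuple keys, with True/False/None kept as Python literals.
text = """{
    1: 'one',
    (2, 3): [
        True,
        False,
        None
    ]
}"""

loaded = ast.literal_eval(text)

# literal_eval evaluates literals only; a frozenset(...) call is not a literal.
try:
    ast.literal_eval("frozenset({1})")
    frozenset_ok = True
except ValueError:
    frozenset_ok = False
```

Key types survive the round trip here, which json.loads cannot offer because JSON object keys are always strings.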

Code

from typing import Union

def represent(obj: Union[dict, frozenset, list, set, tuple], indent: int=4) -> str:
    supported = (dict, frozenset, list, set, tuple)
    singles = (frozenset, list, set, tuple)
    if not isinstance(obj, supported):
        raise TypeError('argument `obj` should be an instance of a built-in container data type')
    if not isinstance(indent, int):
        raise TypeError('argument `indent` should be an `int`')
    if indent <= 0:
        raise ValueError('argument `indent` should be greater than 0')
    if indent % 4:
        raise ValueError('argument `indent` should be a multiple of 4')
    ls = list()
    if isinstance(obj, dict):
        start, end = '{}'
        for k, v in sorted(obj.items()):
            if not isinstance(v, supported):
                item = ' '*indent + repr(k) + ': ' + repr(v)
            else:
                item = ' '*indent + repr(k) + ': ' + represent(v, indent+4)
            ls.append(item)
    elif isinstance(obj, singles):
        enclosures = {
            0: ('frozenset({', '})'),
            1: '[]', 2: '{}', 3: '()'
        }
        index = 0
        for i in singles:
            if isinstance(obj, i):
                break
            index += 1
        start, end = enclosures[index]
        if index in (0, 2):
            obj = sorted(obj)
        for i in obj:
            if not isinstance(i, supported):
                item = ' '*indent + repr(i)
            else:
                item = represent(i, indent+4)
            ls.append(item)
    return start + '\n' + ',\n'.join(ls) + '\n' + ' ' * (indent - 4) + end

Example

import json
var = {1: {1: {1: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0},
   2: {1: {1: 0}, 2: 0},
   3: {1: 0}},
  2: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0},
  3: {1: {1: 0}, 2: 0}},
 2: {1: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0},
  2: {1: {1: 0}, 2: 0},
  3: {1: 0}},
 3: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0}}


dumped = json.dumps(var, indent=4)
repred = represent(var)

print('dumped:')
print(dumped)
print('repred:')
print(repred)

print(f'{(eval(dumped) == var)=}')
print(f'{(eval(repred) == var)=}')

var1 = {'a': {'a': {'a': [0, 0, 0], 'b': [0, 0, 1], 'c': [0, 0, 2]},
  'b': {'a': [0, 0, 1], 'b': [0, 1, 1], 'c': [0, 1, 2]},
  'c': {'a': [0, 0, 2], 'b': [0, 1, 2], 'c': [0, 2, 2]}},
 'b': {'a': {'a': [0, 0, 1], 'b': [0, 1, 1], 'c': [0, 1, 2]},
  'b': {'a': [0, 1, 1], 'b': [1, 1, 1], 'c': [1, 1, 2]},
  'c': {'a': [0, 1, 2], 'b': [1, 1, 2], 'c': [1, 2, 2]}},
 'c': {'a': {'a': [0, 0, 2], 'b': [0, 1, 2], 'c': [0, 2, 2]},
  'b': {'a': [0, 1, 2], 'b': [1, 1, 2], 'c': [1, 2, 2]},
  'c': {'a': [0, 2, 2], 'b': [1, 2, 2], 'c': [2, 2, 2]}}}

print(represent(var1))

dumped:
{
    "1": {
        "1": {
            "1": {
                "1": {
                    "1": {
                        "1": 0
                    },
                    "2": 0
                },
                "2": {
                    "1": 0
                },
                "3": 0
            },
            "2": {
                "1": {
                    "1": 0
                },
                "2": 0
            },
            "3": {
                "1": 0
            }
        },
        "2": {
            "1": {
                "1": {
                    "1": 0
                },
                "2": 0
            },
            "2": {
                "1": 0
            },
            "3": 0
        },
        "3": {
            "1": {
                "1": 0
            },
            "2": 0
        }
    },
    "2": {
        "1": {
            "1": {
                "1": {
                    "1": 0
                },
                "2": 0
            },
            "2": {
                "1": 0
            },
            "3": 0
        },
        "2": {
            "1": {
                "1": 0
            },
            "2": 0
        },
        "3": {
            "1": 0
        }
    },
    "3": {
        "1": {
            "1": {
                "1": 0
            },
            "2": 0
        },
        "2": {
            "1": 0
        },
        "3": 0
    }
}
repred:
{
    1: {
        1: {
            1: {
                1: {
                    1: {
                        1: 0
                    },
                    2: 0
                },
                2: {
                    1: 0
                },
                3: 0
            },
            2: {
                1: {
                    1: 0
                },
                2: 0
            },
            3: {
                1: 0
            }
        },
        2: {
            1: {
                1: {
                    1: 0
                },
                2: 0
            },
            2: {
                1: 0
            },
            3: 0
        },
        3: {
            1: {
                1: 0
            },
            2: 0
        }
    },
    2: {
        1: {
            1: {
                1: {
                    1: 0
                },
                2: 0
            },
            2: {
                1: 0
            },
            3: 0
        },
        2: {
            1: {
                1: 0
            },
            2: 0
        },
        3: {
            1: 0
        }
    },
    3: {
        1: {
            1: {
                1: 0
            },
            2: 0
        },
        2: {
            1: 0
        },
        3: 0
    }
}
(eval(dumped) == var)=False
(eval(repred) == var)=True
{
    'a': {
        'a': {
            'a': [
                0,
                0,
                0
            ],
            'b': [
                0,
                0,
                1
            ],
            'c': [
                0,
                0,
                2
            ]
        },
        'b': {
            'a': [
                0,
                0,
                1
            ],
            'b': [
                0,
                1,
                1
            ],
            'c': [
                0,
                1,
                2
            ]
        },
        'c': {
            'a': [
                0,
                0,
                2
            ],
            'b': [
                0,
                1,
                2
            ],
            'c': [
                0,
                2,
                2
            ]
        }
    },
    'b': {
        'a': {
            'a': [
                0,
                0,
                1
            ],
            'b': [
                0,
                1,
                1
            ],
            'c': [
                0,
                1,
                2
            ]
        },
        'b': {
            'a': [
                0,
                1,
                1
            ],
            'b': [
                1,
                1,
                1
            ],
            'c': [
                1,
                1,
                2
            ]
        },
        'c': {
            'a': [
                0,
                1,
                2
            ],
            'b': [
                1,
                1,
                2
            ],
            'c': [
                1,
                2,
                2
            ]
        }
    },
    'c': {
        'a': {
            'a': [
                0,
                0,
                2
            ],
            'b': [
                0,
                1,
                2
            ],
            'c': [
                0,
                2,
                2
            ]
        },
        'b': {
            'a': [
                0,
                1,
                2
            ],
            'b': [
                1,
                1,
                2
            ],
            'c': [
                1,
                2,
                2
            ]
        },
        'c': {
            'a': [
                0,
                2,
                2
            ],
            'b': [
                1,
                2,
                2
            ],
            'c': [
                2,
                2,
                2
            ]
        }
    }
}

I am mainly concerned about performance and memory consumption, and I want the function to execute as fast as possible while utilizing as little RAM as possible. How can it be more efficient?

Update

Actually sorting the items while serializing the dictionaries does introduce bugs that somehow change the data represented, breaking the original association between the key value pairs, and this is definitely not intended.

Removing the sorted calls eliminates the bug. I have fixed my copy of the code, but since this is Code Review and I have already received answers, I won't edit the code posted above (lest the update be rolled back), so I decided to point this out here.

More specifically,

This snippet:

if isinstance(obj, dict):
    start, end = '{}'
    for k, v in sorted(obj.items()):
        if not isinstance(v, supported):
            item = ' '*indent + repr(k) + ': ' + repr(v)
        else:
            item = ' '*indent + repr(k) + ': ' + represent(v, indent+4)
        ls.append(item)

MUST be changed to:

if isinstance(obj, dict):
    start, end = '{}'
    for k, v in obj.items():
        if not isinstance(v, supported):
            item = ' '*indent + repr(k) + ': ' + repr(v)
        else:
            item = ' '*indent + repr(k) + ': ' + represent(v, indent+4)
        ls.append(item)

I now think that sorting collections while serializing them is a terrible practice. I don't know whether this bug also affects nested frozensets (sets can't be nested because they are mutable and therefore unhashable); I haven't tested that yet. Still, I recommend dropping the sorted calls on frozenset and set too (frozensets can be nested inside sets and frozensets).

Update

Please review the latest version: Serializing (nested) data structures in a human-readable format with all bugs fixed

One small remark: if each level of indent has the same offset (in your case 4 characters per indent), then I would abstract that number of characters away. For example, rename the indent parameter of the represent function to indentLevel or indentCount or something along those lines, and define a function createIndent(indentLevel: int) -> str which returns ' ' * 4 * indentLevel. (Naming is perhaps not the best yet, but you get the idea.) Then replace all ' '*indent in your code by a call to that function, or call the function once and store the result in a variable.
– tjalling, Commented Oct 22, 2021 at 8:15

riskypenguin · Accepted Answer · 2021-10-22 15:28:11Z

A few things I noticed:

Since supported is a superset of singles I would recommend the following assignment:

singles = (frozenset, list, set, tuple)
supported = (dict, *singles)

This of course only makes sense if the two are also logically connected which I'd say is true here.

The assignment start, end = '{}' seems unintuitive to me, since string-unpacking isn't commonly used as far as I know.

I'd recommend start, end = '{', '}' if you want to stick to the one-liner or even better:

start = '{'
end = '}'

When traversing the nested data structures you always call your function with indent=indent + 4, regardless of the desired indent by the user. You should probably adjust that to match the user's preference (you might also want to allow indents that are not divisible by 4, e.g. indent-increments of 2 might be sensible for data structures with a lot of layers). For this you'll need to keep track of the indent-step as well as the total indentation amount.
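One possible way to track both values is sketched below. The name represent_step and the parameters step and level are hypothetical, and the sketch handles only dict and list to stay short.

```python
def represent_step(obj, step=4, level=1):
    """Hypothetical variant: `step` is spaces per level, `level` is the depth."""
    containers = (dict, list)  # assumption: only these two, for brevity
    pad = ' ' * (step * level)                # indentation of the items
    closing_pad = ' ' * (step * (level - 1))  # indentation of the closing bracket

    def render(value):
        # Recurse one level deeper for containers, fall back to repr otherwise.
        if isinstance(value, containers):
            return represent_step(value, step, level + 1)
        return repr(value)

    if isinstance(obj, dict):
        start, end = '{', '}'
        items = [f'{pad}{k!r}: {render(v)}' for k, v in obj.items()]
    elif isinstance(obj, list):
        start, end = '[', ']'
        items = [f'{pad}{render(v)}' for v in obj]
    else:
        raise TypeError('this sketch supports only dict and list')
    return start + '\n' + ',\n'.join(items) + '\n' + closing_pad + end


print(represent_step({1: [2, 3]}, step=2))
```

The caller chooses step once, and the recursion only ever increments level, so the total indentation is always step * level.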

if isinstance(obj, dict):
    start = '{'
    end = '}'
    for k, v in sorted(obj.items()):
        if not isinstance(v, supported):
            item = ' ' * indent + repr(k) + ': ' + repr(v)
        else:
            item = ' ' * indent + repr(k) + ': ' + represent(v, indent + 4)
        ls.append(item)

I don't see a particular reason to sort the key-value-pairs from the dictionary. On the contrary, since dicts in Python now preserve insertion order, this might actually change the data. So I'd drop the call to sorted() if you don't have an explicit reason for it.
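A short demonstration of both effects, using made-up data. Sorting discards the insertion order, and it fails outright once keys of different types (such as int and tuple, which the question wants to support) appear in the same dict.

```python
# Sorting replaces insertion order with key order.
data = {'b': 1, 'a': 2}
insertion_order = list(data)                         # ['b', 'a']
sorted_order = [k for k, _ in sorted(data.items())]  # ['a', 'b']

# int and tuple keys cannot be compared with each other, so sorted() raises.
mixed = {1: 'int key', (1, 2): 'tuple key'}
try:
    sorted(mixed.items())
    sort_failed = False
except TypeError:
    sort_failed = True
```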

I find this pattern

if not condition:
    # do A
else:
    # do B

to usually be better expressed as

if condition:
    # do B
else:
    # do A

or in your case:

for k, v in obj.items():
    if isinstance(v, supported):
        item = ' ' * indent + repr(k) + ': ' + represent(v, indent + 4)
    else:
        item = ' ' * indent + repr(k) + ': ' + repr(v)
    ls.append(item)

This whole snippet

enclosures = {
    0: ('frozenset({', '})'),
    1: '[]',
    2: '{}',
    3: '()'
}

index = 0
for i in singles:
    if isinstance(obj, i):
        break
    index += 1
start, end = enclosures[index]

is unnecessarily complicated, hard to read and rather un-pythonic. I see no particular reason to use indices instead of the types themselves:

enclosures = {
    frozenset: ('frozenset({', '})'),
    list: ('[', ']'),
    set: ('{', '}'),
    tuple: ('(', ')')
}

start, end = enclosures[type(obj)]

I would also remove this from the elif-case and add a dict-entry for dict: ('{', '}'), so you don't need to assign start, end in every case.

EDIT: As correctly pointed out in the comments, this will not work for objects that are instances of subclasses of the supported classes.

The following adaptation will work, while still getting rid of the (otherwise meaningless) indices, thereby making the approach less error-prone:

for cls, enclosure in enclosures.items():
    if isinstance(obj, cls):
        start, end = enclosure
        break

# or

for cls in enclosures.keys():
    if isinstance(obj, cls):
        start, end = enclosures[cls]
        break

Another approach would be using

inspect.getmro(obj.__class__)

which returns a tuple of the object's class and all of its superclasses in method resolution order, starting with the class itself.
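A minimal sketch of that lookup, assuming the enclosures dict from above extended with a dict entry. The helper name find_enclosure is hypothetical.

```python
import inspect
from collections import Counter, defaultdict

enclosures = {
    dict: ('{', '}'),
    frozenset: ('frozenset({', '})'),
    list: ('[', ']'),
    set: ('{', '}'),
    tuple: ('(', ')'),
}


def find_enclosure(obj):
    """Hypothetical helper: return the delimiters of the nearest supported base class."""
    # getmro yields the class itself first, then its bases, ending with object.
    for cls in inspect.getmro(type(obj)):
        if cls in enclosures:
            return enclosures[cls]
    raise TypeError(f'unsupported type: {type(obj).__name__}')


print(find_enclosure(defaultdict(list)))  # subclass of dict
print(find_enclosure(Counter('aab')))     # subclass of dict
```

Compared with looping over enclosures and calling isinstance, this walks the object's own class hierarchy, so the result does not depend on the order of the entries in the dict.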

This code snippet should not check magic numbers:

if index in (0, 2):
    obj = sorted(obj)

Instead you should have a mapping of supported datatypes to a sort flag, which then decides if you sort the collection. Again, you should consider whether you actually need / want to sort these collections.
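Such a mapping could look like the following sketch. The names SORT_BEFORE_DUMP and ordered_elements are hypothetical, and the flags simply mirror what the original code does (sorting frozenset and set only).

```python
# Hypothetical mapping: supported type -> whether its elements get sorted.
SORT_BEFORE_DUMP = {
    frozenset: True,
    list: False,
    set: True,
    tuple: False,
}


def ordered_elements(obj):
    """Return the elements of obj as a list, sorted only if its type is flagged."""
    for cls, should_sort in SORT_BEFORE_DUMP.items():
        if isinstance(obj, cls):  # isinstance also accepts subclasses
            return sorted(obj) if should_sort else list(obj)
    raise TypeError(f'unsupported type: {type(obj).__name__}')
```

Turning sorting off for a type then means changing one flag rather than hunting for index checks.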

Most of your string assignments lend themselves nicely to using f-strings:

item = ' ' * indent + repr(k) + ': ' + represent(v, indent + 4)

becomes

item = f"{' ' * indent}{k!r}: {represent(v, indent + 4)}"

As of now you're explicitly checking and handling all supported datatypes in a single function. You might want to consider introducing separate functions for different (kinds of) data types. This would make your main function more concise and easier to extend.
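One way to split the work, sketched here with functools.singledispatch from the standard library. This is an alternative structure and not the code under review; the names dump and _wrap are hypothetical, and only dict and list are registered to keep it short.

```python
from collections import defaultdict
from functools import singledispatch


@singledispatch
def dump(obj, indent=4):
    # Fallback for everything that is not a registered container.
    return repr(obj)


def _wrap(start, lines, end, indent):
    # Shared layout: one item per line, closing bracket one level further out.
    return start + '\n' + ',\n'.join(lines) + '\n' + ' ' * (indent - 4) + end


@dump.register(dict)
def _(obj, indent=4):
    pad = ' ' * indent
    lines = [f'{pad}{k!r}: {dump(v, indent + 4)}' for k, v in obj.items()]
    return _wrap('{', lines, '}', indent)


@dump.register(list)
def _(obj, indent=4):
    pad = ' ' * indent
    lines = [pad + dump(v, indent + 4) for v in obj]
    return _wrap('[', lines, ']', indent)


# Dispatch follows the class hierarchy, so a defaultdict uses the dict handler.
print(dump(defaultdict(list, {1: [2]})))
```

Supporting another type then means registering one more small function instead of adding a branch to the main one.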

Your enclosures[type(obj)] will only work if the objects are exactly of the base types contained in the dict. Many objects are instances of subclasses of those base classes; for instance, Counter and defaultdict objects are instances of dict, but their type is not dict, so looking up the value by type directly raises KeyError. And I did mention that I use this function to serialize nested data structures, most frequently nested dictionaries, and I always use defaultdict for convenience.
– Ξένη Γήινος, Commented Oct 21, 2021 at 11:15

AJNeufeld · Accepted Answer · 2021-10-21 16:28:02Z


Bugs

Sets

Sets and dictionaries use identical delimiters: { ... }. Dictionaries are distinguished from sets by the presence of colon (:) separated key-value pairs.

Problem: Both an empty set and an empty dictionary are rendered by your code as {}.

>>> eval(represent(set()))
{}

>>> eval(represent(set())) == set()
False

Tuples

A tuple is a comma-separated list of values in parentheses.

Problem: (1) is a Python expression evaluating to 1, not a tuple; the Python syntax requires a trailing comma (1,) to ensure this is interpreted as a tuple.

>>> eval(represent((1,)))
1

>>> eval(represent((1,))) == (1,)
False

Conclusion

Your test cases are large, complex objects, but are missing important edge cases like:

empty sets,
empty tuples,
single element tuples,
string elements containing single & double quotes, backslashes,
...
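A sketch of how the set and tuple edge cases could be handled and checked. The helper represent_flat is hypothetical and covers only non-nested sets and tuples. It assumes Python 3.9 or later, where ast.literal_eval accepts the text set().

```python
import ast


def represent_flat(obj, indent=4):
    """Hypothetical helper covering only non-nested set and tuple edge cases."""
    pad = ' ' * indent
    if isinstance(obj, set):
        if not obj:
            return 'set()'  # '{}' would be read back as an empty dict
        start, end = '{', '}'
    elif isinstance(obj, tuple):
        if not obj:
            return '()'
        start, end = '(', ')'
    else:
        raise TypeError('this sketch supports only set and tuple')
    body = ',\n'.join(pad + repr(item) for item in obj)
    if isinstance(obj, tuple) and len(obj) == 1:
        body += ','  # the trailing comma keeps a one-element tuple a tuple
    return f'{start}\n{body}\n{end}'


# Round-trip the edge cases listed above.
edge_cases = [set(), (), (1,), {"it's"}, ('back\\slash', 'say "hi"')]
for case in edge_cases:
    restored = ast.literal_eval(represent_flat(case))
    assert restored == case and type(restored) is type(case)
```

Quotes and backslashes need no special handling here because repr already escapes them.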

edited Oct 21, 2021 at 16:28

answered Oct 21, 2021 at 16:19

AJNeufeld


Well, I have to admit these things do have a potential to cause problems, they can only cause minor problems. I can't imagine why anyone would want to use my function to dump those extremely simple objects as strings rather than use those short expressions directly, I intend my function to dump complex objects and it does that well, and clearly I can't cover all possible use cases...
– Ξένη Γήινος, Commented Oct 21, 2021 at 16:57

A complex object may contain extremely simple elements, such as a dictionary of sets, where one of those sets happens to be empty, or a dictionary of tuples where some of those tuples might be of length 1. In these cases an empty set would become an empty dictionary, which is not a problem if the recovered value is only read/tested and never modified. The single element tuple becoming just the element is a problem, since a level of nesting has vanished, so code reading in the complex object will break using that element of the complex object.
– AJNeufeld, Commented Oct 21, 2021 at 17:06

...clearly I can't cover all possible use cases.... Those are not different use cases, but edge cases (as @AJNeufeld points out in his answer). Designing for certain use cases and therefore ignoring other cases is fine, while ignoring edge cases can (and will) usually lead to problems in most use cases.
– riskypenguin, Commented Oct 21, 2021 at 17:09

Simple test cases are designed to ensure a failure in an edge case is understandable. I could have given var = {(1,2): ({"doesn't work": set()},)}, and eval(represent(var)) == var would return False, but it wouldn't be clear what the problem was. Is it the tuple used as the dictionary key, the tuple value, the string key with the embedded single quote, or the set value? len(var[1,2][0]["doesn't work"]) is a valid expression, but with res = eval(represent(var)) the equivalent expression len(res[1,2][0]["doesn't work"]) raises an exception: bonus points if you can guess exactly what.
– AJNeufeld, Commented Oct 21, 2021 at 17:40
