I am reinventing the wheel by writing a function that serializes a (nested) data structure in a human-readable way. The default output is deliberately similar to that of json.dumps(var, indent=4), and I did my best to mimic the format of json's output.
But the output is fundamentally different from json. More specifically, the data types of the dictionary keys are preserved: any value that is a valid dictionary key is written unchanged. For instance, int keys won't become str, and tuple keys are allowed in this format, whereas in json tuple keys are impossible. Dictionary keys themselves are written on one line with repr rather than expanded and indented, because I don't use nested data structures as keys.
All values and elements also retain their data types; for instance, True, False and None won't become true, false and null.
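To make the contrast with json concrete, here is a minimal sketch that uses only the standard library. The sample dictionaries are made up for illustration and are not part of the function above.

```python
import json

# Sample data made up for illustration.
int_keys = {1: True, 2: None}
tuple_keys = {(0, 1): 'a'}

# json coerces int keys to strings and rewrites True/None as true/null.
dumped = json.dumps(int_keys)
print(dumped)                      # {"1": true, "2": null}
print(json.loads(dumped))          # {'1': True, '2': None}: keys are now str

# Tuple keys are rejected outright.
try:
    json.dumps(tuple_keys)
except TypeError as exc:
    print('TypeError:', exc)

# repr keeps key types and Python literals as they are.
print(repr(int_keys))              # {1: True, 2: None}
print(repr(tuple_keys))            # {(0, 1): 'a'}
```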
I built my function using repr, but this is not repr. repr supports every data type, its output is not indented, and it leaves the data types of the containers themselves unchanged. This function, by contrast, supports only the built-in container data types (dict, frozenset, list, set, tuple) and their subclasses. Containers are "promoted" to the supported data type they inherit from, so an instance of a subclass is written as its built-in base type. Only values that aren't considered containers (data types that can't be nested) keep their own type in the output. And the output is far more human-readable than that of repr.
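The "promotion" of subclasses can be illustrated with a small sketch. `base_container_type` is a hypothetical helper introduced here for illustration; it only mirrors the isinstance checks described above and is not part of the posted function.

```python
from collections import OrderedDict, namedtuple

# Same built-in container types as listed in the question.
SUPPORTED = (dict, frozenset, list, set, tuple)

def base_container_type(obj):
    """Return the built-in container type that obj is an instance of.

    Hypothetical helper for illustration; returns None for non-containers.
    """
    for container_type in SUPPORTED:
        if isinstance(obj, container_type):
            return container_type
    return None

Point = namedtuple('Point', 'x y')  # a tuple subclass

print(base_container_type(OrderedDict(a=1)))  # <class 'dict'>
print(base_container_type(Point(1, 2)))       # <class 'tuple'>
print(base_container_type('text'))            # None

# repr keeps the subclass name instead of writing a plain tuple.
print(repr(Point(1, 2)))                      # Point(x=1, y=2)
```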
I wrote this function to serialize nested dictionaries with int and tuple keys in a human-readable form, so that I can store the data on the hard drive and load it later using ast.literal_eval. I want to allow int keys and I have to store dicts with tuple keys.
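The intended round trip can be sketched with plain repr standing in for the serializer, since ast.literal_eval accepts any text in Python literal syntax. The sample data is made up for illustration. One caveat worth knowing: frozenset({...}) is a function call rather than a literal, so ast.literal_eval rejects it.

```python
import ast

# Sample data made up for illustration: int and tuple keys, Python literals.
data = {1: {(0, 1): [True, None]}, 2: {'x', 'y'}}

text = repr(data)                 # any text in Python literal syntax works
loaded = ast.literal_eval(text)
print(loaded == data)             # True
print(type(next(iter(loaded))))   # <class 'int'>

# Caveat: frozenset({...}) is a call, not a literal,
# so ast.literal_eval refuses it.
try:
    ast.literal_eval(repr(frozenset({1, 2})))
except ValueError as exc:
    print('ValueError:', exc)
```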
Code
from typing import Union

def represent(obj: Union[dict, frozenset, list, set, tuple], indent: int=4) -> str:
    supported = (dict, frozenset, list, set, tuple)
    singles = (frozenset, list, set, tuple)
    if not isinstance(obj, supported):
        raise TypeError('argument `obj` should be an instance of a built-in container data type')
    if not isinstance(indent, int):
        raise TypeError('argument `indent` should be an `int`')
    if indent <= 0:
        raise ValueError('argument `indent` should be greater than 0')
    if indent % 4:
        raise ValueError('argument `indent` should be a multiple of 4')
    ls = list()
    if isinstance(obj, dict):
        start, end = '{}'
        for k, v in sorted(obj.items()):
            if not isinstance(v, supported):
                item = ' '*indent + repr(k) + ': ' + repr(v)
            else:
                item = ' '*indent + repr(k) + ': ' + represent(v, indent+4)
            ls.append(item)
    elif isinstance(obj, singles):
        enclosures = {
            0: ('frozenset({', '})'),
            1: '[]', 2: '{}', 3: '()'
        }
        index = 0
        for i in singles:
            if isinstance(obj, i):
                break
            index += 1
        start, end = enclosures[index]
        if index in (0, 2):
            obj = sorted(obj)
        for i in obj:
            if not isinstance(i, supported):
                item = ' '*indent + repr(i)
            else:
                item = represent(i, indent+4)
            ls.append(item)
    return start + '\n' + ',\n'.join(ls) + '\n' + ' ' * (indent - 4) + end
Example
import json

var = {1: {1: {1: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0},
               2: {1: {1: 0}, 2: 0},
               3: {1: 0}},
           2: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0},
           3: {1: {1: 0}, 2: 0}},
       2: {1: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0},
           2: {1: {1: 0}, 2: 0},
           3: {1: 0}},
       3: {1: {1: {1: 0}, 2: 0}, 2: {1: 0}, 3: 0}}

dumped = json.dumps(var, indent=4)
repred = represent(var)
print('dumped:')
print(dumped)
print('repred:')
print(repred)
print(f'{(eval(dumped) == var)=}')
print(f'{(eval(repred) == var)=}')

var1 = {'a': {'a': {'a': [0, 0, 0], 'b': [0, 0, 1], 'c': [0, 0, 2]},
              'b': {'a': [0, 0, 1], 'b': [0, 1, 1], 'c': [0, 1, 2]},
              'c': {'a': [0, 0, 2], 'b': [0, 1, 2], 'c': [0, 2, 2]}},
        'b': {'a': {'a': [0, 0, 1], 'b': [0, 1, 1], 'c': [0, 1, 2]},
              'b': {'a': [0, 1, 1], 'b': [1, 1, 1], 'c': [1, 1, 2]},
              'c': {'a': [0, 1, 2], 'b': [1, 1, 2], 'c': [1, 2, 2]}},
        'c': {'a': {'a': [0, 0, 2], 'b': [0, 1, 2], 'c': [0, 2, 2]},
              'b': {'a': [0, 1, 2], 'b': [1, 1, 2], 'c': [1, 2, 2]},
              'c': {'a': [0, 2, 2], 'b': [1, 2, 2], 'c': [2, 2, 2]}}}
print(represent(var1))
dumped:
{
    "1": {
        "1": {
            "1": {
                "1": {
                    "1": {
                        "1": 0
                    },
                    "2": 0
                },
                "2": {
                    "1": 0
                },
                "3": 0
            },
            "2": {
                "1": {
                    "1": 0
                },
                "2": 0
            },
            "3": {
                "1": 0
            }
        },
        "2": {
            "1": {
                "1": {
                    "1": 0
                },
                "2": 0
            },
            "2": {
                "1": 0
            },
            "3": 0
        },
        "3": {
            "1": {
                "1": 0
            },
            "2": 0
        }
    },
    "2": {
        "1": {
            "1": {
                "1": {
                    "1": 0
                },
                "2": 0
            },
            "2": {
                "1": 0
            },
            "3": 0
        },
        "2": {
            "1": {
                "1": 0
            },
            "2": 0
        },
        "3": {
            "1": 0
        }
    },
    "3": {
        "1": {
            "1": {
                "1": 0
            },
            "2": 0
        },
        "2": {
            "1": 0
        },
        "3": 0
    }
}
repred:
{
    1: {
        1: {
            1: {
                1: {
                    1: {
                        1: 0
                    },
                    2: 0
                },
                2: {
                    1: 0
                },
                3: 0
            },
            2: {
                1: {
                    1: 0
                },
                2: 0
            },
            3: {
                1: 0
            }
        },
        2: {
            1: {
                1: {
                    1: 0
                },
                2: 0
            },
            2: {
                1: 0
            },
            3: 0
        },
        3: {
            1: {
                1: 0
            },
            2: 0
        }
    },
    2: {
        1: {
            1: {
                1: {
                    1: 0
                },
                2: 0
            },
            2: {
                1: 0
            },
            3: 0
        },
        2: {
            1: {
                1: 0
            },
            2: 0
        },
        3: {
            1: 0
        }
    },
    3: {
        1: {
            1: {
                1: 0
            },
            2: 0
        },
        2: {
            1: 0
        },
        3: 0
    }
}
(eval(dumped) == var)=False
(eval(repred) == var)=True
{
    'a': {
        'a': {
            'a': [
                0,
                0,
                0
            ],
            'b': [
                0,
                0,
                1
            ],
            'c': [
                0,
                0,
                2
            ]
        },
        'b': {
            'a': [
                0,
                0,
                1
            ],
            'b': [
                0,
                1,
                1
            ],
            'c': [
                0,
                1,
                2
            ]
        },
        'c': {
            'a': [
                0,
                0,
                2
            ],
            'b': [
                0,
                1,
                2
            ],
            'c': [
                0,
                2,
                2
            ]
        }
    },
    'b': {
        'a': {
            'a': [
                0,
                0,
                1
            ],
            'b': [
                0,
                1,
                1
            ],
            'c': [
                0,
                1,
                2
            ]
        },
        'b': {
            'a': [
                0,
                1,
                1
            ],
            'b': [
                1,
                1,
                1
            ],
            'c': [
                1,
                1,
                2
            ]
        },
        'c': {
            'a': [
                0,
                1,
                2
            ],
            'b': [
                1,
                1,
                2
            ],
            'c': [
                1,
                2,
                2
            ]
        }
    },
    'c': {
        'a': {
            'a': [
                0,
                0,
                2
            ],
            'b': [
                0,
                1,
                2
            ],
            'c': [
                0,
                2,
                2
            ]
        },
        'b': {
            'a': [
                0,
                1,
                2
            ],
            'b': [
                1,
                1,
                2
            ],
            'c': [
                1,
                2,
                2
            ]
        },
        'c': {
            'a': [
                0,
                2,
                2
            ],
            'b': [
                1,
                2,
                2
            ],
            'c': [
                2,
                2,
                2
            ]
        }
    }
}
I am mainly concerned about performance and memory consumption, and I want the function to execute as fast as possible while utilizing as little RAM as possible. How can it be more efficient?
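One way to ground the efficiency question is to measure both time and peak memory before changing anything. The following is a minimal measurement sketch: `measure` is a hypothetical helper introduced here, the sample data is made up, and json.dumps and repr are only stand-ins for the function under test.

```python
import json
import timeit
import tracemalloc

def measure(serialize, data, repeats=5, number=100):
    """Return (best seconds per call, peak bytes for one call).

    Hypothetical helper; `serialize` is any callable taking `data`.
    """
    best = min(timeit.repeat(lambda: serialize(data),
                             repeat=repeats, number=number)) / number
    tracemalloc.start()
    serialize(data)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak

# Made-up nested sample; json.dumps and repr stand in for the function under test.
sample = {str(i): {str(j): [0, 1, 2] for j in range(20)} for i in range(20)}
for name, func in (('json.dumps', lambda d: json.dumps(d, indent=4)),
                   ('repr', repr)):
    seconds, peak = measure(func, sample)
    print(f'{name}: {seconds * 1e6:.1f} us per call, peak {peak} bytes')
```

Passing the serializer as a callable keeps the harness independent of any particular implementation, so two versions of the same function can be compared on identical data.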
Update
Sorting the items while serializing dictionaries does introduce a bug: it somehow changes the data represented, breaking the original association between the key-value pairs. This is definitely not intended.
Removing the sorted calls eliminates the bug. I have fixed my copy of the code, but since this is Code Review and I have already received answers, I won't edit the code posted above (lest the update be rolled back); I am pointing the problem out here instead.
More specifically,
This snippet:
    if isinstance(obj, dict):
        start, end = '{}'
        for k, v in sorted(obj.items()):
            if not isinstance(v, supported):
                item = ' '*indent + repr(k) + ': ' + repr(v)
            else:
                item = ' '*indent + repr(k) + ': ' + represent(v, indent+4)
            ls.append(item)
MUST be changed to:
    if isinstance(obj, dict):
        start, end = '{}'
        for k, v in obj.items():
            if not isinstance(v, supported):
                item = ' '*indent + repr(k) + ': ' + repr(v)
            else:
                item = ' '*indent + repr(k) + ': ' + represent(v, indent+4)
            ls.append(item)
I now think that sorting collections while serializing them is a terrible practice. I don't know whether this bug also affects nested frozensets (sets themselves can't be nested because they are mutable and therefore unhashable, but frozensets can be nested inside sets and frozensets); I haven't tested that yet. Still, I recommend dropping the sorted calls on frozenset and set too.
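A few effects of sorting during serialization can be checked in isolation. The sketch below uses made-up sample data and only shows behaviour of the built-in sorted; it is not a reproduction of the exact bug described above.

```python
# Made-up sample: insertion order differs from sorted order.
d = {3: 'c', 1: 'a', 2: 'b'}

pairs = sorted(d.items())
print(pairs)                         # [(1, 'a'), (2, 'b'), (3, 'c')]
print(dict(pairs) == d)              # True: dict equality ignores order
print(list(dict(pairs)) == list(d))  # False: insertion order is lost

# Keys of different types cannot be ordered at all.
try:
    sorted({1: 'a', 'x': 'b'}.items())
except TypeError as exc:
    print('TypeError:', exc)

# For frozensets, < means "proper subset", which is only a partial order,
# so sorted() has no meaningful total order to work with.
a, b = frozenset({1}), frozenset({2})
print(a < b, b < a, a == b)          # False False False
```

Iterating the containers in their own order avoids all three issues, at the cost of set output no longer being deterministic across runs.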
Update
Please review the latest version: Serializing (nested) data structures in a human-readable format with all bugs fixed
Rename the `indent` parameter of the `represent` function to `indentLevel` or `indentCount` or something along those lines, and define a function `createIndent(indentLevel: int) -> str` which returns `' ' * 4 * indentLevel`. (The naming is perhaps not the best yet, but you get the idea.) Then replace every `' '*indent` in your code with a call to that function, or call the function once and store the result in a variable.
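The suggestion in this comment could look roughly like the sketch below. `createIndent` and `indentLevel` are the names proposed in the comment; `INDENT_WIDTH` is a name introduced here for illustration.

```python
INDENT_WIDTH = 4  # spaces per level; assumed constant for this sketch

def createIndent(indentLevel: int) -> str:
    """Return the leading whitespace for the given nesting level."""
    return ' ' * INDENT_WIDTH * indentLevel

# Call once per nesting level and reuse the result.
prefix = createIndent(2)
print(len(prefix))                        # 8
print(prefix + repr(1) + ': ' + repr(0))  # eight spaces, then 1: 0
```

With a level counter the recursive call would pass `indentLevel + 1` instead of `indent + 4`, which would also make the "multiple of 4" check unnecessary.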