Some suggestions for you:
- Avoid global code
- Make constants capitalized
- Use tuples instead of lists for immutable constants
- The standard terminology for the opposite of "header" is "footer", not "trailer"
- Given your description of scale, this is a very parallelizable problem and could easily be framed as a standard Python multi-processing program
- The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
- I have assumed that you wish to remain printing the dictionary out to
stdout, in which casepprintis more appropriate. If you want to serialize this to JSON, that is trivial using thejsonmodule. - I have assumed that in the case of repeated groups, they are aggregated to a list of lists with no regard for uniqueness
- In the other answer, the suggestion is good to pass the result of
zipdirectly to thedictconstructor. Basically: this takes two iterables, iterates over both of them at the same time; uses one as the key and the other as the value; and assumes that the order of the key iterable matches the order of the value iterable.
The suggested code:
from collections import defaultdict
from pprint import pprint
from typing import Iterable, List, Sequence
HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}
def parse(fn: str) -> Iterable[List[str]]:
with open(fn) as f:
yield from (
line.rstrip().split('|')
for line in f
)
def load(lines: Iterable[Sequence[str]]) -> dict:
lines = iter(lines)
heads = next(lines)
prev_line = next(lines)
groups = defaultdict(list)
for line in lines:
group, *entries = prev_line
groups[group].append(dict(zip(GROUPS[group], entries)))
prev_line = line
return {
'header': dict(zip(HEADER_NAMES, heads)),
'footer': dict(zip(FOOTER_NAMES, prev_line)),
'groups': groups,
}
if __name__ == '__main__':
d = load(parse('file1.usr'))
pprint(d)
This produces:
{'footer': {'FootKey1': 'Footer1',
'FootKey2': 'Footer2',
'FootKey3': 'Footer3'},
'groups': defaultdict(<class 'list'>,
{'A': [{'A1ValueKey': 'Entry1',
'A2ValueKey': 'Entry2',
'A3ValueKey': 'Entry3'}],
'B': [{'B1ValueKey': 'Entry1',
'B2ValueKey': 'Entry2',
'B3ValueKey': 'Entry3'},
{'B1ValueKey': 'Entry4',
'B2ValueKey': 'Entry5',
'B3ValueKey': 'Entry6'}]}),
'header': {'HeaderKey1': 'Header1',
'HeaderKey2': 'Header2',
'HeaderKey3': 'Header3'}}