Revisions to What is an efficient ways to parse a bar separated usr file in Python

Scale, and repeated group aggregation

edited Apr 30, 2020 at 13:52

I still do not understand your data format. You only answered about half of my questions, so I'm going to go out on a limb a bit.

Some suggestions for you:

Avoid global code
Make constants capitalized
Use tuples instead of lists for immutable constants
The standard terminology for the opposite of "header" is "footer", not "trailer"
Given your description of scale, this is a very parallelizable problem and could easily be framed as a standard Python multi-processing program
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, they are aggregated to a list of dicts (one per occurrence) with no regard for uniqueness
In the other answer, the suggestion is good to pass the result of zip directly to the dict constructor. Basically: this takes two iterables, iterates over both of them at the same time; uses one as the key and the other as the value; and assumes that the order of the key iterable matches the order of the value iterable.

from collections import defaultdict
from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = defaultdict(list)

    for line in lines:
        group, *entries = prev_line
        groups[group].append(dict(zip(GROUPS[group], entries)))
        prev_line = line

    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)

This produces:

{'footer': {'FootKey1': 'Footer1',
            'FootKey2': 'Footer2',
            'FootKey3': 'Footer3'},
 'groups': defaultdict(<class 'list'>,
                       {'A': [{'A1ValueKey': 'Entry1',
                               'A2ValueKey': 'Entry2',
                               'A3ValueKey': 'Entry3'}],
                        'B': [{'B1ValueKey': 'Entry1',
                               'B2ValueKey': 'Entry2',
                               'B3ValueKey': 'Entry3'},
                              {'B1ValueKey': 'Entry4',
                               'B2ValueKey': 'Entry5',
                               'B3ValueKey': 'Entry6'}]}),
 'header': {'HeaderKey1': 'Header1',
            'HeaderKey2': 'Header2',
            'HeaderKey3': 'Header3'}}
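
For the JSON route mentioned in the list above, here is a minimal sketch. It assumes the dictionary has the shape shown in the output; `d` is rebuilt by hand as a stand-in for the result of `load()`, and the `indent` value is only an illustrative choice.

```python
import json
from collections import defaultdict

# Hand-built stand-in for the result of load(); values copied from the
# sample output, so this runs without the input file.
groups = defaultdict(list)
groups['A'].append({'A1ValueKey': 'Entry1', 'A2ValueKey': 'Entry2', 'A3ValueKey': 'Entry3'})
groups['B'].append({'B1ValueKey': 'Entry1', 'B2ValueKey': 'Entry2', 'B3ValueKey': 'Entry3'})
groups['B'].append({'B1ValueKey': 'Entry4', 'B2ValueKey': 'Entry5', 'B3ValueKey': 'Entry6'})

d = {
    'header': {'HeaderKey1': 'Header1', 'HeaderKey2': 'Header2', 'HeaderKey3': 'Header3'},
    'footer': {'FootKey1': 'Footer1', 'FootKey2': 'Footer2', 'FootKey3': 'Footer3'},
    'groups': groups,
}

# defaultdict is a dict subclass, so json writes it out as an ordinary object.
text = json.dumps(d, indent=4)
print(text)

# Reading it back gives plain dicts and lists.
restored = json.loads(text)
```

After a round trip the `defaultdict` comes back as a plain `dict`, which is usually what you want for data that is only being read.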

I still do not understand your data format. You only answered about half of my questions, so I'm going to go out on a limb a bit.

Some suggestions for you:

Avoid global code
Make constants capitalized
Use tuples instead of lists for immutable constants
The standard terminology for the opposite of "header" is "footer", not "trailer"
Since you did not provide any indication of scale, I offer a simple, pure-Python implementation without a whole lot of regard to performance.
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, the last one wins and overwrites any former entries for the same group
In the other answer, the suggestion is good to pass the result of zip directly to the dict constructor. Basically: this takes two iterables, iterates over both of them at the same time; uses one as the key and the other as the value; and assumes that the order of the key iterable matches the order of the value iterable.

from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = {}

    for line in lines:
        group, *entries = prev_line
        groups[group] = dict(zip(GROUPS[group], entries))
        prev_line = line

    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)

Some suggestions for you:

Avoid global code
Make constants capitalized
Use tuples instead of lists for immutable constants
The standard terminology for the opposite of "header" is "footer", not "trailer"
Given your description of scale, this is a very parallelizable problem and could easily be framed as a standard Python multi-processing program
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, they are aggregated to a list of dicts (one per occurrence) with no regard for uniqueness
In the other answer, the suggestion is good to pass the result of zip directly to the dict constructor. Basically: this takes two iterables, iterates over both of them at the same time; uses one as the key and the other as the value; and assumes that the order of the key iterable matches the order of the value iterable.

from collections import defaultdict
from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = defaultdict(list)

    for line in lines:
        group, *entries = prev_line
        groups[group].append(dict(zip(GROUPS[group], entries)))
        prev_line = line

    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)

This produces:

{'footer': {'FootKey1': 'Footer1',
            'FootKey2': 'Footer2',
            'FootKey3': 'Footer3'},
 'groups': defaultdict(<class 'list'>,
                       {'A': [{'A1ValueKey': 'Entry1',
                               'A2ValueKey': 'Entry2',
                               'A3ValueKey': 'Entry3'}],
                        'B': [{'B1ValueKey': 'Entry1',
                               'B2ValueKey': 'Entry2',
                               'B3ValueKey': 'Entry3'},
                              {'B1ValueKey': 'Entry4',
                               'B2ValueKey': 'Entry5',
                               'B3ValueKey': 'Entry6'}]}),
 'header': {'HeaderKey1': 'Header1',
            'HeaderKey2': 'Header2',
            'HeaderKey3': 'Header3'}}
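
To make the multi-processing remark concrete, here is a rough sketch, not a tuned implementation. It assumes the work is many independent files, each small enough to read into memory at once; `load_text`, `load_file` and `load_many` are hypothetical names that are not part of the code above, and the per-file logic is a compacted version of `parse` plus `load`.

```python
from collections import defaultdict
from multiprocessing import Pool
from typing import Iterable, List, Optional

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def load_text(text: str) -> dict:
    """Turn the full text of one file into the nested dictionary."""
    rows = [line.rstrip().split('|') for line in text.splitlines()]
    # First row is the header, last row is the footer, the rest are groups.
    heads, *body, foot = rows
    groups = defaultdict(list)
    for group, *entries in body:
        groups[group].append(dict(zip(GROUPS[group], entries)))
    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, foot)),
        'groups': dict(groups),
    }


def load_file(fn: str) -> dict:
    """Worker: read and parse a single file inside one process."""
    with open(fn) as f:
        return load_text(f.read())


def load_many(filenames: Iterable[str],
              processes: Optional[int] = None) -> List[dict]:
    """Parse files in parallel; results come back in input order."""
    # Call this from under an `if __name__ == '__main__':` guard, since
    # worker processes may re-import this module when they start.
    with Pool(processes) as pool:
        return pool.map(load_file, filenames)
```

Usage would be something like `load_many(['file1.usr', 'file2.usr'])`, where the second file name is made up. Whether this beats a single process depends on file sizes and on the cost of sending the result dictionaries back to the parent, so it is worth measuring.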

added 81 characters in body

edited Apr 29, 2020 at 21:05

Reinderien

Avoid global code
Make constants capitalized
Use tuples instead of lists for immutable constants

The standard terminology for the opposite of "header" is "footer", not "trailer"
Since you did not provide any indication of scale, I offer a simple, pure-Python implementation without a whole lot of regard to performance.
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, the last one wins and overwrites any former entries for the same group

In the other answer, the suggestion is good to pass the result of zip directly to the dict constructor. Basically: this takes two iterables, iterates over both of them at the same time; uses one as the key and the other as the value; and assumes that the order of the key iterable matches the order of the value iterable.

from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = {}

    for line in lines:
        group, *entries = prev_line
        groups[group] = dict(zip(GROUPS[group], entries))
        prev_line = line

    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)

Avoid global code
Make constants capitalized
The standard terminology for the opposite of "header" is "footer", not "trailer"
Since you did not provide any indication of scale, I offer a simple, pure-Python implementation without a whole lot of regard to performance.
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, the last one wins and overwrites any former entries for the same group

from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = {}

    for line in lines:
        group, *entries = prev_line
        groups[group] = {
            k: e
            for k, e in zip(GROUPS[group], entries)
        }
        prev_line = line

    return {
        'header': {k: h for k, h in zip(HEADER_NAMES, heads)},
        'footer': {k: f for k, f in zip(FOOTER_NAMES, prev_line)},
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)

Avoid global code
Make constants capitalized
Use tuples instead of lists for immutable constants

The standard terminology for the opposite of "header" is "footer", not "trailer"
Since you did not provide any indication of scale, I offer a simple, pure-Python implementation without a whole lot of regard to performance.
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, the last one wins and overwrites any former entries for the same group

In the other answer, the suggestion is good to pass the result of zip directly to the dict constructor. Basically: this takes two iterables, iterates over both of them at the same time; uses one as the key and the other as the value; and assumes that the order of the key iterable matches the order of the value iterable.

from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = {}

    for line in lines:
        group, *entries = prev_line
        groups[group] = dict(zip(GROUPS[group], entries))
        prev_line = line

    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)
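
As a small illustration of the `dict(zip(...))` point, the sketch below uses one group's keys from `GROUPS` with made-up entry values. It also shows a caveat worth knowing: `zip` stops at the shorter iterable, so a line with too few fields loses keys without raising an error.

```python
keys = ('A1ValueKey', 'A2ValueKey', 'A3ValueKey')
entries = ['Entry1', 'Entry2', 'Entry3']

# Pair each key with the entry at the same position.
paired = dict(zip(keys, entries))

# The comprehension form builds the same dictionary, just more verbosely.
same = {k: e for k, e in zip(keys, entries)}

# zip stops at the shorter input: a one-field line yields a one-key dict.
short = dict(zip(keys, ['Entry1']))

print(paired)
print(short)
```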

added 81 characters in body

edited Apr 29, 2020 at 20:59

Reinderien

I still do not understand your data format. You only answered about half of my questions, so I'm going to go out on a limb a bit.

Some suggestions for you:

Avoid global code
Make constants capitalized
The standard terminology for the opposite of "header" is "footer", not "trailer"
Since you did not provide any indication of scale, I offer a simple, pure-Python implementation without a whole lot of regard to performance.
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
I have assumed that in the case of repeated groups, the last one wins and overwrites any former entries for the same group

The suggested code:

from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = {}

    for line in lines:
        group, *entries = prev_line
        groups[group] = {
            k: e
            for k, e in zip(GROUPS[group], entries)
        }
        prev_line = line

    return {
        'header': {k: h for k, h in zip(HEADER_NAMES, heads)},
        'footer': {k: f for k, f in zip(FOOTER_NAMES, prev_line)},
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)
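
Because `load` only needs an iterable of already-split rows, it can be tried without a file. The sketch below repeats this revision's `load` so that it runs on its own and feeds it in-memory rows. The rows are an assumption about the file layout (header row, group rows keyed by their first field, footer row), since the input file is not reproduced here. It also shows the last-one-wins assumption for repeated groups.

```python
from typing import Iterable, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)
    groups = {}
    # One-row lookahead: a row is stored as a group only once a later row
    # exists, so the final row is left over in prev_line as the footer.
    for line in lines:
        group, *entries = prev_line
        groups[group] = {k: e for k, e in zip(GROUPS[group], entries)}
        prev_line = line
    return {
        'header': {k: h for k, h in zip(HEADER_NAMES, heads)},
        'footer': {k: f for k, f in zip(FOOTER_NAMES, prev_line)},
        'groups': groups,
    }


# Assumed sample rows, already split on '|'.
rows = [
    ['Header1', 'Header2', 'Header3'],
    ['A', 'Entry1', 'Entry2', 'Entry3'],
    ['B', 'Entry1', 'Entry2', 'Entry3'],
    ['B', 'Entry4', 'Entry5', 'Entry6'],
    ['Footer1', 'Footer2', 'Footer3'],
]
d = load(rows)
# Only the second 'B' row survives, because it overwrites the first.
print(d['groups']['B'])
```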

I still do not understand your data format. You only answered about half of my questions, so I'm going to go out on a limb a bit.

Some suggestions for you:

Avoid global code
Make constants capitalized
The standard terminology for the opposite of "header" is "footer", not "trailer"
Since you did not provide any indication of scale, I offer a simple, pure-Python implementation without a whole lot of regard to performance.
The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
I have assumed that you wish to remain printing the dictionary out to stdout, in which case pprint is more appropriate
I have assumed that in the case of repeated groups, the last one wins and overwrites any former entries for the same group

The suggested code:

from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)
    prev_line = next(lines)

    groups = {}

    for line in lines:
        group, *entries = prev_line
        groups[group] = {
            k: e
            for k, e in zip(GROUPS[group], entries)
        }
        prev_line = line

    return {
        'header': {k: h for k, h in zip(HEADER_NAMES, heads)},
        'footer': {k: f for k, f in zip(FOOTER_NAMES, prev_line)},
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)


answered Apr 29, 2020 at 20:48

Reinderien
