5
\$\begingroup\$

This simple task of sorting a list of related RAR files of a split archive turned out to be more cumbersome than expected, since there are two versions of the RAR naming scheme, v3 and v5.

  • v3 is like file.rar, file.r00, file.r01 (fixed two digit numbering, except for .rar which is the first part)
  • v5 is like file.part1.rar, file.part2.rar, file.part3.rar (if necessary, the numbering will include additional leading zeros, i.e. file.part01.rar)

The function should accept both versions, make a basic sanity check and sort them correctly.

There is one thing missing in the verification, that is checking for gaps in the list of files. But, this is perhaps not that useful since a missing file may be exactly the last one, and so this error can only be reliably diagnosed by actually testing the rar files via opening them using the unrar library (which makes no sense at this stage).

I know that the code uses a lot of map constructs, but I find it easier to reason with them. Also, the lambdas introduce a new scope which IMHO prevents subtle bugs in comparison to list comprehensions.

Using regexes might make this marginally easier, but probably not worth it.

from pathlib import Path


def rar_list_verify(rar_list: list[Path]) -> None:
    """Verify that the rar files belong to each other and have the correct suffixes"""
    if not all(
        map(
            lambda s: (
                len(s) == 4 and s[1] == "r" and s[2].isdigit() and s[3].isdigit()
            )
            or (s == ".rar"),
            set(map(lambda f: f.suffix, rar_list)),
        )
    ):
        raise ValueError("Invalid suffixes")

    if (
        len(
            set(
                map(
                    lambda f: f.parent / f.stem if f.suffix.startswith(".part") else f,
                    map(lambda f: f.parent / f.stem, rar_list),
                )
            )
        )
        != 1
    ):
        raise ValueError("Inconsistent files")


def rar_sort(rar_list: list[Path]) -> list[Path]:
    """Sort a list of rar files in the correct order, either v5 or v3"""

    rar_list_verify(rar_list)

    return list(
        map(
            lambda f: f[0],
            sorted(
                map(
                    lambda f: (
                        f,
                        f.stem,  # we know there is only one containing path
                        -1 if f.suffix == ".rar" else int(f.suffix[2:]),
                    ),
                    rar_list,
                ),
                key=lambda f: (f[1], f[2]),
            ),
        )
    )
\$\endgroup\$
1
  • \$\begingroup\$ Yikes! Name your functions already. A short anonymous lambda can be fine. But when not even black can make your code legible, that's a hint that maybe you have nested a bit too deeply. Call named helpers. Introduce the occasional temp variable holding an intermediate result, to break a giant computation down into bite sized pieces. // Write the occasional unit test. \$\endgroup\$ Commented Jan 27 at 15:58

2 Answers 2

6
\$\begingroup\$

I know that the code uses a lot of map constructs, but I find it easier to reason with them.

I don't, and to make a (perhaps bold) claim about the "typical programmer", I don't think that they would either. The functional-nesting map/lambda style is difficult to understand and maintain. I discourage this sort of code. It also loses the ability to generate meaningful messages describing the specific path that produces an error.

Also, the lambdas introduce a new scope which IMHO prevents subtle bugs in comparison to list comprehensions

I'm not sure what you mean by new scope in this context, but overall this seems incorrect. Lambdas have closure scope, so they have access to all of the symbols above them. The parameter is constrained to the scope of the lambda, but this is no different from variables of iteration in a comprehension which are also discarded outside of the comprehension.

Using regexes might make this marginally easier, but probably not worth it

Well, whether it's easier or not, you're currently not validating enough of the path sequence consistency, and regexes will do a more thorough job; so I do consider them worth it. They will be able to capture and verify a consistent stem, suffix and suffix index. pathlib is helpful but not helpful enough.

There is one thing missing in the verification, that is checking for gaps in the list of files. But this is perhaps not that useful since a missing file may be exactly the last one

Those are two different scenarios. A missing last-file will indeed be impossible to detect without attempting a decompress, but a gap anywhere else will be visible.

Write unit tests.

Your code would benefit from detecting and reporting the RAR naming variant used.

Validation Demo

import enum
import re
import typing
from pathlib import Path


class RarVersion(enum.IntEnum):
    AMBIGUOUS = 0
    V3 = 3
    V5 = 5


V3_PAT = re.compile(
r'''(?x)
    ^       # start
    (?P<stem>
        .+  # require a stem of at least one character
    )
    \.      # suffix separator dot
    (?P<suffix>
        rar |  # first literal 'rar' suffix
        r(?P<index>
            \d\d  # two-digit index
        )
    )
    $  # end
''')

V5_PAT = re.compile(
r'''(?x)
    ^       # start
    (?P<stem>
        .+  # require a stem of at least one character
    )
    \.part   # beginning of first suffix component
    (?P<index>
        \d+  # at least one digit
    )
    \.       # beginning of last suffix component
    (?P<suffix>
        rar  # last suffix component
    )
    $        # end
''')


def rar_list_verify(paths: typing.Sequence[str | Path]) -> RarVersion:
    if len(paths) == 0:
        # Since there is no non-indexed .rar, this must be interpreted as an "empty V5"
        return RarVersion.V5

    matches = [V5_PAT.match(str(p)) for p in paths]

    if any(m is None for m in matches):
        matches = [V3_PAT.match(str(p)) for p in paths]
        version = RarVersion.V3

        for path, match in zip(paths, matches):
            if match is None:
                raise ValueError(f'"{path}" does not match the version-3 pattern')
    elif len(paths) > 1:
        version = RarVersion.V5
    else:
        version = RarVersion.AMBIGUOUS

    stem = matches[0]['stem']
    for i, match in enumerate(matches[1:], start=1):
        if match['stem'] != stem:
            raise ValueError(f'{paths[i]} has an inconsistent stem')

    actual = {
        int(match['index'])
        for match in matches
        if match['index']
    }

    match version:
        case RarVersion.V3:
            base = 0
            count = len(paths) - 1
        case RarVersion.V5:
            base = 1
            count = len(paths)
        case RarVersion.AMBIGUOUS:
            # It's only possible for this to be a valid V5 if the only index is 1
            if actual == {1}:
                return version
            version = RarVersion.V3
            base = 0
            count = 0

    if version == RarVersion.V3:
        n_unnumbered = sum(
            1 for match in matches
            if match['suffix'] == 'rar'
        )
        if n_unnumbered != 1:
            raise ValueError(f'{n_unnumbered} paths have a non-indexed suffix; must be exactly one')

    if count > 0:
        expected = set(range(base, base + count))
        spurious = actual - expected
        if spurious:
            raise ValueError(
                'The following indices are unexpected: '
                + ', '.join(str(i) for i in spurious)
            )
        missing = expected - actual
        if missing:
            raise ValueError(
                'The following indices are missing: '
                + ', '.join(str(i) for i in missing)
            )

    return version


def test() -> None:
    assert rar_list_verify(
        ('a.part1.rar', 'a.part2.rar')
    ) == RarVersion.V5, 'Simple V5'

    assert rar_list_verify(
        ('a.rar', 'a.r00', 'a.r01')
    ) == RarVersion.V3, 'Simple V3'

    assert rar_list_verify(
        ('a.rar',)
    ) == RarVersion.V3, 'Almost ambiguous but cannot be V5'

    assert rar_list_verify(
        ('a.part1.rar',)
    ) == RarVersion.AMBIGUOUS, 'Actually ambiguous even though it is likely V5'

    assert rar_list_verify(
        ('a.part2.rar',)
    ) == RarVersion.V3, 'Invalid index forces this to be interpreted as V3'

    assert rar_list_verify(()) == RarVersion.V5, 'Empty input is only interpretable as a V5'

    try:
        rar_list_verify(('',))
        raise AssertionError('Bad format')
    except ValueError as e:
        assert str(e) == '"" does not match the version-3 pattern'

    try:
        rar_list_verify(('a.rar', 'b.r00'))
        raise AssertionError('Disparate stems')
    except ValueError as e:
        assert str(e) == 'b.r00 has an inconsistent stem'

    try:
        rar_list_verify(('a.r00', 'a.r01'))
        raise AssertionError('Missing non-indexed suffix')
    except ValueError as e:
        assert str(e) == '0 paths have a non-indexed suffix; must be exactly one'

    try:
        rar_list_verify(('a.rar', 'a.rar'))
        raise AssertionError('Duplicate non-indexed suffixes')
    except ValueError as e:
        assert str(e) == '2 paths have a non-indexed suffix; must be exactly one'

    try:
        rar_list_verify(('a.part0.rar', 'a.part1.rar'))
        raise AssertionError('V5 indexed from wrong base value')
    except ValueError as e:
        assert str(e) == 'The following indices are unexpected: 0'

    try:
        rar_list_verify(('a.part1.rar', 'a.part1.rar'))
        raise AssertionError('V5 missing an index')
    except ValueError as e:
        assert str(e) == 'The following indices are missing: 2'


if __name__ == '__main__':
    test()

Sorting Demo

To sort after parsing and validation, I propose that you expose an intermediate path object that remembers the index. This will better reuse the effort from the parse step.

import enum
import os
import re
import typing
from pathlib import Path


class RarVersion(enum.IntEnum):
    AMBIGUOUS = 0
    V3 = 3
    V5 = 5


V3_PAT = re.compile(
r'''(?x)
    ^       # start
    (?P<stem>
        .+  # require a stem of at least one character
    )
    \.      # suffix separator dot
    (?P<suffix>
        rar |  # first literal 'rar' suffix
        r(?P<index>
            \d\d  # two-digit index
        )
    )
    $  # end
''')

V5_PAT = re.compile(
r'''(?x)
    ^       # start
    (?P<stem>
        .+  # require a stem of at least one character
    )
    \.part   # beginning of first suffix component
    (?P<index>
        \d+  # at least one digit
    )
    \.       # beginning of last suffix component
    (?P<suffix>
        rar  # last suffix component
    )
    $        # end
''')


class RARPath(typing.NamedTuple):
    index: int
    path: str
    stem: str
    suffix: str

    @classmethod
    def from_match(cls, match: re.Match) -> typing.Self:
        return cls(
            index=-1 if match['index'] is None else int(match['index']),
            path=match.string,
            stem=match['stem'],
            suffix=match['suffix'],
        )

    def __str__(self) -> str:
        return self.path


def parse_rar_list(paths: typing.Sequence[str | Path]) -> tuple[RarVersion, list[RARPath]]:
    if len(paths) == 0:
        # Since there is no non-indexed .rar, this must be interpreted as an "empty V5"
        return RarVersion.V5, []

    matches = [V5_PAT.match(str(p)) for p in paths]

    if any(m is None for m in matches):
        matches = [V3_PAT.match(str(p)) for p in paths]
        version = RarVersion.V3

        for path, match in zip(paths, matches):
            if match is None:
                raise ValueError(f'"{path}" does not match the version-3 pattern')
    elif len(paths) > 1:
        version = RarVersion.V5
    else:
        version = RarVersion.AMBIGUOUS

    parsed = [RARPath.from_match(match) for match in matches]

    stem = parsed[0].stem
    for match in parsed[1:]:
        if match.stem != stem:
            raise ValueError(f'{match} has an inconsistent stem')

    actual = {match.index for match in parsed}

    match version:
        case RarVersion.V3:
            base = -1
        case RarVersion.V5:
            base = 1
        case RarVersion.AMBIGUOUS:
            # It's only possible for this to be a valid V5 if the only index is 1
            if actual == {1}:
                return version, parsed
            version = RarVersion.V3
            base = -1

            # This started as an ambiguous case where the index might have been part of a V5 suffix.
            # Since we've ruled that out, the actual index set is reinterpreted as the base only (-1).
            actual = {-1}

    if version == RarVersion.V3:
        n_unnumbered = sum(
            1 for match in parsed
            if match.suffix == 'rar'
        )
        if n_unnumbered != 1:
            raise ValueError(f'{n_unnumbered} paths have a non-indexed suffix; must be exactly one')

    expected = set(range(base, base + len(paths)))
    spurious = actual - expected
    if spurious:
        raise ValueError(
            'The following indices are unexpected: '
            + ', '.join(str(i) for i in spurious)
        )
    missing = expected - actual
    if missing:
        raise ValueError(
            'The following indices are missing: '
            + ', '.join(str(i) for i in missing)
        )

    return version, parsed


def rar_sort(paths: typing.Iterable[RARPath]) -> list[RARPath]:
    version, parsed = parse_rar_list(paths)
    return [path.path for path in sorted(parsed)]


def test_parse() -> None:
    assert parse_rar_list(
        ('a.part1.rar', 'a.part2.rar')
    )[0] == RarVersion.V5, 'Simple V5'

    assert parse_rar_list(
        ('a.rar', 'a.r00', 'a.r01')
    )[0] == RarVersion.V3, 'Simple V3'

    assert parse_rar_list(
        ('a.rar',)
    )[0] == RarVersion.V3, 'Almost ambiguous but cannot be V5'

    assert parse_rar_list(
        ('a.part1.rar',)
    )[0] == RarVersion.AMBIGUOUS, 'Actually ambiguous even though it is likely V5'

    assert parse_rar_list(
        ('a.part2.rar',)
    )[0] == RarVersion.V3, 'Invalid index forces this to be interpreted as V3'

    assert parse_rar_list(())[0] == RarVersion.V5, 'Empty input is only interpretable as a V5'

    try:
        parse_rar_list(('',))
        raise AssertionError('Bad format')
    except ValueError as e:
        assert str(e) == '"" does not match the version-3 pattern'

    try:
        parse_rar_list(('a.rar', 'b.r00'))
        raise AssertionError('Disparate stems')
    except ValueError as e:
        assert str(e) == 'b.r00 has an inconsistent stem'

    try:
        parse_rar_list(('a.r00', 'a.r01'))
        raise AssertionError('Missing non-indexed suffix')
    except ValueError as e:
        assert str(e) == '0 paths have a non-indexed suffix; must be exactly one'

    try:
        parse_rar_list(('a.rar', 'a.rar'))
        raise AssertionError('Duplicate non-indexed suffixes')
    except ValueError as e:
        assert str(e) == '2 paths have a non-indexed suffix; must be exactly one'

    try:
        parse_rar_list(('a.part0.rar', 'a.part1.rar'))
        raise AssertionError('V5 indexed from wrong base value')
    except ValueError as e:
        assert str(e) == 'The following indices are unexpected: 0'

    try:
        parse_rar_list(('a.part1.rar', 'a.part1.rar'))
        raise AssertionError('V5 missing an index')
    except ValueError as e:
        assert str(e) == 'The following indices are missing: 2'


def test_sort() -> None:
    assert rar_sort(('a.r00', 'a.rar', 'a.r01')) == [
        'a.rar', 'a.r00', 'a.r01',
    ], 'Simple v3 sort'

    assert rar_sort(('a.part2.rar', 'a.part1.rar')) == [
        'a.part1.rar', 'a.part2.rar',
    ]


if __name__ == '__main__':
    test_parse()
    test_sort()
\$\endgroup\$
2
  • 1
    \$\begingroup\$ 1.) "but this is no different from variables of iteration in a comprehension which are also discarded outside of the comprehension" shockingly this wasn't the case in the past! So people used map / lambda because the control variable of the list comprehension would indeed overwrite a variable of the same name. But this quirk has been fixed for a very long time. 2.) Since map is lazy, it is more performant than list comprehension when used with all like here. Of course, only if the negative case is common and building the objects of the list is costly, so not like here. \$\endgroup\$ Commented Jan 28 at 10:48
  • \$\begingroup\$ Thanks for all the work you put in, it's surprisingly complex. But one issue remains: V5 can also have filenames like archive.rar but only if it is not a split archive. So, IMHO if len(paths) == 1 it's always ambiguous. \$\endgroup\$ Commented Jan 29 at 18:35
5
\$\begingroup\$

I agree with the previous answer in that I also find the code hard to understand, mainly due to the usage of lambda and map.

Here are some of the things I do like:

  • You separated the code out into functions based on purpose
  • The functions are well-named
  • You added basic docstrings
  • You added helpful type hints
  • The usage of the string isdigit method

Documentation

You could add more details to the docstrings, elaborating on the RAR file type differences that you mentioned in the body of the question (v3 vs. v5).

Naming

One thing that makes your code hard for me to understand are the short names s and f.

It is common to pluralize the name of array variables. Instead of rar_list, consider rar_files.

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.