TL;DR
frawk -F ':' '/w:/{ white = $2 }
/b:/{ black = $2 }
/d:/{ draw = $2 }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += $0 } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += $0 } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += $0 } END { print "d:" s }')
Finishes in 2.53 seconds versus OP's (20.05 seconds), jano's (19.71 seconds) and Adam's (6.48 seconds) solutions.
I've inadvertently hit xargs's default max-chars limit while testing.
You may force the issue in several ways (a quick way to inspect the limit itself is sketched after this list):
- creating a directory with a very long path name to hold the data set and running your commands from there,
- increasing the size of your environment by populating it with random data:
  export __=$(perl -e 'print "x" x $ARGV[0]' 100000)
- decreasing the maximum stack size temporarily by running ulimit -S -s STACK_SIZE, where STACK_SIZE is some number lower than the default 8192, e.g. 256,
- letting the number of PGN files grow organically.
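Assuming the limit in question is the one GNU xargs computes from ARG_MAX minus the environment (a reasonable guess given the symptoms above), you can inspect it directly:
# Show the command-line length limits GNU xargs will work with in this environment
xargs --show-limits </dev/null
# The kernel's raw limit on arguments plus environment
getconf ARG_MAX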
Notice here I've added a BEGIN rule to initialize the array.
This is done to ensure the statistics will be correct:
if the collection of files fed to an invocation of awk is devoid of certain
outcomes (white wins, white loses, or a draw),
the slots counting the missing outcomes would output empty strings,
which when passed to the final aggregating AWK program
would result in misgrouping and tainting the counts.
You can see this effect come into play by changing the -n4 option of xargs
to -n1 or -n2 in Adam's solution.
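Here is a hypothetical, stripped-down illustration of the effect (not Adam's actual script): a batch that happens to contain no draws leaves the draw slot unset, so the per-batch summary line ends in an empty field and the final aggregation sees shifted columns.
# Two decisive games, no draws: without the BEGIN rule the last field is empty
printf '[Result "1-0"]\n[Result "0-1"]\n' |
mawk -F '[-"]' '{ ++a[$2] }
END { print a[1], a[0], a["1/2"] }'
# prints "1 1 " -- with BEGIN { a[1] = a[0] = a["1/2"] = 0 } it would print "1 1 0"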
A fixed string search for Result, while fast and simple,
is a bit naive.
In a prior version of the file Britbase/197008bcf.pgn,
there is a comment in the movetext that includes both the string Result and a score 1-0:
1. e4 c6 2. d4 g6 3. Nc3 Bg7 4. Be3 d6 5. Qd2 Nd7 6. f3 Qa5 7. Bc4 b5 8. Bb3 b4
9. Nd1 Ba6 10. Ne2 Ngf6 11. c3 c5 12. cxb4 cxb4 13. Nf2 O-O 14. O-O Rfc8 15.
Rfc1 Nb6 16. Nf4 Nc4 17. Bxc4 Bxc4 18. N4d3 Rab8 19. a3 Qb5 20. Nxb4 a5 21.
Nbd3 Nd7 22. b4 a4 23. Nb2 Bb3 24. Rxc8+ Rxc8 25. Rc1 Rxc1+ 26. Qxc1 Qe2 27.
Nfd1 {Result given as 1-0 in the tournament bulletin but it is clear from
published results and crosstables, as well as the position on the board (where
Black has mate in one) that the true result was 0-1 - JS, BritBase} 0-1
This is a rare occurrence, indeed, but I'd personally prefer the more explicit search pattern $'^\[Result "(1-0|0-1|1/2-1/2)"\]\r*$', which adheres to the PGN export format described in the PGN standard document.
It is still possible to have false negatives, but trying to match the versatile PGN import format is likely to hurt performance and needlessly increase complexity.
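For illustration only (the file may have been corrected since, so treat the path as a stand-in), the difference between the naive and the explicit search looks like this:
# Naive: counts every line containing the literal string "Result",
# including the one inside the movetext comment
rg --text --no-filename --count -F 'Result' Britbase/197008bcf.pgn
# Explicit: counts only well-formed result tags; --crlf tolerates a trailing CR
rg --text --no-filename --count --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' Britbase/197008bcf.pgn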
Now on to my first idea for possible performance improvements. I noticed the result lines we're interested in make up only a tiny fraction (less than 5%) of the contents of those PGN files, so I figured that filtering the input files for the result lines first via the expert tool grep might be more efficient than feeding them directly to awk, which does unnecessary bookkeeping that could accumulate considerable overhead. This was actually Adam's first optimization idea, too. I just spun it off by subsequently applying his last performance idea, finding faster replacements for the tools, which got me to ripgrep. ripgrep is an amazing piece of software that does parallel searches, which was Adam's second optimization idea, and I'm getting it for free!
Then I wondered, after all these years, is mawk still the fastest AWK? After 20 years of no activity, Mike Brennan, the author of mawk, released mawk 1.9.9.6, a beta release for mawk 2 based on mawk 1.3.3's code. Sadly, it didn't outperform mawk 1.3.4 in my testing. There is also frawk, a fast and mostly compatible AWK implementation written in Rust. It requires more steps to install, and in the end, with the exception of one very simple case of summing numbers, one per line, it runs slower in all backend, optimization and parallelization configurations.
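That one exception, in isolation (a quick illustration with synthetic input, not a rigorous benchmark):
# Sum one number per line -- the only shape of workload where frawk came out ahead here
seq 1000000 | frawk '{ s += $0 } END { print s }'
seq 1000000 | mawk '{ s += $0 } END { print s }'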
After ample experimentation, I came up with two solutions that are blazingly fast on modern hardware:
# rg.sh
rg --glob='*.pgn' --text --no-filename --no-line-number --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
# rg3.sh
frawk -F ':' '/w:/{ white = $2 }
/b:/{ black = $2 }
/d:/{ draw = $2 }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += $0 } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += $0 } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += $0 } END { print "d:" s }')
Before I present my benchmark results, I want to touch on one more idea that might improve performance, especially in setups with slower disks: searching through compressed archives. This is essentially trading disk seek time for CPU time. A reduction in file size and count (from many smaller files into one big, compressed tar file) means a reduction in disk I/O, which translates to faster execution time if it is the right trade-off.
We may download the data set from GitHub as a gzip file and decompress it using pigz (parallel gzip):
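For instance, assuming the data set is the rozim/ChessData repository (adjust the URL and file names otherwise):
# Fetch the repository snapshot as a gzip'ed tarball and unpack it with pigz
curl -L -o ChessData-master.tar.gz \
https://github.com/rozim/ChessData/archive/refs/heads/master.tar.gz
pigz -dc ChessData-master.tar.gz | tar -xf -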
This runs in 45 seconds on my machine. GZIP is not the most efficient lossless data compression format, nor is pigz the most parallel of decompressors. We could switch to Zstandard by recompressing the data set with pzstd:
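Something along these lines (file names are assumptions carried over from the sketch above):
# Recompress with pzstd so that decompression can also run in parallel
pigz -dc ChessData-master.tar.gz | pzstd -p "$(nproc)" -o ChessData-master.tar.zst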
This runs in 30 seconds on my machine and produces a 3.5 GB zstd file, which is 200 MB less than the 3.7 GB gzip source. Note that pzstd cannot fully benefit from parallel decompression if the zstd file is compressed using zstd, as zstd does not insert the additional markers needed for parallel decompression.
A solution to the problem statement utilizing pzstd is:
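A sketch, under the same file-name assumptions as above and with GNU tar assumed for --wildcards and -O (the actual script may differ):
# pzstd.sh (sketch)
pzstd -d -c ChessData-master.tar.zst |
tar -xOf - --wildcards '*.pgn' |
rg --text --no-filename --no-line-number --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'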
Benchmark setups:
- Normal: before each run, instruct the OS to drop caches:
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
- Cached: do not drop caches before each run.
- POSIX commands: only POSIX utilities are allowed;
substitute mawk and frawk with awk (gawk), and rg with grep.
Because their command-lines are incompatible, rg (ripgrep) could not simply be swapped for grep,
so rg.sh and rg3.sh had to be rewritten as grep.sh and grep3.sh (sketched just below).
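One plausible shape for such a rewrite (the actual grep.sh may differ; find feeds the files because the data set nests PGNs in subdirectories):
# grep.sh (sketch); no trailing $ anchor so CRLF line endings still match
find . -name '*.pgn' -exec cat {} + |
grep -E '^\[Result "(1-0|0-1|1/2-1/2)"\]' |
awk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'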
And I'll quickly introduce the remaining contenders:
Note: adam2.sh is adam.sh patched to utilize all available cores
(-n4 -P4 becomes -n$n -P$n, where n=$(nproc)).
I didn't test for the best possible value for -n
as that would be ad hoc micro-optimization,
so I just made it the same as the number for -P.
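Schematically, the patch amounts to something like this, where count.awk stands in (hypothetically) for the per-batch awk program in Adam's script and its output still needs the final aggregation step:
# adam2.sh (sketch): batches of $n files, $n batches in flight at once
n=$(nproc)
find . -name '*.pgn' -print0 |
xargs -0 -n "$n" -P "$n" mawk -f count.awk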
Let's examine the median running times of each contender, in seconds.
ChessData-master:
ChessData-master, cached:
ChessData-master, POSIX commands:
LumbrasGigaBase:
The clear winner is rg3.sh,
and for POSIX commands, op2.sh.



