TL;DR
frawk -F ':' '/w:/{ white = $2 }
/b:/{ black = $2 }
/d:/{ draw = $2 }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += $0 } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += $0 } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += $0 } END { print "d:" s }')
Finishes in 2.53 seconds versus OP's (20.05 seconds), jano's (19.71 seconds) and Adam's (6.48 seconds) solutions.
I've inadvertently hit xargs's default max-chars limit while testing.
You may force the issue in several ways (a quick way to inspect the limit itself is sketched after this list):
- creating a directory with a very long path name to hold the data set and running your commands from there,
- increasing the size of your environment by populating it with random data:
  export __=$(perl -e 'print "x" x $ARGV[0]' 100000)
- decreasing the maximum stack size temporarily by running ulimit -S -s STACK_SIZE, where STACK_SIZE is some number lower than the default 8192, e.g. 256,
- letting the number of PGN files grow organically.
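Assuming the limit in question is the one GNU xargs computes from ARG_MAX minus the environment (a reasonable guess given the symptoms above), you can inspect it directly:
# Show the command-line length limits GNU xargs will work with in this environment
xargs --show-limits </dev/null
# The kernel's raw limit on arguments plus environment
getconf ARG_MAX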
Notice here I've added a BEGIN rule to initialize the array.
This is done to ensure the statistics will be correct:
if the collection of files fed to an invocation of awk is devoid of certain
outcomes (white wins, white loses, or a draw),
the slots counting the missing outcomes would output empty strings,
which when passed to the final aggregating AWK program
would result in misgrouping and tainting the counts.
You can see this effect come into play by changing the -n4 option of xargs
to -n1 or -n2 in Adam's solution.
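Here is a hypothetical, stripped-down illustration of the effect (not Adam's actual script): a batch that happens to contain no draws leaves the draw slot unset, so the per-batch summary line ends in an empty field and the final aggregation sees shifted columns.
# Two decisive games, no draws: without the BEGIN rule the last field is empty
printf '[Result "1-0"]\n[Result "0-1"]\n' |
mawk -F '[-"]' '{ ++a[$2] }
END { print a[1], a[0], a["1/2"] }'
# prints "1 1 " -- with BEGIN { a[1] = a[0] = a["1/2"] = 0 } it would print "1 1 0"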
A fixed string search for Result, while fast and simple,
is a bit naive.
In a prior version of the file Britbase/197008bcf.pgn,
there is a comment in the movetext that includes both the string Result and a score 1-0:
1. e4 c6 2. d4 g6 3. Nc3 Bg7 4. Be3 d6 5. Qd2 Nd7 6. f3 Qa5 7. Bc4 b5 8. Bb3 b4
9. Nd1 Ba6 10. Ne2 Ngf6 11. c3 c5 12. cxb4 cxb4 13. Nf2 O-O 14. O-O Rfc8 15.
Rfc1 Nb6 16. Nf4 Nc4 17. Bxc4 Bxc4 18. N4d3 Rab8 19. a3 Qb5 20. Nxb4 a5 21.
Nbd3 Nd7 22. b4 a4 23. Nb2 Bb3 24. Rxc8+ Rxc8 25. Rc1 Rxc1+ 26. Qxc1 Qe2 27.
Nfd1 {Result given as 1-0 in the tournament bulletin but it is clear from
published results and crosstables, as well as the position on the board (where
Black has mate in one) that the true result was 0-1 - JS, BritBase} 0-1
This is a rare occurrence, indeed, but I'd personally prefer the more explicit search pattern $'^\[Result "(1-0|0-1|1/2-1/2)"\]\r*$', which adheres to the PGN export format described in the PGN standard document.
It is still possible to have false negatives, but trying to match the versatile PGN import format is likely to hurt performance and needlessly increase complexity.
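For illustration only (the file may have been corrected since, so treat the path as a stand-in), the difference between the naive and the explicit search looks like this:
# Naive: counts every line containing the literal string "Result",
# including the one inside the movetext comment
rg --text --no-filename --count -F 'Result' Britbase/197008bcf.pgn
# Explicit: counts only well-formed result tags; --crlf tolerates a trailing CR
rg --text --no-filename --count --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' Britbase/197008bcf.pgn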
Now on to my first idea for possible performance improvements. I noticed the result lines we're interested in make up only a tiny fraction (less than 5%) of the contents of those PGN files, so I figured that filtering the input files for the result lines first via the expert tool grep might be more efficient than feeding them directly to awk, which does unnecessary bookkeeping that could accumulate considerable overhead. This was actually Adam's first optimization idea, too. I just spun it off by subsequently applying his last performance idea, finding faster replacements for the tools, which got me to ripgrep. ripgrep is an amazing piece of software that does parallel searches, which was Adam's second optimization idea, and I'm getting it for free!
Then I wondered, after all these years, is mawk still the fastest AWK? After 20 years of no activity, Mike Brennan, the author of mawk, released mawk 1.9.9.6, a beta release for mawk 2 based on mawk 1.3.3's code. Sadly, it didn't outperform mawk 1.3.4 in my testing. There is also frawk, a fast and mostly compatible AWK implementation written in Rust. It requires more steps to install, and in the end, with the exception of one very simple case of summing numbers, one per line, it runs slower in all backend, optimization and parallelization configurations.
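That one exception, in isolation (a quick illustration with synthetic input, not a rigorous benchmark):
# Sum one number per line -- the only shape of workload where frawk came out ahead here
seq 1000000 | frawk '{ s += $0 } END { print s }'
seq 1000000 | mawk '{ s += $0 } END { print s }'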
After ample experimentation, I came up with two solutions that are blazingly fast on modern hardware:
# rg.sh
rg --glob='*.pgn' --text --no-filename --no-line-number --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
# rg3.sh
frawk -F ':' '/w:/{ white = $2 }
/b:/{ black = $2 }
/d:/{ draw = $2 }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += $0 } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += $0 } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += $0 } END { print "d:" s }')
Before I present my benchmark results, I want to touch on one more idea that might improve performance, especially in setups with slower disks: searching through compressed archives. This is essentially trading disk seek time for CPU time. A reduction in file size and count (from many smaller files into one big, compressed tar file) means a reduction in disk I/O, which translates to faster execution time if it is the right trade-off.
We may download the data set from GitHub as a gzip file and decompress it using pigz (parallel gzip):
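For instance, assuming the data set is the rozim/ChessData repository (adjust the URL and file names otherwise):
# Fetch the repository snapshot as a gzip'ed tarball and unpack it with pigz
curl -L -o ChessData-master.tar.gz \
https://github.com/rozim/ChessData/archive/refs/heads/master.tar.gz
pigz -dc ChessData-master.tar.gz | tar -xf -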
This runs in 45 seconds on my machine. GZIP is not the most efficient lossless data compression format, nor is pigz the most parallel of decompressors. We could switch to Zstandard by recompressing the data set with pzstd:
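Something along these lines (file names are assumptions carried over from the sketch above):
# Recompress with pzstd so that decompression can also run in parallel
pigz -dc ChessData-master.tar.gz | pzstd -p "$(nproc)" -o ChessData-master.tar.zst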
This runs in 30 seconds on my machine and produces a 3.5 GB zstd file, which is 200 MB less than the 3.7 GB gzip source. Note that pzstd cannot fully benefit from parallel decompression if the zstd file is compressed using zstd, as zstd does not insert the additional markers needed for parallel decompression.
A solution to the problem statement utilizing pzstd is:
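A sketch, under the same file-name assumptions as above and with GNU tar assumed for --wildcards and -O (the actual script may differ):
# pzstd.sh (sketch)
pzstd -d -c ChessData-master.tar.zst |
tar -xOf - --wildcards '*.pgn' |
rg --text --no-filename --no-line-number --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'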
Benchmark setups:
- Normal: before each run, instruct the OS to drop caches:
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
- Cached: do not drop caches before each run.
- POSIX commands: only POSIX utilities are allowed;
substitute mawk and frawk with awk (gawk), and rg with grep.
Because their command-lines are incompatible, rg (ripgrep) could not simply be swapped for grep,
so rg.sh and rg3.sh had to be rewritten as grep.sh and grep3.sh (sketched just below).
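One plausible shape for such a rewrite (the actual grep.sh may differ; find feeds the files because the data set nests PGNs in subdirectories):
# grep.sh (sketch); no trailing $ anchor so CRLF line endings still match
find . -name '*.pgn' -exec cat {} + |
grep -E '^\[Result "(1-0|0-1|1/2-1/2)"\]' |
awk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[$2] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'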
And I'll quickly introduce the remaining contenders:
Note: adam2.sh is adam.sh patched to utilize all available cores
(-n4 -P4 becomes -n$n -P$n, where n=$(nproc)).
I didn't test for the best possible value for -n
as that would be ad hoc micro-optimization,
so I just made it the same as the number for -P.
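Schematically, the patch amounts to something like this, where count.awk stands in (hypothetically) for the per-batch awk program in Adam's script and its output still needs the final aggregation step:
# adam2.sh (sketch): batches of $n files, $n batches in flight at once
n=$(nproc)
find . -name '*.pgn' -print0 |
xargs -0 -n "$n" -P "$n" mawk -f count.awk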
Let's examine the median running times of each contender, in seconds.
ChessData-master:
ChessData-master, cached:
ChessData-master, POSIX commands:
LumbrasGigaBase:
The clear winner is rg3.sh,
and for POSIX commands, op2.sh.



