
I have a similar problem while trying to solve a very large sliding block puzzle. Currently I have to merge about 100 sorted files, each containing about 60 million positions and occupying 15 gigabytes. The files are individually sorted without duplicates, but different files can have the same record.

I wrote a utility in C++ which basically opens all the files and reads one record at a time from each. At each step it finds the alphabetically earliest one (using a Shell sort) and writes that record. It then reads the next record from that file and from any other files that had the same record. It ran for 5 hours on a new Mac laptop to get the answer.

Memory requirements are not large and each file is read only once. It runs far faster than a comm-based solution, which is limited to two files at a time and requires reading the same files multiple times.
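For illustration, here is a minimal sketch of that kind of multi-way merge with duplicate elimination. It is not the actual utility: it assumes one record per line, takes the input files as command-line arguments, and uses a plain linear scan to find the minimum instead of the Shell-sort bookkeeping described above.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Merge any number of individually sorted files into one sorted,
// duplicate-free stream on stdout. One record per line is assumed.
int main(int argc, char* argv[]) {
    std::vector<std::ifstream> in;
    std::vector<std::string> cur;   // current record from each file
    std::vector<bool> alive;        // whether the file still has records

    for (int i = 1; i < argc; ++i) {
        in.emplace_back(argv[i]);
        std::string line;
        bool ok = static_cast<bool>(std::getline(in.back(), line));
        cur.push_back(ok ? line : std::string());
        alive.push_back(ok);
    }

    while (true) {
        // Find the alphabetically earliest current record across all files.
        int best = -1;
        for (size_t i = 0; i < cur.size(); ++i)
            if (alive[i] && (best < 0 || cur[i] < cur[best]))
                best = static_cast<int>(i);
        if (best < 0) break;        // every file is exhausted

        const std::string rec = cur[best];
        std::cout << rec << '\n';   // write the record exactly once

        // Advance every file whose current record equals the one just
        // written, so duplicates that appear in several files are dropped.
        for (size_t i = 0; i < cur.size(); ++i)
            if (alive[i] && cur[i] == rec)
                alive[i] = static_cast<bool>(std::getline(in[i], cur[i]));
    }
    return 0;
}
```

The linear scan keeps the sketch short; with hundreds of files, replacing it with a heap (or the Shell-sort approach the utility uses) cuts the per-record cost of finding the minimum.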

The program has been compiled and run on two computers: the Mac laptop where it was originally developed, and an M1 Mac. The largest job run so far had 676 files, each with about 60 million records (about 1.5 GB per file), and it ran in a little over 10 hours.

Source code: bruceamos/comb
