fileA contains ~100k strings (people's names, a-zA-Z only)
fileB has ~100M lines
Programs
There are only two programs:
- replace a string with a single dot
- replace a string with dots of the same length
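For concreteness, here is a minimal sketch of the two replacement modes in Python (the helper names are mine, not part of the task):

def mask_single(line, start, length):
    # program 1: the whole match becomes a single dot
    return line[:start] + "." + line[start + length:]

def mask_length(line, start, length):
    # program 2: one dot per matched character
    return line[:start] + "." * length + line[start + length:]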
Algorithm
for each lineB in fileB do
    for each lineA in fileA do
        if lineA matches lineB; then
            replace the match in lineB with dots
            append the modified lineB' to file "res-length" or "res-single", depending on the program
        fi
    done
done
The straightforward solution is very slow.
Matching should be case insensitive.
Any additional Linux tool (gawk, etc.) can be installed.
Example
$ cat fileA
agnes
Ari
Vika
$ cat fileB
12vika1991
ariagnes#!
ari45
lera56er
The output should be two files, one for each program:
$ cat res-single # replace a string with a single dot
12.1991
.agnes#!
ari.#!
.45
$ cat res-length # replace a string with dots of the same length
12....1991
...agnes#!
ari.....#!
...45
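These expected outputs can be sanity-checked in memory with the helpers sketched above (file I/O omitted; the two lines for ariagnes#! come out in fileA order here):

fileA = ["agnes", "Ari", "Vika"]
fileB = ["12vika1991", "ariagnes#!", "ari45", "lera56er"]

res_single, res_length = [], []
names = [a.lower() for a in fileA]
for line in fileB:
    low = line.lower()
    for name in names:
        i = low.find(name)
        if i != -1:
            res_single.append(mask_single(line, i, len(name)))
            res_length.append(mask_length(line, i, len(name)))
print("\n".join(res_single))
print("\n".join(res_length))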
A simplified version of the task asks to output only the first match. So for program #2, instead of both ...agnes#! and ari.....#!, it is sufficient to output only ari.....#!
Simplified task algorithm
for each lineB in fileB do
    find the first lineA in fileA that matches lineB
    if lineA is found; then
        replace the match in lineB with dots
        append the modified lineB' to file "res-length" or "res-single", depending on the program
    fi
done
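A sketch of the simplified variant in Python; the only change from the full task is returning after the first matching name:

def mask_first_match(line, names, single=False):
    # names are assumed pre-lowercased, in fileA order
    low = line.lower()
    for name in names:
        i = low.find(name)
        if i != -1:
            dots = "." if single else "." * len(name)
            return line[:i] + dots + line[i + len(name):]
    return None  # no name from fileA occurs in this line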
Python implementation
import codecs

# WordListDefault, PATTERNS_PATH, MASKS_LENGTH and MASKS_SINGLE are
# project-specific constants (paths to fileB, fileA and the two output files).

def create_masks(wordlist=WordListDefault.TOP1M.path, replace_char='.'):
    # fileA, lowercased once so that matching is case-insensitive
    names = [n.lower() for n in PATTERNS_PATH.read_text().splitlines()]
    masks_length = []
    masks_single = []
    with codecs.open(wordlist, 'r', encoding='utf-8', errors='ignore') as infile:
        for line in infile:
            line_lower = line.lower()
            for name in names:
                i = line_lower.find(name)
                if i != -1:
                    # replace the first occurrence of this name only
                    ml = f"{line[:i]}{replace_char * len(name)}{line[i + len(name):]}"
                    ms = f"{line[:i]}{replace_char}{line[i + len(name):]}"
                    masks_length.append(ml)
                    masks_single.append(ms)
    with open(MASKS_LENGTH, 'w') as f:
        f.writelines(masks_length)
    with open(MASKS_SINGLE, 'w') as f:
        f.writelines(masks_single)

if __name__ == '__main__':
    create_masks()
For a 1.6M-line fileA and a 1k-line fileB it takes about 3 min; pre-filtering fileB first with grep -iF -f fileA fileB > fileB.filtered cuts this to about 10 sec.
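The two-step pipeline can be scripted, e.g. (a sketch assuming fileA/fileB are the literal file names; grep exits non-zero when nothing matches, hence check=False):

import subprocess

with open("fileB.filtered", "w") as out:
    # keep only the lines of fileB that contain at least one name from fileA
    subprocess.run(["grep", "-iF", "-f", "fileA", "fileB"], stdout=out, check=False)
create_masks(wordlist="fileB.filtered")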
@Ned64 was right in saying the fastest approach would be plain, straightforward C, which is off topic for this forum.
The current Python implementation would take about 52 days to process 2B lines of fileB against 35k strings from fileA. I'm not sure any more whether plain C could do this in an hour, and I'm wondering whether CUDA is the way to go ...
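For reference, one pure-Python speed-up worth trying before C or CUDA is to compile all of fileA into a single case-insensitive alternation, so the per-line scan runs inside the regex engine instead of a Python loop over 35k names. This is a sketch, not the implementation above; for even larger fileA, an Aho-Corasick automaton (e.g. the pyahocorasick package) is the usual next step:

import re

def compile_names(names):
    # longer names first, so the alternation prefers the longest match
    # starting at a given position
    alts = sorted(names, key=len, reverse=True)
    return re.compile("|".join(map(re.escape, alts)), re.IGNORECASE)

def mask(pattern, line, replace_char="."):
    # returns (length-preserving mask, single-dot mask) for the first match,
    # or (None, None) if no name occurs in the line
    m = pattern.search(line)
    if m is None:
        return None, None
    head, tail = line[:m.start()], line[m.end():]
    return head + replace_char * (m.end() - m.start()) + tail, head + replace_char + tail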