
fileA contains ~100k strings (people names, a-zA-Z only)

fileB has ~100M lines

Programs

There are only two programs:

  • replace a string with a single dot
  • replace a string with dots of the same length

Algorithm

for each lineB in fileB do
   for each lineA in fileA do
      if lineA matches with lineB; then
         replace the match in lineB with dots
         append the modified lineB' to file "res-length" or "res-single", depending on the program
      fi
   done
done
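
The nested loop above can be sketched directly in Python (a minimal in-memory version; the real task reads fileA and fileB from disk, and the function name is illustrative):

```python
# Minimal sketch of the nested-loop algorithm above (case-insensitive).
# Both output variants are produced in one pass.

def make_masks(names, lines, replace_char='.'):
    names = [n.lower() for n in names]
    res_single, res_length = [], []
    for line in lines:                      # for each lineB in fileB
        line_lower = line.lower()
        for name in names:                  # for each lineA in fileA
            i = line_lower.find(name)
            if i != -1:                     # lineA matches lineB
                before, after = line[:i], line[i + len(name):]
                res_single.append(before + replace_char + after)
                res_length.append(before + replace_char * len(name) + after)
    return res_single, res_length
```

With the example fileA and fileB shown later, this reproduces the contents of res-single and res-length (up to line order).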

The straightforward solution is very slow.

Matching should be case insensitive.

Any additional Linux tool (gawk, etc.) can be installed if needed.

Example

$ cat fileA
agnes
Ari
Vika

$ cat fileB
12vika1991
ariagnes#!
ari45
lera56er

The output should be two files, corresponding to each program:

$ cat res-single  # replace a string with a single dot
12.1991
.agnes#!
ari.#!
.45

$ cat res-length  # replace a string with dots of the same length
12....1991
...agnes#!
ari.....#!
...45

A simplified version of the task asks to output only the first match. So for program #2, instead of both ...agnes#! and ari.....#!, it's sufficient to output ari.....#! alone.

Simplified task algorithm

for each lineB in fileB do
   find the first lineA in fileA that matches lineB
   if lineA is found; then
      replace the match in lineB with dots
      append the modified lineB' to file "res-length" or "res-single", depending on the program
   fi
done
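
The simplified variant is the same nested loop with an early exit; a minimal in-memory sketch (the function name is hypothetical):

```python
# Simplified variant: stop at the first matching name per line.

def make_masks_first(names, lines, replace_char='.'):
    names = [n.lower() for n in names]
    res_single, res_length = [], []
    for line in lines:
        line_lower = line.lower()
        for name in names:
            i = line_lower.find(name)
            if i != -1:
                before, after = line[:i], line[i + len(name):]
                res_single.append(before + replace_char + after)
                res_length.append(before + replace_char * len(name) + after)
                break   # first match only
    return res_single, res_length
```

Which match is "first" depends on the order of names in fileA, so each line of fileB contributes at most one output line.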

Python implementation

import codecs

# WordListDefault, PATTERNS_PATH, MASKS_LENGTH and MASKS_SINGLE are
# project-specific constants defined elsewhere.
def create_masks(wordlist=WordListDefault.TOP1M.path, replace_char='.'):
    # fileA is assumed to be lowercase already
    names = PATTERNS_PATH.read_text().splitlines()

    masks_length = []
    masks_single = []
    with codecs.open(wordlist, 'r', encoding='utf-8', errors='ignore') as infile:
        for line in infile:
            line_lower = line.lower()
            for name in names:
                i = line_lower.find(name)
                if i != -1:
                    ml = f"{line[:i]}{replace_char * len(name)}{line[i + len(name):]}"
                    ms = f"{line[:i]}{replace_char}{line[i + len(name):]}"
                    masks_length.append(ml)
                    masks_single.append(ms)

    with open(MASKS_LENGTH, 'w') as f:
        f.writelines(masks_length)
    with open(MASKS_SINGLE, 'w') as f:
        f.writelines(masks_single)


if __name__ == '__main__':
    create_masks()

For a 1.6M-line fileB and a 1k-line fileA it takes about 3 min, which drops to only 10 sec after first filtering fileB with grep -iF -f fileA fileB > fileB.filtered.

@Ned64 was right in saying that the fastest approach would be plain C, but that is not the topic of this forum.

The current Python implementation would take 52 days to process 2B lines of fileB against 35k strings from fileA. I'm not sure anymore whether plain C could do this in an hour. I'm wondering if CUDA is the way to go ...

  • Please describe more clearly the rules the code should follow to replace text. Add information by editing your question. Commented Jun 30, 2020 at 21:08
  • Added. Is there still missing information? Commented Jun 30, 2020 at 21:19
  • I understand what you want now. However, unless you only match words/patterns at the beginning of the line (in which case sorting will help) you will be stuck at O(n²). Tip: Write it in C, it will be faster than your script. Commented Jun 30, 2020 at 21:20
  • One could perhaps speed up things by creating a hash table e.g. of two adjoining characters (=> hash code) and where they appear in the data (pointer to file in RAM). That should speed things up. Again, sounds more like C/Python/whatever than scripting. Commented Jun 30, 2020 at 21:33
  • I added a simplified version. Does it simplify the algorithm complexity and open a way for sed, awk, or perl commands? Commented Jun 30, 2020 at 21:34

3 Answers

$ cat tst.awk
BEGIN {
    dots = sprintf("%*s",1000,"")
    gsub(/ /,".",dots)
    resSingle = "res-single"
    resLength = "res-length"
}
{ lc = tolower($0) }
NR==FNR {
    lgth = length($0)
    str2lgth[lc] = lgth
    str2dots[lc] = substr(dots,1,lgth)
    next
}
{
    for (str in str2lgth) {
        if ( s=index(lc,str) ) {
            bef = substr($0,1,s-1)
            aft = substr($0,s+str2lgth[str])
            print bef "." aft > resSingle
            print bef str2dots[str] aft > resLength
        }
    }
}


$ awk -f tst.awk fileA fileB

$ cat res-single
12.1991
ari.#!
.agnes#!
.45

$ cat res-length
12....1991
ari.....#!
...agnes#!
...45

The above assumes that no line in fileA is longer than 1000 characters; if that's wrong, pick a bigger number, or we can add code to calculate it if necessary. It also assumes you don't care what order the lines from fileA are looked for in fileB, and that you want a string rather than a regexp comparison; again, both are trivial tweaks if that's not what you want.


EDIT in response to your comment below: here's how to modify the above if you can't statically define a max length for the lines from fileA (not even 100,000 chars?) and so need to compute the max, and the lines from fileA are all lower case:

NR==FNR {
    lgth = length($0)
    str2lgth[$0] = lgth
    maxLgth = (lgth > maxLgth ? lgth : maxLgth)
    next
}
FNR==1 {
    dots = sprintf("%*s",maxLgth,"")
    gsub(/ /,".",dots)
    for ( str in str2lgth ) {
        str2dots[str] = substr(dots,1,str2lgth[str])
    }
    resSingle = "res-single"
    resLength = "res-length"
}
{
    lc = tolower($0)
    for (str in str2lgth) {
        if ( s=index(lc,str) ) {
            bef = substr($0,1,s-1)
            aft = substr($0,s+str2lgth[str])
            print bef "." aft > resSingle
            print bef str2dots[str] aft > resLength
        }
    }
}
  • Nice! 3 sec and 02:05 min compared to 10 sec and 9:30 min of Python. Commented Jul 1, 2020 at 16:49
  • However, it throws a warning: "awk: tst.awk:7: (FILENAME=fileB FNR=606894) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale." Commented Jul 1, 2020 at 16:50
  • @dizcza google says... stackoverflow.com/q/40049546/1745001. So apparently you just need to set LC_ALL=C (which is almost always good advice anyway unless you have a specific reason not to). Commented Jul 1, 2020 at 16:53
  • Yeap. I should have googled it. Thanks. Commented Jul 1, 2020 at 18:57
  • The max calculation happens once per line of fileA, not once total for all of fileA, and if we can't have the dots string populated before we read fileA then, as you can see in the new script, we need to loop through every value we read from fileA again to populate the str2dots array instead of doing it when we read each line the first time and there we have to access the str2lgth array to get the length for that string when on the first pass we had it in the lgth scalar variable so it will impact performance even if not by much. Commented Jul 4, 2020 at 12:06

You may use a simple Perl based approach here.

Method:

Populate a hash %h whose keys are the lowercased lines (without newlines) of fileA and whose values are the equivalent number of dots.

Then, for every line of fileB, test whether any key of %h is present, case-insensitively. If so, print the prematch, match, and postmatch data to the res-single and res-length files. In case you want only the first match, uncomment the "last" statement.

$ perl -Mautodie -lne '
    BEGIN {
     open *{"FH$_"}, ">", qw[res-single res-length][$_] for 0..1;
     do{
       local @ARGV = pop;
       $h{do{chomp;lc;}} = s/././gr =~ tr/\n//dr while <>;
       @h = keys %h;
      };
    }
    for my $h ( @h ) {
      if ( /\Q$h/pi ) {
        my($p, $q) = (${^PREMATCH}, ${^POSTMATCH});
        print {*{"FH$_"}} $p, (".", $h{$h})[$_], $q for 0..1;
        #last;
      }
    }
' fileB fileA

$ more res-*

::::::::::::::
res-length
::::::::::::::
12....1991
ari.....#!
...agnes#!
...45

::::::::::::::
res-single
::::::::::::::
12.1991
ari.#!
.agnes#!
.45
  • I'm sorry to say, but your solution is slow, compared even to Python. For 1.6M fileB and 1k fileA it took about 10 min, compared to 3 min of straightforward python implementation. It seems like the fastest solution is C, which is not the topic of this forum. Thank you for trying though. Commented Jul 1, 2020 at 7:11

Optimized C solution https://github.com/dizcza/people-names-as-passwords/blob/master/src/create_masks.c

I used a trie data structure, which allowed me to parse 2B lines of fileB against 43k lines of fileA in only 12 minutes!
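
The linked C code isn't reproduced here; as an illustration only, the core trie idea can be sketched in Python: build the trie of lowercased names once, then walk it from each position of a line instead of re-testing every name.

```python
# Illustrative sketch of trie-based matching (not the linked C implementation).
# Building the trie is done once over fileA; each line of fileB is then
# scanned in O(len(line) * max_name_len) instead of O(len(line) * num_names).

def build_trie(names):
    root = {}
    for name in names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node['$'] = len(name)               # end marker stores match length
    return root

def find_match(trie, line):
    """Return (start, length) of the first trie match in line, or None."""
    line = line.lower()
    for i in range(len(line)):
        node = trie
        for j in range(i, len(line)):
            node = node.get(line[j])
            if node is None:
                break
            if '$' in node:
                return i, node['$']
    return None
```

Unlike the per-name loop, this returns the leftmost (and, at a given position, shortest) match, which is sufficient for the simplified first-match task; replacing the match with dots then proceeds as in the Python code in the question.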

Thanks everyone for the input.
