I have a grep command
grep -Fvf cleaned1 cleanedR > cleaned2
that kills my PC because it uses too much RAM.
- cleanedR is a list of files (14 million of them) that I need to run some operation through (dowork.sh cleanedR); everything that has been completed is printed into cleaned1 (in a different sort order, so diff won't work)
- cleaned1 is a list of files (10 million)
- I had to cancel the dowork.sh operation to do something else, but I can resume it later through another list (dowork.sh cleaned2). cleaned2 doesn't exist yet.
- cleaned2 will be a list of the 4 million files I have yet to run dowork.sh through.
- Essentially I need to do this mathematical operation (it's a subtraction): list of files cleanedR - list of files cleaned1 = list of files cleaned2
cleaned1 and cleanedR contain absolute file paths, one per line; with millions of entries these are big files: cleaned1 is 1.3G and cleanedR is 1.5G.
I have about 30 G of RAM available, but grep used all of it and crashed.
I was wondering why grep needs so much RAM for this, and whether I can point it at a temp directory on disk instead. sort has that option with -T, so I was looking for something similar for grep.
I am open to other ideas.
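For example, since both files are just lists of whole lines, the same subtraction could be done with sort and comm instead of grep; sort -T keeps memory bounded by spilling to disk. This is only a sketch, and /mnt/tmp is a placeholder for whatever directory has enough free disk space:

export LC_ALL=C                                      # byte-wise ordering; comm needs both inputs sorted the same way
sort -T /mnt/tmp cleanedR > cleanedR.sorted
sort -T /mnt/tmp cleaned1 > cleaned1.sorted
comm -23 cleanedR.sorted cleaned1.sorted > cleaned2  # keep only lines that are in cleanedR but not in cleaned1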
-f reads the patterns from cleaned1, which contains millions of file names (1 per line) rather than a single regular expression. -F makes the match literal: filenames can be complex and grep could mistake some characters for regular-expression metacharacters, and we don't want that, so each line is matched as a fixed string. -v is the subtract / exclude operation.
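If keeping cleanedR's original order matters, the same exact-line exclusion can also be sketched with awk, which loads only cleaned1 into an associative array. It still holds cleaned1 in memory, so it may or may not fit comfortably in 30 G of RAM; treat it as an untested sketch:

# read cleaned1 into an array, then print only the lines of cleanedR that are not in it
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' cleaned1 cleanedR > cleaned2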
Or would grep simply take hours or days to finish (processing a 1.3 GB pattern file 14 million times -- what do you expect!)?