I have a strange issue with large files and bash.  This is the context:
- I have a large file: 75 GB and more than 400,000,000 lines (it is a log file, my bad, I let it grow).
- The first 10 characters of each line are a timestamp in the format YYYY-MM-DD.
 - I want to split that file: one file per day.
 
I tried the following script, which did not work. My question is about why this script does not work, not about alternative solutions.
while read line; do
  new_file=${line:0:10}_file.log
  echo "$line" >> $new_file
done < file.log
After debugging, I found the problem in the new_file variable.  This script:
while read line; do
  new_file=${line:0:10}_file.log
  echo $new_file
done < file.log | uniq -c
gives the result below (I replaced some digits with x to keep the data confidential; the other characters are real). Notice the dh and the shorter strings:
...
  27402 2011-xx-x4
  27262 2011-xx-x5
  22514 2011-xx-x6
  17908 2011-xx-x7
...
3227382 2011-xx-x9
4474604 2011-xx-x0
1557680 2011-xx-x1
      1 2011-xx-x2
      3 2011-xx-x1
...
     12 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1
      1 208--
      1 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1    
...
It is not a problem with the format of my file: cut -c 1-10 file.log | uniq -c gives only valid timestamps. Interestingly, the corresponding part of the output above becomes, with cut ... | uniq -c:
3227382 2011-xx-x9
4474604 2011-xx-x0
5722027 2011-xx-x1
We can see that after the line counted 4474604 times by uniq, my initial script started to fail.
Did I hit a limit in bash that I do not know about, did I find a bug in bash (it seems unlikely), or have I done something wrong?
Update:
The problem happens after reading 2 GB of the file. It seems read and redirection do not like files larger than 2 GB. But I am still searching for a more precise explanation.
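One way to check that figure (a sketch, and slow on a 75 GB file since it re-reads everything through the loop): re-emit the file with the read loop and let cmp report the first byte where the output diverges from the original. With IFS= read -r and printf, the loop should reproduce the file byte for byte (assuming the file ends with a newline), so the first difference marks where the misbehavior starts; the reported offset should be just above 2147483648, i.e. 2^31:

while IFS= read -r line; do printf '%s\n' "$line"; done < file.log | cmp - file.log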
Update2:
It definitely looks like a bug. It can be reproduced with:
yes "0123456789abcdefghijklmnopqrs" | head -n 100000000 > file
while read line; do file=${line:0:10}; echo $file; done < file | uniq -c
but this works fine as a workaround (it seems I have found a useful use of cat):
cat file | while read line; do file=${line:0:10}; echo $file; done | uniq -c 
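Presumably for the same reason, feeding the loop through process substitution also avoids the problem (a sketch, using the same generated file as above): the input is then a pipe instead of a seekable regular file, so bash should never take the seek-back code path that misbehaves:

while read line; do file=${line:0:10}; echo $file; done < <(cat file) | uniq -c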
A bug has been filed with GNU and Debian. Affected versions are bash 4.1.5 on Debian Squeeze 6.0.2 and 6.0.4.
echo ${BASH_VERSINFO[@]}
4 1 5 1 release x86_64-pc-linux-gnu
Update3:
Thanks to Andreas Schwab, who reacted quickly to my bug report, here is the patch that fixes this misbehavior. The impacted file is lib/sh/zread.c, as Gilles pointed out earlier:
diff --git a/lib/sh/zread.c b/lib/sh/zread.c
index 0fd1199..3731a41 100644
--- a/lib/sh/zread.c
+++ b/lib/sh/zread.c
@@ -161,7 +161,7 @@ zsyncfd (fd)
      int fd;
 {
   off_t off;
-  int r;
+  off_t r;
 
   off = lused - lind;
   r = 0;
The r variable is used to hold the return value of lseek. As lseek returns the offset from the beginning of the file, when that offset is over 2 GB the value narrowed into the int is negative, which causes the test if (r >= 0) to fail where it should have succeeded, breaking the read statement in bash.
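To see the sign flip concretely, here is a minimal sketch in bash arithmetic (bash integers are 64-bit, so the 32-bit narrowing is simulated by hand, and the offset value is made up):

off=$(( 2**31 + 12345 ))                   # an lseek return value just past 2 GiB
low=$(( off & 0xFFFFFFFF ))                # the low 32 bits kept by the assignment to 'int r'
r=$(( low >= 2**31 ? low - 2**32 : low ))  # reinterpreted as a signed 32-bit value
echo "off=$off r=$r"                       # r is negative, so the test (r >= 0) fails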