I have a strange issue with large files and bash.  This is the context:
- I have a large file: 75 GB and more than 400,000,000 lines (it is a log file, my bad, I let it grow).
- The first 10 characters of each line are a timestamp in the format YYYY-MM-DD.
 - I want to split that file: one file per day.
 
I tried the following script, which did not work. My question is about why this script does not work, not about alternative solutions.
while read line; do
  new_file=${line:0:10}_file.log
  echo "$line" >> $new_file
done < file.log
After debugging, I found the problem in the new_file variable.  This script:
while read line; do
  new_file=${line:0:10}_file.log
  echo $new_file
done < file.log | uniq -c
gives the result below (I replaced some digits with x to keep the data confidential; the other characters are real). Notice the dh and the shorter strings:
...
  27402 2011-xx-x4
  27262 2011-xx-x5
  22514 2011-xx-x6
  17908 2011-xx-x7
...
3227382 2011-xx-x9
4474604 2011-xx-x0
1557680 2011-xx-x1
      1 2011-xx-x2
      3 2011-xx-x1
...
     12 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1
      1 208--
      1 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1    
...
It is not a problem with the format of my file: cut -c 1-10 file.log | uniq -c gives only valid timestamps. Interestingly, the corresponding part of the output above becomes, with cut ... | uniq -c:
3227382 2011-xx-x9
4474604 2011-xx-x0
5722027 2011-xx-x1
We can see that after the line counted 4474604 times by uniq, my initial script started to fail.
Did I hit a limit in bash that I do not know about, did I find a bug in bash (it seems unlikely), or have I done something wrong?
Update:
The problem happens after reading 2 GB of the file. It seems read and redirection do not like files larger than 2 GB. But I am still searching for a more precise explanation.
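One way to check that figure (a sketch, and slow on a 75 GB file since it re-reads everything through the loop): re-emit the file with the read loop and let cmp report the first byte where the output diverges from the original. With IFS= read -r and printf, the loop should reproduce the file byte for byte (assuming the file ends with a newline), so the first difference marks where the misbehavior starts; the reported offset should be just above 2147483648, i.e. 2^31:

while IFS= read -r line; do printf '%s\n' "$line"; done < file.log | cmp - file.log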
Update2:
It definitely looks like a bug. It can be reproduced with:
yes "0123456789abcdefghijklmnopqrs" | head -n 100000000 > file
while read line; do file=${line:0:10}; echo $file; done < file | uniq -c
but this works fine as a workaround (it seems I have found a useful use of cat):
cat file | while read line; do file=${line:0:10}; echo $file; done | uniq -c 
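Presumably for the same reason, feeding the loop through process substitution also avoids the problem (a sketch, using the same generated file as above): the input is then a pipe instead of a seekable regular file, so bash should never take the seek-back code path that misbehaves:

while read line; do file=${line:0:10}; echo $file; done < <(cat file) | uniq -c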
A bug has been filed with GNU and Debian. Affected versions are bash 4.1.5 on Debian Squeeze 6.0.2 and 6.0.4.
echo ${BASH_VERSINFO[@]}
4 1 5 1 release x86_64-pc-linux-gnu
Update3:
Thanks to Andreas Schwab, who reacted quickly to my bug report, here is the patch that fixes this misbehavior. The impacted file is lib/sh/zread.c, as Gilles pointed out earlier:
diff --git a/lib/sh/zread.c b/lib/sh/zread.c
index 0fd1199..3731a41 100644
--- a/lib/sh/zread.c
+++ b/lib/sh/zread.c
@@ -161,7 +161,7 @@ zsyncfd (fd)
      int fd;
 {
   off_t off;
-  int r;
+  off_t r;
 
   off = lused - lind;
   r = 0;
The r variable is used to hold the return value of lseek. As lseek returns the offset from the beginning of the file, when that offset is over 2 GB the value narrowed into the int is negative, which causes the test if (r >= 0) to fail where it should have succeeded, breaking the read statement in bash.
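To see the sign flip concretely, here is a minimal sketch in bash arithmetic (bash integers are 64-bit, so the 32-bit narrowing is simulated by hand, and the offset value is made up):

off=$(( 2**31 + 12345 ))                   # an lseek return value just past 2 GiB
low=$(( off & 0xFFFFFFFF ))                # the low 32 bits kept by the assignment to 'int r'
r=$(( low >= 2**31 ? low - 2**32 : low ))  # reinterpreted as a signed 32-bit value
echo "off=$off r=$r"                       # r is negative, so the test (r >= 0) fails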