opening and writing a large binary file python

Question

I have a homebrew web based file system that allows users to download their files as zips; however, I found an issue while dev'ing on my local box not present on the production system.

In linux this is a non-issue (the local dev box is a windows system).

I have the following code

algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')

data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    temp_file.write(dec_data)
    data = file.read(settings.READ_SIZE)

# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()

The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).

My issue is on the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since windows cares a metric ton about binary files if I don't specify 'rb' the file will not read to end on the data read loop; however, for some reason since I am also writing to temp_file it never completely reads to the end of the file...UNLESS i add a + after the b 'rb+'.

if i change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+') everything works as desired and the code successfully scrapes the entire binary file and decrypts it. If I do not add the plus it fails and cannot read the entire file...

Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):

algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')

filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'

data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    response.write(dec_data)
    data = file.read(settings.READ_SIZE)

# no dumps to be taken when finishing up the decrypt process... 
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()

Clarification

The cipher error is likely because the file was not read in its entirety. For example, I have a 500MB file I am reading in at 64*1024 bytes at a time. I read until I receive no more bytes, when I don't specify b in windows it cycles through the loop twice and returns some crappy data (because python thinks it is interacting with a string file not a binary file).

When I specify b it takes 10-15 seconds to completely read in the file, but it does it succesfully, and the code completes normally.

When I am concurrently writing to another file as i read in from the source file (as in the first example) if I do not specify rb+ it displays the same behavior as not even specifying b which is, that it only reads a couple segments from the file before closing the handle and moving on, i end up with an incomplete file and the decryption fails.

What does "takes a dump here", "will not read to end", etc., mean? What exactly happens? — abarnert
– abarnert, Commented Sep 6, 2013 at 19:27
@abarnert clarification added. - to note, I know that adding + seems to fix this, but I want to know why it does on windows and what is going on when it opens the file handle. The python docs in this area are quite weak. — Mike McMahon
– Mike McMahon, Commented Sep 6, 2013 at 19:34
In the other section of code there's no decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:]), so is it still using the decrypt from the first section? — martineau
– martineau, Commented Sep 6, 2013 at 19:41
@mart it's creating another decrypt object, but that's unimportant to the question at hand. The cipher issue is a result of python not properly reading the contents of a file unless i specify mode 'rb+' where 'rb' should suffice... — Mike McMahon
– Mike McMahon, Commented Sep 6, 2013 at 19:43
@MikeMcMahon: I'm confused by your clarification. You say that without the b it cycles through the loop twice, and then you say that with b but not + it does the same thing as without b, and then you say that it reads a couple segments and then closes the handle. That's contradictory. And neither one makes any sense. The only difference between r and rb is that r will translate line endings (so each pair of bytes that matches 0D 0A will be replaced by a single byte 0A). — abarnert
– abarnert, Commented Sep 6, 2013 at 20:02

abarnert · Accepted Answer · 2013-09-06 22:31:17Z

I'm going to take a guess here:

You have some other program that's continually replacing the files you're trying to read.

On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.

On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.

Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.

In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…

Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*

* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.

i'll investigate with the process explorer tools, but i cannot find anything that is altering the files (or continuously writing to them)..
while i can't find a race condition you did explain the difference between r+b and rb best and why it would cause a difference i how the file was handled (locking it vs sharing). This is mostly a trivial issue at this point since I can't seem to find what may be touching it (as nothing should) and because we run this primarily in a *nix environment where (as you said) it does support atomic file operations.
@MikeMcMahon: If it's on a local drive, FindFirstChangeNotification is the best way to find intermittent accesses to the file, and there are tools that wrap that up nicely (I think sysinternals has a command-line tool, and it's in one of their GUI tools). If it's on an SMB share, that's a whole other ball of wax, but hopefully you don't have to deal with that.

Collectives™ on Stack Overflow

opening and writing a large binary file python

Clarification

1 Answer 1

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Clarification

1 Answer 1

3 Comments

Linked

Related