I'm running into a problem while trying to load large files using Python 3.5. Using read() with no arguments sometimes gave an OSError: Invalid argument. I then tried reading only part of the file and it seemed to work fine. I've determined that it starts to fail somewhere around 2.2GB, below is the example code:
>>> sys.version
'3.5.1 (v3.5.1:37a07cee5969, Dec 5 2015, 21:12:44) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]'
>>> x = open('/Users/username/Desktop/large.txt', 'r').read()
Traceback (most recent call last):
File "", line 1, in
OSError: [Errno 22] Invalid argument
>>> x = open('/Users/username/Desktop/large.txt', 'r').read(int(2.1*10**9))
>>> x = open('/Users/username/Desktop/large.txt', 'r').read(int(2.2*10**9))
Traceback (most recent call last):
File "", line 1, in
OSError: [Errno 22] Invalid argument
I also noticed that this does not happen in Python 2.7. Here is the same code run in Python 2.7:
>>> sys.version
'2.7.10 (default, Aug 22 2015, 20:33:39) \n[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.1)]'
>>> x = open('/Users/username/Desktop/large.txt', 'r').read(int(2.1*10**9))
>>> x = open('/Users/username/Desktop/large.txt', 'r').read(int(2.2*10**9))
>>> x = open('/Users/username/Desktop/large.txt', 'r').read()
>>>
I am using OS X El Capitan 10.11.1.
Is this a bug or should use another method for reading the files?
解决方案
Yes, you have bumped into a bug.
Good news is that someone else has also found it and already created an issue for it in the Python bug tracker, see: Issue24658 - open().write() fails on 2 GB+ data (OS X). This, seems, is platform depended (OS-X only) and is reproducible when using read and/or write. Apparently an issue exists with the way fread.c is implemented in the libc implementation for OS-X see here.
Bad News is that it is still open (and, currently, inactive) so, you'll have to wait until it is resolved. Either way, you can still take a look at the discussion there if you're interested for the specifics.
Regardless, I'm pretty sure you can side-step the issue, until it is fixed, by reading in chunks and chaining the chunks during processing and doing the same when writing. Clunky but, it might do the trick.