Reposted from: https://towardsdatascience.com/what-is-npy-files-and-why-you-should-use-them-603373c78883
Introduction
First of all, thanks a lot to my followers for sticking with me these last few months. I’ve been terribly busy and haven’t had a lot of time to pump out articles. I’ve decided that a partial remedy for this is to write some shorter, easier-to-digest articles, which are also quicker to produce! This is my first attempt at a short-and-to-the-point article.
I hope you find it useful!
Make sure to follow my profile if you enjoy this article and want to see more!
The TL;DR:
Reading a 10 million data-point file from storage:
The results speak for themselves.
🥇 1st place: .npy files with time: 0.13 seconds
This is by far the fastest method of loading in data.
🥈 2nd place: .csv files with time: 2.66 seconds
Pandas proved that .csv files aren’t useless, but they’re still lacking in speed.
🥉 3rd place: .txt files with time: 9.67 seconds
This is just so slow compared to the others that it’s painful.
Why .npy and Numpy?
If you’ve ever done any kind of data processing in Python, you’ve undoubtedly come across Numpy and Pandas. These are the giants of Data Science in Python and stand as the foundation for a lot of other packages; in particular, Numpy provides the fundamental array object used by the likes of Scikit-Learn and Tensorflow!
So why am I talking about these packages, and why Numpy in particular? Well, as you might know, the “industry standard” for data files is the .csv file. While convenient, these files are highly un-optimized compared to the alternatives, such as the .npy format provided courtesy of Numpy.
“Who cares, let’s see the code and evidence!”
Right, on with the show!
The code
Let’s start off by simply creating 10 million random points of data and save it as Comma Separated Values:
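The original snippet isn’t shown here, so below is a minimal sketch of what it could look like. The file name `data.txt` is an assumption, and the array size is scaled down from the article’s 10 million points so the sketch runs quickly:

```python
import numpy as np

# The article uses 10_000_000 points; reduced here so the
# sketch runs quickly. Scale N_POINTS up to reproduce it.
N_POINTS = 1_000_000

# Generate random data points and save them as
# comma-separated values in a plain text file.
data = np.random.uniform(size=N_POINTS)
np.savetxt('data.txt', data, delimiter=',')
```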
Now let’s load this by traditional means and do a simple reshaping of the data:
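The traditional load could look something like the following sketch; the file name and timing style are assumptions, and the size is again scaled down from the article’s 10 million points (which it reshapes to 10000-by-1000):

```python
import time
import numpy as np

# Recreate the text file so this snippet runs standalone
# (scaled down from the article's 10 million points).
np.savetxt('data.txt', np.random.uniform(size=1_000_000), delimiter=',')

start = time.time()
# np.loadtxt parses every value from text, which is what
# makes this approach so slow on large files.
data_array = np.loadtxt('data.txt', delimiter=',').reshape(1_000, 1_000)
print(f'.txt load time: {time.time() - start:.2f} seconds')
```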
This is the output I get:
Almost 10 seconds to load!
Now, you might think that the reshaping is preventing a faster load but even if we don’t do any reshaping we get a similar time!
So now that we have our 10000-by-1000-array, let’s go ahead and save it as an .npy file:
np.save('data.npy', data_array)
That was pretty easy, right? So now that we have our array in .npy format, let’s see how fast we can read it back in:
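A sketch of the timed read, assuming the same timing approach as above (and again scaled down from the article’s 10000-by-1000 array):

```python
import time
import numpy as np

# Create the .npy file so this snippet runs standalone
# (scaled down from the article's 10000-by-1000 array).
np.save('data.npy', np.random.uniform(size=(1_000, 1_000)))

start = time.time()
# np.load reads the raw binary buffer directly; the shape
# and dtype are stored in the file's header, so no
# parsing or reshaping is needed.
data_array = np.load('data.npy')
print(f'.npy load time: {time.time() - start:.2f} seconds')
```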
Which gives me the following output:
Wow! More than 70x faster!
A LOT faster. Also notice that we didn’t need to reshape the data, since that information is stored in the .npy file itself.
Another “minor” feature of .npy files is the reduced storage footprint. In this case it’s more than a 50% reduction in size. This can vary a lot, but in general .npy files are more storage-friendly.
“What about Pandas and their .csv handling?”
Let’s find out!
First, let’s create a proper .csv file for Pandas to read; this is the most likely real-life scenario.
data = pd.DataFrame(data_array)
data.to_csv('data.csv', index = None)
This simply saves the ‘data_array’ we created before as a standard .csv file without index.
Now let’s load it and see what kind of time we get:
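A sketch of the timed .csv read, with the same assumed timing style and the size scaled down from the article’s 10000-by-1000 array:

```python
import time
import numpy as np
import pandas as pd

# Write the .csv so this snippet runs standalone
# (scaled down from the article's 10000-by-1000 array).
pd.DataFrame(np.random.uniform(size=(1_000, 1_000))).to_csv('data.csv', index=None)

start = time.time()
# pandas' C parser is much faster than np.loadtxt,
# but it still has to parse every value from text.
data = pd.read_csv('data.csv')
print(f'.csv load time: {time.time() - start:.2f} seconds')
```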
Which gives me the following output:
2.66 seconds. Faster than the standard .txt read, but still a snail’s pace compared to the .npy file!
Now you might think this is cheating, because we’re also loading into a Pandas DataFrame, but it turns out that the time lost doing so is negligible. If we read it in like this:
data_array = np.load('data.npy')
data = pd.DataFrame(data_array)
And time it we get the following:
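A timed version of that read could be sketched like this (same assumed timing style, size scaled down as before):

```python
import time
import numpy as np
import pandas as pd

# Create the .npy file so this snippet runs standalone
# (scaled down from the article's 10000-by-1000 array).
np.save('data.npy', np.random.uniform(size=(1_000, 1_000)))

start = time.time()
# Wrapping the loaded array in a DataFrame is cheap:
# pandas can reuse the array's memory rather than copy it.
data = pd.DataFrame(np.load('data.npy'))
print(f'.npy -> DataFrame time: {time.time() - start:.2f} seconds')
```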
Almost no difference from loading without the DataFrame.
The take-away
You’re probably used to loading and saving data as .csv, but the next time you do a data-science project, try getting into the habit of saving and loading .npy files instead! It’ll save you a lot of downtime and annoyance while you wait for the kernel to load your file!