Maybe you’re training a machine learning model on a really big dataset. Perhaps you’ve got a big database dump and you want to extract some information. Or maybe you’re crawling web scrapes or mining text files. Modern computers are really quite powerful for processing streams of data. You shouldn’t have to resort to a Hadoop cluster just to process data you want to use locally. There has to be a better way, right?
Why yes, there is! You can:
- Increase sequential processing speed enough to make it feasible on a single machine (i.e., speed hacks).
- Introduce an abstraction to decouple processing speed from the size of your data.
Let’s do both. In this post, we’ll show you how to sample lines from big data sources, out-of-core, as efficiently as possible on a laptop or workstation. No MapReduce required!
Why should you care?
- If you process datasets in sequential batches (e.g., using spreadsheet programs like Excel), you can do much better. For example, the maximum batch size for Excel is roughly 1 million records/rows. We routinely process datasets that are more than 5 orders of magnitude larger, at throughputs exceeding 1M records per second.
- Reading data into memory using Python? Let’s free that memory to do better things, like training machine learning models, or keeping lots of Chrome tabs open.
- You’ve already spent a lot of effort optimizing your machine learning models to train as fast as possible, and now you want to scale up to more data? Let’s rid your pipeline of sequence bias and make sure your models aren’t waiting on disk I/O.
- Using UNIX command-line tools like grep as filters in a data processing pipeline? Those tools traverse a file sequentially; you can do better.
- Sorting and hashing things? Computing hashes seems fast, until you do it billions of times…
- If you are me, in the future, and you’ve searched for this post online because, hey, it beats remembering everything…scroll down for the code!
This post has two parts.
If you’re here for “the answer” or the first two bullets resonate, try Part #1 first, and come back for Part #2 when you need more scale or speed! If you have already optimized your data pipeline and are looking for new tricks — and the latter bullets sound familiar — I recommend scanning Part #1, then digging into Part #2.
And if you’re me…
Image credit: Imgur/anonymous
Don’t have an efficient data processing pipeline? This is the place to start. There is no downside to improving performance…but incremental improvements probably aren’t enough to help you scale several orders of magnitude.
Write a simple reusable module that streams records efficiently from an arbitrarily large data source. Something like Python’s for line in file: idiom.
Requirement: Implemented in Python
To interface with machine learning frameworks (and everything else too).
Requirement: Use idioms as much as possible (i.e., be pythonic)
To maximize our ability to read, understand, modify, and share it.
Requirement: Works on any delimited, serializable data type
If you are storing the data on disk, and your data source is delimited into records, then we can read it.
Requirement: Be as fast as possible, at least 100+ MB/s throughput. Ideally 500+ MB/s
So that streaming data is not the slow/limiting step on any downstream processes.
So it is easy to put in a Python module or class, and reuse everywhere with predictable results.
Let’s start with the straightforward pythonic way to read a sequence of records from a file:
    def get_data(input_filename, delimiter = b','):
        with open(input_filename, 'rb') as f:    # read-only, binary mode
            for record in f:                     # traverse sequentially through the file
                x = record.split(delimiter)      # parsing logic goes here (binary, text, JSON, markup, etc)
                yield x                          # emit a stream of things
                                                 # (e.g., words in the line of a text file,
                                                 # or fields in the row of a CSV file)
Here we exploit Python’s generators, which are evaluated lazily, to slurp records sequentially (i.e., line after line) from the file on disk. By reading binary data, we can handle any arbitrary data type. However, you’ll need some knowledge about how to split the stream into records; since we assume text data above, the easy thing is to split on whitespace (i.e., a record is a word) or commas (i.e., a record is a field in a CSV file). You might also want to parse the records further, into fields, words, etc. This solution avoids reading the whole input into memory, is readable, and is simple enough that there’s really no point in wrapping it for reuse. Depending on your system/context, you can probably emit a stream of data at 100+ MB/s or better throughput. This is the straightforward pythonic way to do it. Yay, Python!
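To check whether a reader clears the 100+ MB/s bar on your own hardware, a rough timing harness is enough. The sketch below is an illustration, not part of the original code: measure_throughput is a hypothetical helper name, and it assumes the iterable it consumes yields bytes objects (for example, a file opened in binary mode).

```python
import time

def measure_throughput(records):
    """Consume an iterable of bytes records; return (record_count, MB_per_second)."""
    n_records = 0
    n_bytes = 0
    tic = time.perf_counter()
    for record in records:
        n_records += 1
        n_bytes += len(record)          # count raw bytes pulled off the stream
    elapsed = time.perf_counter() - tic
    # guard against a zero reading from a coarse timer on tiny inputs
    mb_per_s = (n_bytes / 1e6) / elapsed if elapsed > 0 else float('inf')
    return n_records, mb_per_s

# Usage (the filename is a placeholder):
# with open('big_file.csv', 'rb') as f:
#     print(measure_throughput(f))
```

Run it twice on the same file and expect the second pass to look faster, since the operating system will probably serve much of the file from its cache.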
The solution from Part #1 traverses the data source sequentially, yielding a predictable/ordered/biased stream of records. When you are feeding a machine learning model, this is not good. If the data source is small, you can use UNIX tools to pre-sort the data…but what if you can’t afford to pre-sort everything?
Part #2: Decouple processing speed from data traversal
By definition, the state of a sequential process depends on the previous state. For a random process, the state is independent of any previous state. Thus, by sampling randomly from the data source, we decouple the process of reading one record from the process of reading any other record. This is powerful because it enables two things:
- Reader function becomes essentially stateless. Stateless functions have nice benefits for bookkeeping overhead, robustness of machine learning models, parallelization, and more.
- When order doesn’t matter, scalability becomes a function of how many (concurrent) streams you instantiate; a stateless/randomized reader is a textbook case for concurrency.
How to do it?
- Quickly scan the file to identify the location of each record. Don’t actually load any data (for max speed).
- Store the locations (i.e., offsets) in an array. This is the sequential traversal path.
- Randomize the traversal path, by shuffling the array of offsets.
- (Optional) Divide the randomized path into N chunks to be sampled by N concurrent workers.
- Walk the path, reading and yielding the data at each location.
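The optional chunking step is small enough to sketch directly. This is one possible way to do it, under the assumption that the path is a plain Python list that has already been shuffled; split_path is a hypothetical name, and the offsets are toy values.

```python
def split_path(path, n_workers):
    """Deal an already-shuffled traversal path into n_workers near-equal chunks."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # worker i takes every n_workers-th offset, starting at position i
    return [path[i::n_workers] for i in range(n_workers)]

# Toy example: ten byte offsets, already in randomized order, shared by three workers.
path = [70, 10, 90, 30, 0, 50, 20, 80, 60, 40]
chunks = split_path(path, 3)
# chunks == [[70, 30, 20, 40], [10, 0, 80], [90, 50, 60]]
```

Striding through the path (rather than cutting it into contiguous blocks) keeps chunk sizes within one record of each other, and each worker still sees a random subset because the path was shuffled first.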
First we need to separate the function of constructing a traversal path through the file, from the function of emitting a stream of data. Constructing a traversal path can be done many ways. We’ll implement a straightforward one that is fast enough for anything hosted on a single machine. It runs in linear time, but does need to scan the whole file once up front. There may be faster ways to guess the right locations to seek in a file, but those are beyond scope here.
Requirement: As fast as possible
So you can handle data sources that can fit on a single machine (i.e., up to trillions of unique records, < 10 TB). Unless you are a tech giant with your own cloud/distributed hardware infrastructure (looking at you, Google!), this should cover the vast majority of cases where you are feeding machine learning models. It’ll be much faster than most models can compute.
Requirement: Works on input larger than available RAM
Because that’s when scalability really starts to get painful. We’ll solve this by memory-mapping the data source into a 64-bit address space.
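As a minimal sketch of what memory-mapping buys us: the snippet below maps a file read-only and pulls out a single record at a known byte offset, without reading anything before it. It assumes newline-delimited records, and read_record_at is a hypothetical helper name used only for illustration.

```python
import mmap

def read_record_at(filename, offset):
    """Return the newline-terminated record that starts at byte `offset`."""
    with open(filename, 'rb') as f:
        # length 0 maps the whole file; pages are loaded by the OS only when touched
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            mm.seek(offset)         # jump straight to the record
            return mm.readline()    # read up to and including the next newline
        finally:
            mm.close()
```

Opening and mapping the file on every call is wasteful for real workloads; in practice you would map once and seek many times. Note that mmap refuses to map an empty file, so guard for that case if it can occur.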
Requirement: Avoid sampling bias
If you are training a machine learning model, then you should be aware of the distributions of data you emit as input to your models. Biased distributions of training data can give undesirable results, even when the only difference is the point in training when the model gets to evaluate a given type of record. For example, tweets and Wikipedia articles may both be text data, but the distribution of words is very different between Twitter and Wikipedia. Training a model first on tweets, then on articles could have a different outcome than training a model first on articles, then on tweets. If your data sources are so big you cannot afford to hash, sort, or randomize your data on disk…how do you control the curriculum of data you are using to train your model?
[tweet_1, tweet_2, ..., tweet_n, article_1, article_2, ..., article_n]
[article_1, article_2, ..., article_n, tweet_1, tweet_2, ..., tweet_n]
By yielding a random sample from the data source, the problem of sampling bias is solved—in the limit of big data, random samples become unbiased. For reproducibility, simply pass a random seed to the randomization method.
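Reproducibility is easy to demonstrate. The sketch below uses a private random.Random instance from the standard library, so seeding does not disturb any other code that relies on the global random state; shuffled_path is a hypothetical name, and the offsets are stand-in integers.

```python
import random

def shuffled_path(offsets, seed):
    """Return a shuffled copy of `offsets`; the same seed always gives the same order."""
    path = list(offsets)                 # copy, so the caller's sequence is untouched
    random.Random(seed).shuffle(path)    # private RNG seeded for reproducibility
    return path

run_a = shuffled_path(range(100), seed=2)
run_b = shuffled_path(range(100), seed=2)
# run_a == run_b: two runs with the same seed traverse the data in the same order
```

Log the seed alongside your training run and you can replay exactly the same stream later.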
Requirement: Works in any data ingestion pipeline
The UNIX philosophy is a gift that keeps giving:
> Write programs that do one thing, and do it well.
> Write programs to work together.
> Write programs to handle text streams, because that is a universal interface.
We embrace the fundamental API of UNIX shell using operations that read from a data source and emit delimited text records. Practically speaking, the solution described here can be a drop-in replacement for cat in a UNIX processing pipeline: cat yields a sequential stream of text from a file, whereas here we yield a random stream of samples from a file.
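To make the cat comparison concrete, here is a rough sketch of a function that writes every line of a file to an output stream in random order. It is an illustration under assumptions, not the packaged tool: shuffled_cat is a hypothetical name, records are assumed to be newline-delimited, and the file is assumed to end with a newline.

```python
import mmap
import random

def shuffled_cat(filename, out, seed=None):
    """Write every newline-delimited record of `filename` to `out` in random order."""
    with open(filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            offsets = []
            pos = 0                               # the first record starts at byte 0
            for _ in iter(mm.readline, b''):      # readline returns b'' at end of file
                offsets.append(pos)
                pos = mm.tell()                   # the next record starts here
            random.Random(seed).shuffle(offsets)
            for pos in offsets:
                mm.seek(pos)
                out.write(mm.readline())          # emit raw bytes, newline included
        finally:
            mm.close()

# Usage as a pipeline stage (file and script names are placeholders):
#   import sys; shuffled_cat(sys.argv[1], sys.stdout.buffer)
#   $ python shuffled_cat.py corpus.txt | grep foo | head
```

Writing bytes to sys.stdout.buffer avoids decoding the data at all, which keeps the tool agnostic about text encoding.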
Requirement: Linear scaling
Overall, the solution here scales linearly, as O(n). This is no worse than an efficient sequential traversal, and we have earned important benefits: samples are unbiased and the get_data method is essentially stateless. Here’s the breakdown of runtime complexity:
- Scanning a file (sequentially) to find locations is O(n), with a relatively small constant. My workstation can scan 100M records in less than a minute. Do this once up front.
- Storing locations in an array. Negligible impact on runtime.
- Shuffle an array of locations (using Fisher-Yates method): O(n), with a very small constant
- Yield data for each record: O(n), with a relatively large constant. Probably dominated by parsing/processing logic. Do this once for each time you traverse your dataset.
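For reference, the Fisher-Yates shuffle mentioned above fits in a few lines, which makes the O(n) claim easy to see: one pass over the array, one swap per element. This is a teaching sketch; in practice random.shuffle or numpy's shuffle does the same job.

```python
import random

def fisher_yates_shuffle(items, rng=None):
    """Shuffle `items` in place in a single O(n) pass and return it."""
    if rng is None:
        rng = random.Random()
    # walk from the last index down to 1, swapping each slot with a
    # uniformly chosen slot at or before it
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)            # randint is inclusive on both ends
        items[i], items[j] = items[j], items[i]
    return items

offsets = list(range(0, 500, 10))        # toy byte offsets
fisher_yates_shuffle(offsets, random.Random(0))
```

Because the shuffle touches only the array of offsets, its cost depends on the number of records and not on how many bytes each record holds.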
    import mmap

    def get_offsets(input_filename):
        offsets = []
        with open(input_filename, 'rb') as f:
            # lazy eval-on-demand: the OS maps the file into a 64-bit address space
            mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)
            loc = 0                                   # the first record starts at byte 0
            for record in iter(mm.readline, b''):     # sentinel: readline returns b'' at end of file
                offsets.append(loc)                   # store the start of this record as a point on the traversal path
                loc = mm.tell()                       # the next record starts where this one ended
        return offsets  # alternatively, convert to a numpy `uint64` array for compactness and return the numpy array
First (sequential) pass through the data source, to get positions for the beginning of each line. Using memory-mapped virtual addressing allows us to read files much larger than available RAM.
    import mmap
    import random

    def get_data(input_filename, offsets, delimiter = b','):
        random.shuffle(offsets)                  # shuffle in-place (alternatively, use numpy)
        with open(input_filename, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)
            for position in offsets:             # traverse in randomized order this time
                mm.seek(position)                # seek to random location
                record = mm.readline()           # read the record
                x = record.split(delimiter)      # parse the record
                yield x
As above, yield data one record at a time. Only difference is traversing a randomized path this time…
Putting it all together
OK, now we have an idiomatic generator that yields a stream of records from an arbitrarily large data source.
Here’s the whole thing implemented as a modular/importable Python class.

    """
    An example class/module for instantiating a data stream with fast, random sampling.

    * User should define parsing logic (i.e., the `split_string` function) as required for a given data source.
    * User should probably use a logger rather than print.
    * If records are delimited by something other than newlines, user should modify to suit.
    * For a UNIXy command line tool, wrap this class with arg parsing, error handling, maybe multiprocessing, etc.
    """
    import os
    import time
    import mmap
    import numpy as np


    class WordStream(object):
        """
        Stream words from a corpus of newline-delimited text. Single-threaded version.
        Works on input larger than RAM, as long as the number of lines doesn't cause an int overflow.
        No worries if you're using a filesystem with 64-bit addressing.

        This version uses numpy for compactness of `uint64` arrays (for offsets).
        If you can't afford the numpy dependency but have memory to spare, plain python lists work too.

        Example usage:
            words = [x for x in WordStream('corpus.txt', shuffle = True)]
        """
        def __init__(self, source, offsets = None, shuffle = True, seed = 2, log_each = int(5e6)):
            np.random.seed(seed)
            self.source = source              # string defining a path to a file-like object
            self.log_each = int(log_each)     # int defining the logging frequency
            self.filesize = int(os.stat(source).st_size)
            print("Reading %d bytes of data from source: '%s'" % (self.filesize, self.source))
            if offsets is not None:           # `if offsets:` is ambiguous for numpy arrays
                print("Using offsets that were given as input")
            else:
                print("No pre-computed offsets detected, scanning file...")
                offsets = self.scan_offsets() # expect a numpy array to be returned here
            self.offsets = offsets
            if shuffle:
                np.random.shuffle(self.offsets)
                print("offsets shuffled using random seed: %d" % seed)

        def __iter__(self):
            """
            Yields a list of words for each line in the data source.

            If user wants to pass over the data multiple times (i.e., multiple epochs),
            shuffle the offsets each time, and pass the offsets explicitly when
            re-instantiating the generator.

            To keep concurrent workers busy, rewrite this as a generator that yields offsets
            into a queue (instead of looping over `self.offsets`)...then this generator can
            consume offsets from the queue.
            """
            with open(self.source, 'rb') as f:
                mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)
                for offset in self.offsets:       # traverse random path
                    try:
                        mm.seek(int(offset))      # plain int, in case offsets are numpy integers
                        line = mm.readline()
                    except ValueError:            # raised when seeking outside the file
                        print("Error at location: %d" % offset)
                        continue
                    if len(line) == 0:
                        continue                  # no point to returning an empty list
                    yield [words for words in self.split_string(line)]  # chain parsing logic/functions here

        def scan_offsets(self):
            """
            Scan file to find byte offsets
            """
            tic = time.time()
            tmp_offsets = []                      # python auto-extends this
            print("Scanning file '%s' to find byte offsets for each line..." % self.source)
            with open(self.source, 'rb') as f:
                i = 0                             # technically, this can grow unbounded...practically, not an issue
                mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)  # lazy eval-on-demand via POSIX filesystem
                pos = 0                           # the first line starts at byte 0
                for line in iter(mm.readline, b''):
                    tmp_offsets.append(pos)       # record where this line starts
                    pos = mm.tell()               # the next line starts where this one ended
                    i += 1
                    if i % self.log_each == 0:
                        print("%dM examples scanned" % (i / 1e6))
            toc = time.time()
            # convert to numpy array for compactness; uint32 is enough when the file is smaller than 4 GB
            offsets = np.asarray(tmp_offsets, dtype = 'uint64')
            del tmp_offsets                       # don't need this any longer, save memory
            print("...file has %d bytes and %d lines" % (self.filesize, i))
            print("%.2f seconds elapsed scanning file for offsets" % (toc - tic))
            return offsets

        def split_string(self, s):
            """
            Decodes a line of bytes (assumed here to be UTF-8) and splits it on whitespace;
            returns a list of strings. Parsing logic should go in functions like this,
            even though this one happens to be pretty trivial.
            """
            return s.decode('utf-8').split()
Finally, let’s put it to use! We can do something simple, like printing all the words that begin with “f” in lincoln_1861.txt:
    from fast_random_sampler import WordStream

    for words in WordStream("lincoln_1861.txt", shuffle = True):
        for word in words:
            if word.startswith("f"):
                print(word)
    '''
    Reading 21049 bytes of data from source: 'lincoln_1861.txt'
    No pre-computed offsets detected, scanning file...
    Scanning file 'lincoln_1861.txt' to find byte offsets for each line...
    ...file has 21049 bytes and 77 lines
    0.00 seconds elapsed scanning file for offsets
    offsets shuffled using random seed: 2
    fundamental
    for
    forever,
    for
    first
    fifteen
    for
    four
    ...
    far
    favorable
    followed
    fraternal
    fugitives
    from
    '''
- Works great on solid-state drives, where random access is fast. Spinning disk drives (i.e., HDDs) will incur a slowdown due to increased seek time.
- Randomly seeking to binary locations in a large file may prevent your disk cache (OS/firmware) from exploiting low-level optimizations.
- It’s in Python. That’s a pro or con depending on your perspective. For pure pipeline processing, a multithreaded C implementation would be nice.
- Obviously this is a bad idea if order between records must be preserved, for example, time-series data. However, even in these cases, granularity of data is important…you may have many independent time-series records, for example, a time series of stock prices for AAPL versus a time series of stock prices for MSFT.
Notes & Caveats:
- Context matters. Of course, “more data than you can process locally” has different meanings in different contexts. Using a desktop workstation with Samsung 850 Evo solid state drive, my real-world I/O limit is 500-560 MB/s. On a MacBook Pro with an older SSD, it’s somewhat slower. With the new PCIe SSD devices, you might be able to hit 2000 MB/s or more. Since it is very difficult to do complex processing (e.g., train a machine learning model) of any sort at 500 MB/s, practically speaking, most readers will find that sampling data at several hundred MB/s is “fast enough”. Anything faster than that, and computation, memory, and bandwidth are probably the limiting factors.
- How do I iterate more than once? Multiple possible solutions: (1) instantiate a new generator, with a different random seed, and call each one an epoch; (2) copy the list of offsets some number of times, concatenate, and shuffle. If you want an infinite stream of randomized samples from a fixed data source, the offset array could be a generator function.
- Operating systems and tools. If you are using a POSIX-compliant operating system like Linux, Mac/OSX, BSD, etc…then everything should work as written above. If you are using a Windows operating system, we recommend installing Cygwin to access a shell environment.
- Efficient arrays. If you don’t mind adding an extra dependency to the code, consider using Numpy data structures to store record locations, and the Numpy random method to shuffle the array. The runtime speedup will probably be negligible, but you’ll save memory, which could be significant on very large files.
- Concurrency. The solution here is well suited to a divide-and-conquer concurrent processing solution. Divide the array of offsets into N chunks and distribute amongst N workers.
- MapReduce. If you were expecting to see a “big data” essay about using MapReduce to process silly amounts of data in parallel, well, this isn’t that kind of thing, because there are already many of those lurking about online. This post was written for folks who are working on a local data source, and need to do more complex operations than map and reduce. Unless you are operating in the cloud for everything, sending data over the wire to a remote cluster may be prohibitive compared to a local read. Plus, unless you are deploying at scale, the mental overhead of configuring a MapReduce cluster just to process a bunch of data is probably unnecessary.
- What about _______? (big binary data, protobuf, Avro, etc.) A number of purpose-built tools exist for handling big streams of data. An old-school/hard-core programmer (you know, the folks who name their programs in ALLCAPS and write ANSI C) might put everything into big binaries, and slice directly into the data. But oh man, bugs can be painful, especially in a collaborative environment…to say nothing of maintainability or portability. Google’s protobufs and Apache’s Avro aim to solve the data serialization problem, but at the expense of forcing you into another abstraction or schema. Also, they are probably somewhat foreign to your data science/machine learning workflow. The solutions above are not a full-on engineering solution for data serialization. We present a nice little hack: a fast, lightweight, readable, reusable way to pull unbiased streams of data from your local data sources.
- Does this work for my ___ data? If you can store it on your local filesystem, chances are you can modify a class to implement whatever custom parsing logic you might need.
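The note on iterating more than once suggests turning the offset array into a generator. One way that might look, as a sketch: reshuffle the path before every pass and keep yielding offsets, either for a fixed number of epochs or forever. The name epoch_offsets and its parameters are hypothetical.

```python
import itertools
import random

def epoch_offsets(offsets, n_epochs=None, seed=0):
    """Yield offsets epoch after epoch, reshuffling before each pass.

    With n_epochs=None the stream never ends.
    """
    path = list(offsets)
    if not path:
        return                            # nothing to yield; avoids spinning forever
    rng = random.Random(seed)             # one RNG, so each epoch gets a fresh order
    epochs = itertools.count() if n_epochs is None else range(n_epochs)
    for _ in epochs:
        rng.shuffle(path)
        for offset in path:
            yield offset

# Two epochs over three toy offsets: every offset appears once per epoch.
two_epochs = list(epoch_offsets([0, 10, 20], n_epochs=2))
# An endless stream, truncated to its first five offsets.
first_five = list(itertools.islice(epoch_offsets([0, 10, 20]), 5))
```

Feed the yielded offsets to the seek-and-read loop in place of the fixed array, and the reader never needs to be re-instantiated between epochs.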
Wait, there’s no magic here! Aren’t all these tricks just sensible engineering?