I want to read all lines of a 1 GB file into a Stream as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()).
At first I thought this was already done in parallel, but it seems I was wrong:
When reading the file as it is, it takes about 50 seconds on my dual-core laptop.
However, if I split the file using bash commands and then process the parts in parallel, it only takes about 30 seconds.
I tried the following combinations:
single file, no parallel lines() stream ~ 50 seconds
single file, Files.lines(..).parallel().[...] ~ 50 seconds
two files, no parallel lines() stream ~ 30 seconds
two files, Files.lines(..).parallel().[...] ~ 30 seconds
I ran each of these four combinations multiple times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
The conclusion is that using lines().parallel() makes no difference. Since reading two files in parallel takes less time, there is a performance gain from splitting the file. However, it seems that a single file is always read serially.
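A minimal sketch of how the two single-file variants could be timed. The trim/non-empty steps are hypothetical placeholders for the elided [...] chain, and the class and method names are made up for illustration; only Files.lines, parallel() and toArray come from the description above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class LinesTiming {

    /** Runs a placeholder pipeline over one file and returns the surviving lines. */
    static String[] process(Path file, boolean parallel) throws IOException {
        // Files.lines keeps the file open until the stream is closed.
        try (Stream<String> lines = Files.lines(file)) {
            Stream<String> s = parallel ? lines.parallel() : lines;
            return s.map(String::trim)               // placeholder for the real map()
                    .filter(line -> !line.isEmpty()) // placeholder for the real filter()
                    .toArray(String[]::new);         // terminal operation triggers evaluation
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            int count = process(file, parallel).length;
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("parallel=%b: %d lines in %d ms%n", parallel, count, millis);
        }
    }
}
```

A single run like this includes JIT warm-up and file-cache effects, so repeating the measurement several times, as described above, is the more reliable approach.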
Edit:
I want to point out that I use an SSD, so there is practically no seek time. The file has 1658652 (relatively short) lines in total.
Splitting the file in bash takes about 1.5 seconds:
time split -l 829326 file # 829326 = 1658652 / 2
split -l 829326 file 0,14s user 1,41s system 16% cpu 9,560 total
So my question is: is there any class or function in the Java 8 JDK that can parallelize reading all lines without having to split the file first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.
Solution
You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).
If you set up a file channel in memory (for example, by memory-mapping the file), you should be able to process the data in parallel from there with great speed, but chances are you won't need the parallelism, as the in-memory access alone should already give you a huge speed increase.
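One way to read "a file channel in memory" is a memory-mapped FileChannel. The following is a rough sketch under that assumption, not a definitive implementation: it maps the file once, cuts it into newline-aligned slices, and decodes the slices in parallel. The class and method names are hypothetical, and the code assumes UTF-8 text, "\n" or "\r\n" line endings, and a file smaller than 2 GB (the limit of a single mapping).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class MappedLineReader {

    /** Reads all lines in file order, decoding newline-aligned slices in parallel. */
    static List<String> readLines(Path file, int chunks) throws IOException {
        if (chunks < 1) throw new IllegalArgumentException("chunks must be >= 1");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) return new ArrayList<>();
            if (size > Integer.MAX_VALUE) throw new IOException("too large for one mapping");
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int n = (int) size;

            // bounds[i] is the start of slice i; each inner boundary is moved
            // forward to just after the next '\n' so that no line is cut in two.
            int[] bounds = new int[chunks + 1];
            bounds[chunks] = n;
            for (int i = 1; i < chunks; i++) {
                int pos = Math.max(bounds[i - 1], (int) ((long) n * i / chunks));
                while (pos < n && buf.get(pos) != '\n') pos++;
                bounds[i] = Math.min(n, pos + 1);
            }

            // The range stream is ordered, so the collected list keeps file order.
            return IntStream.range(0, chunks).parallel()
                    .mapToObj(i -> decode(buf, bounds[i], bounds[i + 1]))
                    .flatMap(List::stream)
                    .collect(Collectors.toList());
        }
    }

    /** Copies bytes [from, to) out of the mapping and splits them into lines. */
    private static List<String> decode(ByteBuffer shared, int from, int to) {
        ByteBuffer view = shared.duplicate(); // independent position/limit per task
        view.position(from);
        view.limit(to);
        byte[] bytes = new byte[to - from];
        view.get(bytes);

        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                int end = (i > start && bytes[i - 1] == '\r') ? i - 1 : i;
                lines.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        if (start < bytes.length) { // last line without a trailing newline
            lines.add(new String(bytes, start, bytes.length - start, StandardCharsets.UTF_8));
        }
        return lines;
    }
}
```

A call such as MappedLineReader.readLines(path, Runtime.getRuntime().availableProcessors()) would use one slice per core. Splitting on the byte '\n' is safe for UTF-8 because that byte never appears inside a multi-byte sequence; other encodings would need a different boundary search. Whether this beats the split-file approach depends on how much of the time is spent on I/O versus on decoding and the map/filter work, so it is worth measuring.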