How to write super fast file streaming code in C#?

From: http://stackoverflow.com/questions/955911/how-to-write-super-fast-file-streaming-code-in-c?answertab=votes#tab-top

I have to split a huge file into many smaller files. Each destination file is defined by an offset and a length in bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

  1. Is there a way to connect the Writer directly to the Reader? (i.e. without actually loading the contents into a buffer in memory)

 
File.OpenRead and File.OpenWrite 100,000 times will be slow all right... –  siz  Jun 5 '09 at 14:13
 
Are you splitting the file perfectly, i.e. could you rebuild the large file by just joining all the small files together? If so there are savings to be had there. If not, do the ranges of the small files overlap? Are they sorted in order of offset? –  jamie  Jan 29 '11 at 5:39

9 Answers

Accepted answer (24 upvotes)

I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:

public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];

    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:

public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

Note that this also closes the output stream (due to the using statement) which your original code didn't.

The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.

I think it'll be significantly faster, but obviously you'll need to try it to see...
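For illustration, a caller reusing a single input stream might look like the sketch below. The file name `huge.dat`, the chunk table, and the `part{i}.dat` naming scheme are all hypothetical; real code would take the offsets and lengths from wherever the split definition lives. `CopySection` is reproduced from the answer above so the sketch compiles on its own.

```csharp
using System;
using System.IO;

class Splitter
{
    // Same shape as the answer's CopySection, reproduced so this compiles alone.
    public static void CopySection(Stream input, string targetFile,
                                   int length, byte[] buffer)
    {
        using (Stream output = File.OpenWrite(targetFile))
        {
            int bytesRead = 1;
            while (length > 0 && bytesRead > 0)
            {
                bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
                output.Write(buffer, 0, bytesRead);
                length -= bytesRead;
            }
        }
    }

    static void Main()
    {
        // Hypothetical source file and chunk table for demonstration only.
        File.WriteAllBytes("huge.dat", new byte[3500]);
        int[] lengths = { 1000, 2000, 500 };

        byte[] buffer = new byte[8192];

        // Open the source once; with contiguous chunks the stream position
        // advances naturally from one section to the next, so no Seek is needed.
        using (Stream input = File.OpenRead("huge.dat"))
        {
            for (int i = 0; i < lengths.Length; i++)
                CopySection(input, "part" + i + ".dat", lengths[i], buffer);
        }
    }
}
```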

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
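A minimal sketch of that last suggestion, assuming a 64 KB buffer (the file name and buffer size are placeholders to tune for your workload):

```csharp
using System.IO;

class BufferedExample
{
    static void Main()
    {
        // Hypothetical source file for demonstration only.
        File.WriteAllBytes("huge.dat", new byte[100000]);

        using (Stream raw = File.OpenRead("huge.dat"))
        using (Stream input = new BufferedStream(raw, 64 * 1024))
        {
            // Pass "input" to CopySection exactly as before: many small reads
            // are now served from one 64 KB in-memory buffer instead of each
            // read going all the way to the OS.
            byte[] tiny = new byte[16];
            int read = input.Read(tiny, 0, tiny.Length);
        }
    }
}
```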

 
I know this is a two year old post, just wondered... is this still the fastest way? (i.e. Nothing new in .Net to be aware of?). Also, would it be faster to perform the Math.Min prior to entering the loop? Or better yet, to remove the length parameter as it can be calculated by means of the buffer? Sorry to be picky and necro this! Thanks in advance. –  Smudge202  Apr 19 at 9:13
 
@Smudge202: Given that this is performing IO, the call to Math.Min is certainly not going to be relevant in terms of performance. The point of having both the length parameter and the buffer length is to allow you to reuse a possibly-oversized buffer. –  Jon Skeet  Apr 19 at 9:20
 
Gotcha, and thanks for getting back to me. I'd hate to start a new question when there is likely a good enough answer right here, but would you say, that if you wanted to read the first x bytes of a large number of files (for the purpose of grabbing the XMP metadata from a large number of files), the above approach (with some tweaking) would still be recommended? –  Smudge202  Apr 19 at 9:33
 
@Smudge202: Well the code above is for copying. If you only want to read the first x bytes, I'd still loop round, but just read into a right-sized buffer, incrementing the index at which the read will write into the buffer appropriately on each iteration. –  Jon Skeet  Apr 19 at 9:34
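A sketch of the loop Jon describes, reading exactly the first x bytes into a right-sized buffer. The helper name, the sample file, and the header size of 4 are all made up for illustration:

```csharp
using System;
using System.IO;

class HeaderReader
{
    // Reads up to "count" bytes, advancing the index at which each Read
    // writes into the buffer, until the buffer is full or the stream ends.
    public static int ReadExactly(Stream stream, byte[] buffer, int count)
    {
        int index = 0;
        while (index < count)
        {
            int read = stream.Read(buffer, index, count - index);
            if (read == 0) break; // end of stream before "count" bytes
            index += read;
        }
        return index; // total bytes actually read
    }

    static void Main()
    {
        // Hypothetical test file for demonstration only.
        File.WriteAllBytes("sample.bin", new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 });

        byte[] header = new byte[4]; // right-sized buffer for the first x bytes
        using (Stream s = File.OpenRead("sample.bin"))
        {
            int got = ReadExactly(s, header, header.Length);
            Console.WriteLine(got); // 4
        }
    }
}
```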
 
Yup, I'm less interested in the writing part, I just wanted to confirm that the fastest way to read one file is also the fastest way to read many files. I imagined being able to P/Invoke for file pointers/offsets and from there being able to scan across multiple files with the same/less streams/buffers, which in my imaginary world of make believe, would possibly be even faster for what I want to achieve (though not applicable to the OP). If I'm not barking mad, probably best I start a new question. If I am, could you let me know so I don't waste even more peoples' time? :-) –  Smudge202  Apr 19 at 9:44
 
@Smudge202: Do you actually have a performance problem right now? Have you written the simplest code that works and found that it's too slow? Bear in mind that a lot can depend on context - reading in parallel may help if you're using solid state, but not on a normal hard disk, for example. –  Jon Skeet  Apr 19 at 10:25
 
SO is advising I take this to a chat, I think I'll actually start a new question. Thank you thus far! –  Smudge202 Apr 19 at 10:34
