One of my recent projects involved scraping some web data for offline processing. I started with the excellent request library by Mikeal Rogers, which offers a number of convenient improvements over the default Node http library.
As soon as I unleashed my first prototype on the web, the database started growing much faster than I had planned. I had started out storing raw, uncompressed response data, so an immediate optimization was to use the Accept-Encoding HTTP request header to ask servers for compressed data.
Unfortunately, some of my target servers sometimes sent back uncompressed data anyway (which they're entitled to do under the HTTP spec; it's just slightly annoying). I needed a way to conditionally handle compressed data based on the Content-Encoding response header. I found a solution that worked with the default Node.js HTTP library, but it wasn't immediately obvious how to port it to Mikeal's request library.
Approach 1: no streams
My first solution collected data chunks into a Buffer, then passed that into the relevant zlib functions if needed. It’s more code than I wanted, but it works well.
Note: for simplicity, I’ve left out the logic that writes the compressed response body to the database.
Approach 2: streams
The downside to the first approach is that all response data is buffered in memory. This was fine for my use case, but in general this can cause memory issues if you’re scraping websites with really large response bodies.
A better approach is to use streams, as Mikeal suggested. Streams are a wonderful abstraction that can help you manage memory consumption better, among other things. There are two great introductions to Node streams here and here. Keep in mind that streams in Node.js are somewhat intricate and still evolving (for example, Node 0.10 introduced streams2 which is not entirely backwards compatible with older versions of Node).
Here’s a working solution that pipes response data into a zlib stream, then pipes that into a final destination (a file, in this case). Notice that the code is cleaner and more readable.
Summary
Both of those approaches will get the job done with Mikeal’s library, and the one you choose depends on the use case. In my project, I needed to save the compressed response data as a field of a Mongoose document, then further process the decompressed data. Streams don’t suit this use case well, so I used the first approach.