Disclaimer: Whenever we hit an issue and suspect the operating system, 99% of the time it turns out to be something else. We therefore caution against assuming the problem lies with your operating system, unless your use case closely matches the following example.
It all started with one of our customers reporting performance issues with their CitusDB cluster. This customer designed their cluster so that their working set would fit into memory, but their query run times showed every indication that their queries were hitting disk. This naturally increased their query run times by 10-100x.
We started looking into this problem by first examining CitusDB's query distribution mechanism and then by checking the PostgreSQL instances on the machines. We found that neither was the culprit here, and came up with the following observations:
- The customer's working set was one day's worth of query logs. Once they were done looking at a particular day, they started querying the next day's data.
- Their queries involved mostly sequential I/O. They didn't use indexes a lot.
- A day's data occupied more than 60% of the memory on each node (but way less than total available memory). They didn't have anything else using memory on their instances.
Our assumption going into this was that since each day's data easily fit into RAM, the Linux memory manager would eventually bring that day's data into the page cache. Once the customer started querying the next day's data (and only the next day's data), the new data would come into the page cache. At least, this is what a simple cache using the LRU eviction policy would do.
It turns out LRU has two shortcomings when used as a page replacement algorithm. First, an exact LRU implementation is too costly in this context. Second, the memory manager needs to account for frequency as well, so that a large file read doesn't evict the entire cache. Therefore, Linux uses a more sophisticated algorithm than LRU, and that algorithm doesn't play well with the workload we just described.
To make this concrete, let's assume that you have a kernel newer than 2.6.31 (released in 2009) and that you're using an m2.4xlarge EC2 instance with 68 GB of memory. Let's also say that you have two days' worth of clickstream data. Each day's data takes up more than 60% of available memory, but each file individually fits easily into RAM.
```
$ ls -lh clickstream.csv.*
-rw-rw-r-- ec2-user ec2-user 42G Nov 25 19:45 clickstream.csv.1
-rw-rw-r-- ec2-user ec2-user 42G Nov 25 19:47 clickstream.csv.2
```
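If you don't have clickstream logs of your own, you can reproduce the setup with two synthetic files of similar size. This is just a sketch: the sizes assume the 68 GB instance above, and the file names simply mirror the listing.

```
# Hypothetical setup: write two ~42 GB files, each larger than half of
# RAM but comfortably smaller than total memory. The contents don't
# matter for this experiment; only the file sizes do.
$ dd if=/dev/zero of=clickstream.csv.1 bs=1M count=43008
$ dd if=/dev/zero of=clickstream.csv.2 bs=1M count=43008
```

Note that wc -l will report zero lines for these zero-filled files, but it still reads every page, which is all the experiment needs.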
Now, let's bring in the first day's data to memory by running the "word count" command on the clickstream file several times. Note the time difference between these two runs. The first time we run the command, the Linux memory manager brings the file's pages into the page cache. On the next run, everything gets served from memory.
```
$ time wc -l clickstream.csv.1
336006288 clickstream.csv.1

real    10m4.575s
...

$ time wc -l clickstream.csv.1
336006288 clickstream.csv.1

real    0m18.858s
```
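To double-check that the speedup really comes from the page cache, you can watch the kernel's cached-memory counter grow by roughly the file's size during the first run (the exact numbers depend on your system, so we omit them here):

```
# File-backed page cache size, in kB; compare before and after the first run.
$ grep ^Cached /proc/meminfo
```

If you have the third-party vmtouch utility installed, running it against clickstream.csv.1 will also report how many of that specific file's pages are resident in memory.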
Then, let's switch over to the second day's clickstream file. We again run the word count command multiple times to bring the file into memory. An LRU-like policy here would evict the first day's data after several runs, and bring the second day's data into memory. Unfortunately, no matter how many times you access the second file in this case, the Linux memory manager will never bring it into memory.
```
$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real    9m50.542s

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real    9m52.265s
```
In fact, if you run into this scenario, the only way to bring the second day's data into memory is by manually flushing the page cache. Obviously, this cure might be worse than the disease, but for our little experiment, it helps.
```
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
1

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real    9m51.906s

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real    0m17.874s
```
Taking a step back, the problem here lies with how Linux manages its page cache. The Linux memory manager keeps cached filesystem pages in two types of lists. One list holds recently accessed pages (recency list), and the other one holds pages that have been referenced multiple times (frequency list).
In current kernel versions, the memory manager splits available memory evenly between these two lists to strike a trade-off between protecting frequently used pages and detecting recently used ones. In other words, the kernel reserves 50% of available memory for the frequency list.
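You can actually observe these two lists on a running system: the kernel reports their sizes in /proc/meminfo on any kernel with the split LRU, which includes every version relevant to this discussion.

```
# Inactive(file) is the recency list; Active(file) is the frequency list.
$ grep -E '^(Active|Inactive)\(file\)' /proc/meminfo
```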
In the previous example, both lists start out empty. When referenced, the first day's pages first go into the recency list. On the second reference, they get promoted to the frequency list.
Next, when the user wants to work on the second day's data, that file is larger than what the recency list can hold (50% of available memory). Therefore, sequential scans over the file result in thrashing. Each filesystem page from the second file makes it into the recency list, but gets kicked out once the recency list fills up. As a result, no page from the second file stays in the recency list long enough for its reference count to get incremented and promote it to the frequency list.
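If you want to watch the thrashing happen, a rough sketch is to rerun the scan of the second file while sampling those same counters from another terminal. You should see Inactive(file) churn while Active(file) stays pinned near half of memory, still holding the first day's pages.

```
# Terminal 1: scan the second day's file again.
$ wc -l clickstream.csv.2

# Terminal 2: sample the list sizes every few seconds while the scan runs.
$ while true; do
>   grep -E '^(Active|Inactive)\(file\)' /proc/meminfo
>   sleep 5
> done
```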
Fortunately, this issue occurs only when all three of the conditions we outlined above hold (which is very infrequent), and it's getting fixed as we speak. If you're interested, you can read more about the original problem report and the proposed fix on the Linux kernel mailing lists.
For us, the really neat part was how easy it was to identify the problem. Since Citus extends PostgreSQL, once we saw the issue, we could quickly reproduce it on Postgres. We then posted our findings to the Linux mailing lists, and the community took over from there.
Got comments? Join the discussion on Hacker News.