Linux memory manager and your big data

Disclaimer: In our experience, when we run into an issue and think it's the operating system, 99% of the time it turns out to be something else. We therefore caution against assuming the problem lies with your operating system, unless your use-case closely matches the following example.

It all started with one of our customers reporting performance issues with their CitusDB cluster. This customer designed their cluster so that their working set would fit into memory, but their query run-times showed every indication that the queries were hitting disk. This naturally slowed their queries down by 10-100x.

We started looking into this problem by first examining CitusDB's query distribution mechanism and then by checking the PostgreSQL instances on the machines. We found that neither was the culprit here, and came up with the following observations:

  1. The customer's working set was one day's worth of query logs. Once they were done looking at a particular day, they started querying the next day's data.
  2. Their queries involved mostly sequential I/O. They didn't use indexes a lot.
  3. A day's data occupied more than 60% of the memory on each node (but way less than total available memory). They didn't have anything else using memory on their instances.

Our assumption going into this was that since each day's data easily fit into RAM, the Linux memory manager would eventually bring that day's data into the page cache. Once the customer started querying the next day's data (and only the next day's data), the new data would replace the old in the page cache. At least, that's what a simple cache with an LRU eviction policy would do.

It turns out LRU has two shortcomings when used as a page replacement algorithm. First, an exact LRU implementation is too costly in this context. Second, the memory manager needs to account for frequency as well, so that a large file read doesn't evict the entire cache. Therefore, Linux uses a more sophisticated algorithm than plain LRU, and that algorithm doesn't play well with the workload we just described.
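To see that second shortcoming in action, here's a minimal Python sketch of a strict LRU cache (our illustration, not kernel code): a single sequential scan over a large file evicts the entire hot working set.

from collections import OrderedDict

# A toy page cache with strict LRU eviction -- not how Linux works.
CAPACITY = 100
cache = OrderedDict()

def access(page):
    if page in cache:
        cache.move_to_end(page)        # mark page as most recently used
    else:
        if len(cache) >= CAPACITY:
            cache.popitem(last=False)  # evict the least recently used page
        cache[page] = True

for page in range(CAPACITY):           # a hot working set that fits in cache
    access(("hot", page))
for page in range(10 * CAPACITY):      # one large sequential read
    access(("scan", page))

print(sum(page[0] == "hot" for page in cache))   # prints 0

Under strict LRU, one pass over a big file leaves nothing of the working set behind, which is exactly why the kernel also tracks how often pages are referenced.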

To put things into an example, let's assume that you have a kernel newer than 2.6.31 (released in 2009) and that you're using an m2.4xlarge EC2 instance with 68 GB of memory. Let's also say that you have two days' worth of clickstream data. Each day's file takes up more than 60% of available memory, but individually each one fits easily into RAM.

$ ls -lh clickstream.csv.*
-rw-rw-r-- 1 ec2-user ec2-user 42G Nov 25 19:45 clickstream.csv.1
-rw-rw-r-- 1 ec2-user ec2-user 42G Nov 25 19:47 clickstream.csv.2

Now, let's bring the first day's data into memory by running a "word count" command on the clickstream file twice. Note the time difference between the two runs: the first time we run the command, the Linux memory manager brings the file's pages into the page cache; on the second run, everything gets served from memory.

$ time wc -l clickstream.csv.1 
336006288 clickstream.csv.1

real	10m4.575s
...

$ time wc -l clickstream.csv.1 
336006288 clickstream.csv.1

real	0m18.858s

Then, let's switch over to the second day's clickstream file. We again run the word count command multiple times to bring the file into memory. An LRU-like policy would evict the first day's data after several runs and bring the second day's data into memory. Unfortunately, in this case, no matter how many times you access the second file, the Linux memory manager will never bring it into the page cache.

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	9m50.542s

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	9m52.265s

In fact, if you run into this scenario, the only way to bring the second day's data into memory is by manually flushing the page cache. Writing 1 to drop_caches asks the kernel to drop clean page cache pages; dirty pages are not thrown away. Obviously, this cure might be worse than the disease, but for our little experiment, it helps.

$ echo 1 | sudo tee /proc/sys/vm/drop_caches
1

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	9m51.906s

$ time wc -l clickstream.csv.2
336027448 clickstream.csv.2

real	0m17.874s

Taking a step back, the problem here lies with how Linux manages its page cache. The Linux memory manager keeps cached filesystem pages in two lists. One list holds recently accessed pages (the recency list, called the inactive list in the kernel), and the other holds pages that have been referenced multiple times (the frequency list, called the active list).

In current kernel versions, the memory manager splits available memory evenly between these two lists as a trade-off between protecting frequently used pages and detecting recently used ones. In other words, the kernel reserves 50% of available memory for the frequency list.
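You can see these two lists on a live system: the kernel reports their sizes as Active(file) and Inactive(file) in /proc/meminfo. (The numbers below are illustrative, not from the experiment above.)

$ grep -E 'Active\(file\)|Inactive\(file\)' /proc/meminfo
Active(file):   34613248 kB
Inactive(file): 34201600 kB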

In the previous example, both lists start out empty. On first reference, the first day's pages go into the recency list; on the second reference, they get promoted to the frequency list.

Next, when the user wants to work on the second day's data, that file is larger than the recency list, which is capped at 50% of available memory. Sequential scans over the file therefore result in thrashing: each filesystem page in the second file makes it into the recency list, but gets kicked out by later pages from the same scan before it is referenced a second time. As a result, no page in the second file stays in the recency list long enough for its reference count to get incremented, so nothing is ever promoted to the frequency list.
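To make the thrashing concrete, here's a toy Python model of the two-list scheme (our simplification; the real kernel differs in many details). Memory holds 100 pages, each day's file occupies 60 pages, and the frequency list is capped at half of memory, mirroring the 50% split described above.

from collections import OrderedDict

class TwoListCache:
    # Pages enter the recency (inactive) list on first reference; a second
    # reference promotes them to the frequency (active) list, which is
    # kept at no more than half of memory.
    def __init__(self, capacity):
        self.capacity = capacity
        self.active_cap = capacity // 2
        self.inactive = OrderedDict()   # recency list
        self.active = OrderedDict()     # frequency list

    def access(self, page):
        hit = page in self.inactive or page in self.active
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            del self.inactive[page]     # second reference: promote
            self.active[page] = True
            while len(self.active) > self.active_cap:
                demoted, _ = self.active.popitem(last=False)
                self.inactive[demoted] = True
        else:
            # New page; evict from the recency list only under memory pressure.
            if len(self.inactive) + len(self.active) >= self.capacity:
                self.inactive.popitem(last=False)
            self.inactive[page] = True
        return hit

def scan(cache, name, pages=60):        # one sequential pass over a file
    return sum(cache.access((name, p)) for p in range(pages))

cache = TwoListCache(capacity=100)
for run in (1, 2, 3):
    print("clickstream.csv.1 run", run, "hits:", scan(cache, "clickstream.csv.1"))
for run in (1, 2, 3):
    print("clickstream.csv.2 run", run, "hits:", scan(cache, "clickstream.csv.2"))

# clickstream.csv.1 run 1 hits: 0
# clickstream.csv.1 run 2 hits: 60
# clickstream.csv.1 run 3 hits: 60
# clickstream.csv.2 run 1 hits: 0
# clickstream.csv.2 run 2 hits: 0
# clickstream.csv.2 run 3 hits: 0

Day one's pages get promoted on their second pass and stay cached, while day two's pages are always evicted from the 50-page recency list before their second reference arrives, so every scan over day two misses on all 60 pages, matching the wc timings above.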

Fortunately, this issue occurs only when all three of the conditions we outlined above hold (which is rare), and it's being fixed as we speak. If you're interested, you can read more about the original problem report and the proposed fix on the Linux kernel mailing lists.

For us, the really neat part was how easy it was to identify the problem. Since Citus extends PostgreSQL, once we saw the issue, we could quickly reproduce it on Postgres. We then posted our findings to the Linux mailing lists, and the community took over from there.

Got comments? Join the discussion on Hacker News.
