InnoDB’s multi-versioning handling can be its Achilles’ heel

Peter Zaitsev | December 17, 2014 | Posted in: InnoDB, Insight for DBAs, MySQL

I believe the InnoDB storage engine architecture is great for a lot of online workloads; however, there are no silver bullets in technology, and all design choices have their trade-offs. In this blog post I’m going to talk about one important InnoDB limitation that you should consider.

InnoDB is a multiversion concurrency control (MVCC) storage engine, which means many versions of a single row can exist at the same time – in fact, there can be a huge number of such row versions. Depending on the isolation level you have chosen (under READ COMMITTED a read view is created for each statement, while under REPEATABLE READ it is created once at the start of the transaction), InnoDB might have to keep all row versions going back to the earliest active read view, but at the very least it will have to keep all versions going back to the start of the SELECT query that is currently running.
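
To make this concrete, here is a minimal two-session sketch of how a long-running consistent read forces InnoDB to keep old row versions around (it uses the sysbench table from the experiment below; the particular row is arbitrary):

-- Session 1: a long-running consistent read (REPEATABLE READ is the default)
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT AVG(k) FROM sbtest1;                 -- must see the data as of the snapshot

-- Session 2: every concurrent update creates another row version
UPDATE sbtest1 SET k = k + 1 WHERE id = 42;
COMMIT;

-- The pre-update versions cannot be purged from the undo space until
-- session 1's read view no longer needs them.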

In most cases this is not a big deal – if you have many short transactions you will have only a few row versions to deal with. If you just use the system for reporting queries but do not modify data aggressively at the same time, you also will not have many row versions. However, if you mix heavy updates with slow reporting queries running at the same time, you can get into a lot of trouble.

Consider for example an application with a hot row (something like an actively updated counter) that gets 1,000 updates per second, together with some heavy batch job that takes 1,000 seconds to run. In such a case we will have 1M versions of that single row to deal with.
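
In SQL terms the workload shape looks roughly like this (the table and column names are made up for illustration):

-- Executed ~1,000 times per second by the application (the hot row):
UPDATE counters SET hits = hits + 1 WHERE counter_id = 1;

-- Meanwhile a batch/reporting query elsewhere in the instance runs for
-- ~1,000 seconds; its read view prevents purging anything modified since
-- it started: 1,000 updates/second x 1,000 seconds = 1,000,000 versions
-- of that one row.
SELECT SUM(amount) FROM big_report_table;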

Let’s now talk about how those old row versions are stored in InnoDB: they are stored in the undo space as what is essentially a linked list, where each row version points to the previous row version, together with transaction visibility information that helps to decide which version will be visible to a given query. Such a design favors short new queries, which typically need to see one of the newer versions and so do not have to go very far down this linked list. That is not the case for reporting queries, which may need to read a rather old row version corresponding to the time the query was started, or for logical backups that use consistent reads (think mysqldump or mydumper), which often need to access such very old row versions.
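
The size of this backlog of old row versions shows up as the purge "History list length." Two ways to look at it (the information_schema counter assumes MySQL 5.6 or later with the default innodb_metrics settings):

-- In the TRANSACTIONS section of the status output:
SHOW ENGINE INNODB STATUS\G

-- Or as a counter:
SELECT * FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';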

So going through the linked list of versions is expensive, but how expensive can it get? In this case a lot depends on whether the UNDO space fits in memory, in which case the list will be traversed efficiently – or it does not, in which case you might be looking at massive disk IO. Keep in mind that the undo space is not clustered by PRIMARY KEY the way normal data in InnoDB tables is, so if you’re updating multiple rows at the same time (the typical case) you will be looking at a row-version chain spread over many pages, often with as little as one row version per page, requiring either massive IO or a large number of UNDO space pages to be present in the InnoDB buffer pool.

Where it can get even worse is an index scan. This is because indexes in InnoDB are structured to include all row versions corresponding to a key value, current and past. This means, for example, that the index entry for KEY=5 will contain pointers to all rows that either have the value 5 now or had the value 5 at some point in the past and have not been purged yet. Where it can really bite is the following: InnoDB needs to know which of the values stored for the key are visible to the current transaction – and that might mean going through all the long version chains for each of the keys.

This is all theory, so let’s see how we can simulate such workloads and see how bad things really can get in practice.

I have created a 1-billion-row "sysbench" table, which takes some 270GB of space, and I will use a small buffer pool – 6GB. I will run sysbench with 64 threads and a pareto distribution (hot rows) while running a full table scan query concurrently: select avg(k) from sbtest1. Here is the exact sysbench run done after the prepare stage.
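
For reference, the table these queries run against is the standard sysbench OLTP table – roughly the following (exact column sizes and the index name vary a little between sysbench versions, so treat the details as an approximation):

CREATE TABLE sbtest1 (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  k   INT UNSIGNED NOT NULL DEFAULT 0,
  c   CHAR(120)    NOT NULL DEFAULT '',
  pad CHAR(60)     NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k_1 (k)
) ENGINE=InnoDB;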

Here is the explain for the "reporting" query, which you would think to be a rather efficient index scan query. With just 4 bytes per value, 1 billion values would be just 4GB (really more because of InnoDB overhead) – not a big deal for modern systems:
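
The plan shape to expect here is a full scan of the secondary index on k (a covering index scan), which is exactly why the query looks cheap on paper; a sketch of checking it:

EXPLAIN SELECT AVG(k) FROM sbtest1\G
-- Expected: type: index, key: the secondary index on k, Extra: Using index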

2 days have passed and the “reporting” query is still running… furthermore the foreground workload started to look absolutely bizarre:

As you can see, we have long stretches of time when no queries are completed at all… punctuated by occasional spikes of higher performance. This is how it looks on the graph:

Corresponding CPU usage:

This shows that we are not only observing something we would expect given the InnoDB design, but also that there seems to be some severe starvation happening in this case, which we can confirm:

Waiting for a given buffer pool block to become available for more than 3 minutes is a big issue – this lock should never be held for more than a few microseconds.

SHOW PROCESSLIST confirms that even the most basic selects by primary key can get stalled for a long time:

How do I know it is an UNDO-space-related issue in this case? Because undo ends up taking the majority of the buffer pool:
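
One way to check this yourself is to break the buffer pool down by page type (a sketch; note that scanning INNODB_BUFFER_PAGE is itself expensive on a large buffer pool, so be careful running it on a busy production server):

SELECT page_type, COUNT(*) AS pages,
       ROUND(SUM(data_size)/1024/1024) AS data_mb
FROM information_schema.INNODB_BUFFER_PAGE
GROUP BY page_type
ORDER BY pages DESC;
-- A buffer pool dominated by UNDO_LOG pages rather than INDEX pages
-- is the signature of this problem.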

And it does so in a very crazy way – when there is almost no work being done, the UNDO_LOG contents of the buffer pool grow very rapidly, while when we are getting some work done, the INDEX-type pages take up a lot more space. To me this looks like the index scan, as it progresses, touches some hot rows and some not-so-hot ones, which contain fewer row versions and so do not put much pressure on the undo space.

Takeaway: Now you might argue that this particular workload and situation is rather artificial and narrow. It well might be. My main point here is this: what if you are looking at just part of your workload, such as your main short application queries, and not taking reporting or backups into account "because their performance is not important"? In that case you might be in for a big surprise. Those background activities might be taking much longer than you would expect, and in addition they might have a much more severe impact on your main application workload, as in the case above.

P.S.: I’ve done more experiments to validate how bad the problem really is, and I can repeat it rather easily even without putting the system into overdrive. Even when I run sysbench injecting just 25% of the transactions the system is capable of handling at peak, the "select avg(k) from sbtest1" query on the 1-billion-row table never completes – it looks like new entries are injected into the index faster than InnoDB can examine which of them are visible.
