Kernel development

最新推荐文章于 2024-08-02 06:21:08 发布

u010154760

最新推荐文章于 2024-08-02 06:21:08 发布

阅读量929

点赞数

分类专栏： 2015年4月 linux 内存文章标签： linux内存 anon_vma_chain

2015年4月同时被 2 个专栏收录

180 篇文章 0 订阅

订阅专栏

linux 内存

20 篇文章 0 订阅

订阅专栏

Brief items

Kernel release status

The current development kernel is 2.6.34-rc4 , released on April 12, about a week later than would have been expected. The delay was the result of a nasty VM regression (see below). All told, some 500 fixes have been merged since -rc3; see the announcement for the short-form changelog, or see the full changelog for all the details.

There have been no stable updates released in the last week.

Comments (none posted)

Quotes of the week

Hey, all my other theories made sense too.. They just didn't work.

But as Edison said: I didn't fail, I just found three other ways to not fix your bug.

-- Linus Torvalds

To be honest I think 4K stack simply has to go. I tend to call it "russian roulette" mode.

It was just a old workaround for a very old buggy VM that couldn't free 8K pages and the VM is a lot better at that now. And the general trend is to more complex code everywhere, so 4K stacks become more and more hazardous. It was a bad idea back then and is still a bad idea, getting worse and worse with each MLOC being added to the kernel each year.

-- Andi Kleen

Comments (7 posted)

Idle cycle injection

By Jonathan Corbet
April 14, 2010

When Google's Mike Waychison addressed the 2009 Kernel Summit , one of the goals he laid out was the merging of Google's idle cycle injection code into the mainline. Idle cycle injection is the forced idling of the CPU to avoid overheating; essentially, it is Google's way of running processors to the very edge of their capability without going past that edge and allowing the smoke to escape. This sort of power management is certainly not a Google-specific problem, so it makes sense to get the code upstream. Salman Qazi's recently posted kidled patch series shows the current form of this work.

The core idea is simple: through some new control files under /proc/sys/kernel/kidled, the system administrator can set, on a per-CPU basis, the percentage of time that the CPU should be idle and an interval over which that percentage is calculated. If the end of an interval draws near and the CPU has not been naturally idle for the requisite time, kidled will force the processor to go idle for a while.

Naturally enough, there are some complications. The first is that it would be nice to avoid forcing idle cycles when important processes are running. So kidled includes the notion of "eager cycle injection." By way of the control group mechanism, processes can be marked as being "interactive." When so-marked processes are not running, kidled will try to get its forced idle cycles in early. When interactive processes are running, instead, idle cycles will be forced only when strictly necessary. In this way, "interactive" processes will not be impeded by idle cycle injection except when there is no alternative.

The other twist has to do with the accounting of idle CPU time. The injection of idle cycles takes CPU time away from somebody; the kidled code allows the administrator to say who the victims should be. There is another control group parameter which controls the "power capping priority" of each process. When idle cycles are injected, kidled will mess around in the scheduler's data structures, causing processes with lower priorities to be charged for the idle time. That means that, when CPU usage must be throttled, specific processes can be made to suffer more than others.

As of this writing, there has been little public discussion of the patches. The core concept is not controversial, but it will be interesting to see how the scheduler-related parts of the series are received.

Comments (10 posted)

Kernel development news

ELC: Status of embedded Linux

By Jake Edge
April 14, 2010

Embedded Linux Conference (ELC) organizer Tim Bird surveyed the embedded Linux landscape in a talk he gave at the conference. He looked at new and proposed kernel features that embedded developers might be interested in as well as issuing a "call to arms" to those developers to get more involved with the rest of the community. This talk is a regular feature at ELC to help the embedded community stay on top of the "ripping" speed of kernel development.

There have been four new kernels since last year's conference, and Bird listed the interesting features for the embedded space in each of 2.6.30-33, as well as noting that LogFS had finally made it into the kernel in 2.6.34, something that he was concerned might not ever happen. The speed of kernel development is amazing, he said, and the great thing about it is that "even while I am sleeping in my bed, people are pounding away on it".

He pointed to a few "patches to watch" that may be coming in new kernels, specifically the kbuild CROSS_COMPILE option, which will make it easier to build for multiple architectures. He also noted Arnd Bergmann's asm-generic patches that are geared towards making it easier to add new architectures to the kernel—without propagating the bugs and quirks from existing ones.

Boot speed

Bird then looked at different "technology areas" to point out interesting features or work going on in those areas. Boot time is a "hot topic" right now; it would have been in the past if the embedded community was more involved in mainline kernel development. The Moblin five second boot effort really kickstarted that work. He noted that he has a Sony (his employer) video camera that boots Linux in 1.5 seconds; "I'm very proud of that", he said.

Several new kernel features are available to help reduce boot time, including asynchronous function calls, which allow some parts of device initialization to run in parallel. There is also scripts/bootgraph.pl to help visualize where boot time is being spent.

Devtmpfs was also noted as a way to decrease boot times, with some seeing a 0.6 second reduction on desktops. Bird said that there needs to be some testing done on the embedded side to see how much it can help there. He also listed two patches that speed up symbol resolution for module loading by getting rid of the current linear search. One switches to a binary search and the other uses a hash table. For Bird's use cases, he always statically links in drivers, but has heard that more embedded developers are going the loadable module route.

Greg Kroah-Hartman piped up that he needed one of those two patches for MeeGo, but that the submitters had disappeared. There was general agreement that contacting them and getting something upstream would be good.

Filesystems

Several different filesystems for embedded use cases were listed by Bird. Squashfs has been out of the mainline for years, but was merged in 2.6.29, and has since been improved by others in a "classic case" demonstrating the advantages of mainline code. Ubifs is also in the mainline and folks at Toshiba have been characterizing its performance, which they reported on at the CE Linux Forum (CELF) Japan Jamboree. It has "really slow mount times" in some cases, which CELF would like to fund someone to fix.

LogFS is "way better optimized" for certain flash devices and has fast mount times, he said. He noted that AXFS, the advanced execute-in-place (XIP) filesystem, had kind of disappeared, so it didn't appear to be on track for mainlining. He has been playing with AXFS at Sony to try to further decrease boot time.

Bird also noted that the VFAT patent avoidance patches had not made it into the mainline. It would be useful for some embedded devices, he said. Most embedded developers work around the patent by disabling VFAT and using 8.3 filenames, which is somewhat unfortunate. Another thing he is keeping an eye on is VFS-based union mounts, which would allow embedded developers to stop creating "filesystems with weird links" between them as is currently common.

Power management and realtime

The runtime power management code has been merged, which will allow suspending and resuming individual system components to reduce power consumption. There is ongoing work on asynchronous suspend/resume, which Bird said he didn't know very much about, but it's "gotta be really cool". An audience member helped out by saying that it is in some ways like the asynchronous initialization code (for faster boot), but "in the other direction".

The RT_PREEMPT patchset "continues its slow march into the kernel", with threaded interrupt handlers being merged in 2.6.30 and preparatory work for the future sleeping spinlocks merge that went into 2.6.33. There are still some big kernel lock issues (BKL) to be resolved and CELF may fund some work in that area.

Kernel size

The slide for kernel size and memory use had a picture of a "hybrid Winnebago", which is the image Bird has of the kernel today. It just keeps growing in size. To help embedded developers make better use of limited memory, there is the smem tool that was funded by CELF. He has used it in a few projects this year and it "has been very helpful".

Various compression methods have been added to compress the kernel image in different ways. LZMA can be up to 30% better than gzip, and LZO is not as good at compression, but is much faster. There are tradeoffs dependent on processor speed and I/O bandwidth that make it more difficult to pick the right compression method, as Dirk Hohndel pointed out.

The ramzswap device (also known as compcache) allows in-memory compressed swap. It is "really cool" but the maintainer only was able to benchmark on desktop systems. It would be good if someone could do some benchmarking on embedded systems, Bird said.

Tracing and security

Ftrace now has support for dynamic probes that came in 2.6.33, and the perf tool can place and use those probes as well. There is tracing of kernel variable access and modification available now. The perf "diff" mode can show the performance differences between two runs, and also came in 2.6.33.

The TOMOYO merge in 2.6.30 was "a big deal" because it finally was able to get path-based security into the kernel. NTT Data is now adding TOMOYO rules to Android. Bird is in favor of a diversity of choices for security as it gives people a chance to demonstrate which is the best solution for various use cases. As part of that, CELF funded astudy [PDF] for applying the Smack security module in a television use case and found that the overhead was higher than they expected.

CELF contract work

CELF has funded various projects over the last year including smem, out-of-memory notifications in cgroups, SquashFS, the Smack analysis, device trees for ARM, and the -ffunction-sections work to put each kernel function in its own section to assist with dead code removal. Going forward, CELF has an open project proposal plan that will start funding new projects in the next few weeks. It is also sponsoring Matt Mackall to be one of the two Linux kernel embedded maintainers (David Woodhouse is the other).

A call to arms

Bird ended his talk with a list of things that embedded developers can do to work better with the community. At the top of that list was "work at top of tree". He realized that when he gives these talks, he is generally talking to people who aren't using the kernels he talks about because embedded folks tend to pick a kernel and stick with it. "It's difficult" to work with the most recent kernel, but it's worth it. "Version gap is the single biggest problem" in embedded Linux. He suggested that embedded developers beat up on their board vendor to get board support packages using the latest kernels and to do their testing on boards that are already supported in the mainline.

The other suggestion he had was not to "wait for others to test new features", and instead to do the testing themselves. He listed a number of things that need testing in the mainline: LogFS, Ubifs mount times, ramzswap, runtime power management, and so on. "Post the results to the elinux.org wiki" or come to the next conference (October 27-28 in Cambridge, UK) and tell him about it.

Comments (2 posted)

The case of the overly anonymous anon_vma

By Jonathan Corbet
April 13, 2010

During the stabilization phase of the kernel development cycle, the -rc releases typically happen about once every week. 2.6.34-rc4 is a clear exception to that rule, coming nearly two weeks after the preceding -rc3 release. The holdup in this case was a nasty regression which occupied a number of kernel developers nearly full time for days. The hunt for this bug is a classic story of what can happen when the code gets too complex.

Sending email to linux-kernel can be an intimidating prospect for a number of reasons, one of which being that one never knows when a massive thread - involving hundreds of messages copied back to the original sender - might result. Borislav Petkov's 2.6.34-rc3 bug report was one such posting. In this case, though, the ensuing thread was in no way inflammatory; it represents, instead, some of the most intensive head-scratching which has been seen on the list for a while.

The bug, as reported by Borislav, was a null pointer dereference which would happen reasonably reliably after hibernating (and restarting) the system. It was quickly recognized as being the same as another bug report filed the same day by Steinar H. Gunderson, though this one did not involve hibernation. The common thread was null pointer dereferences provoked by memory pressure. The offending patch was identified by Linus almost immediately; it's worth taking a look at what that patch did.

Way back in 2004, LWN covered the addition of the anon_vma code; this patch was controversial at the time because the upcoming 2.6.7 kernel was still expected to be an old-style "stable, no new features" release. This patch, a 40-part series which fundamentally reworked the virtual memory subsystem, was not seen as stable material, despite Linus's attempt to characterize it as an "implementation detail." Still, over time, this code has proved solid and has not been changed significantly since - until now.

The problem solved by anon_vma was that of locating all vm_area_struct (VMA) structures which reference a given anonymous (heap or stack memory) page. Anonymous pages are not normally shared between processes, but every call to fork() will cause all such pages to be shared between the parent and the new child; that sharing will only be broken when one of the processes writes to the page, causing a copy-on-write (COW) operation to take place. Many pages are never written, so the kernel must be able to locate multiple VMAs which reference a given anonymous page. Otherwise, it would not be able to unmap the page, meaning that the page could not be swapped out.

The reverse mapping solution originally used in 2.6 proved to be far too expensive, necessitating a rewrite. This rewrite introduced the anon_vma structure, which heads up a linked list of all VMAs which might reference a given page. So a fork() also causes every VMA in the child process which contains anonymous pages to be added to a the list maintained in the parent's anon_vma structure. The mapping pointer in struct page points to the anon_vma structure, allowing the kernel to traverse the list and find all of the relevant VMA structures.

This diagram, from the 2004 article, shows how this data structure looks:

This solution scaled far better than its predecessor, but eventually the world caught up. So Rik van Riel set out to make things faster, writing this patch, which was merged for 2.6.34. Rik describes the problem this way:

In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page.(个人标注，注意这段话中所描述的情况，就是老的版本中使用anon_vma将所有的vm_area_struct通过anon_vma_node连接起来的情况，这种情况下，可能出现一个进程fork了1000个子进程,每个子进程COPY了一个包含1000个匿名页面的VMA，刚开始的时候这1000个vma在同一个anon_vma表中，然后全部进行COW(写时复制)，注意，这是并没有改变这个anon_vma链表，所以说COW是不会改变anon_vma链表的)

注：我们以一页为例，如果fork1000个子进程后，该开始的vma中的匿名页都是一样的，这时候有一个子进程进行了COW，则产生一个不同的page，其实这个时候这个page所指向的anon_vma没有变化，因为这个page所在的vma没有改变，而每个vma只会存在在一条anon_vma链中，所以它虽然进行了COW，但是不会改变所指向的anon_vma,这也就是上面所说的问题出现的原因。

Essentially, by organizing all anonymous pages which originated in the same parent under the same anon_vma structure, the kernel created a monster data structure which it had to traverse every time it needed to reverse-map a page. That led to the kernel scanning large numbers of VMAs which could not possibly reference the page, all while holding locks. The result, says Rik, was "catastrophic failure" when running the AIM benchmark.

Rik's solution was to create an anon_vma structure for each process and to link those together instead of the VMA structures. This linking is done with a new structure called anon_vma_chain:

    struct anon_vma_chain {
	struct vm_area_struct *vma;
	struct anon_vma *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
    };

Each anon_vma_chain entry (AVC) maintains two lists: all anon_vma structures relevant to a given vma (same_vma), and all VMAs which fall within the area covered by a given anon_vma structure (same_anon_vma). It gets complicated, so some diagrams might help. Initially, we have a single process with one anonymous VMA:

个人标注：从下面的图中的连接线和上面的定义可以看出，same_vma连接的是和给定的vma相关的所有anon_vma结构，使用红线连接；而same_anon_vma连接的是被一个给定的anon_vma覆盖的所有vmas,使用蓝线连接；

其实所有这种AV结构和vmas结构的连接都是通过AVC进行连接的，之所以说红线连接的是anon_vma结构，是因为红线形成的环中出了vma外就是AVC，而每个AVC必定和一个AV相连，这个环中的vma就是定义中所得给定的vma，而AV结构就是通过环中的AVC连接起来的；

同理，蓝色环中肯定只包含一个AV，其余的都是AVC，每个AVC连接一个不同的VMA，这个AC就是定义中指定的给定的anon_vma，而其覆盖的所有vmas就是通过AVC连接起来的；

注意，这里只是着重介绍AVC结构，事实上VMA也有一个指针指向AV；可以参考http://www.bubuko.com/infodetail-452207.html和vm_area_struct结构体中的定义；

Here, "AV" is the anon_vma structure, and "AVC" is the anon_vma_chain structure seen above. The AVC links to both the anon_vmaand VMA structures through direct pointers. The (blue) linked list pointer is the same_anon_vma list, while the (red) pointer is the same_vma list. So far, so simple.

Imagine now that this process forks, causing the VMA to be copied in the child; initially we have a lonely new VMA like this:

The kernel needs to link this VMA to the parent's anon_vma structure; that requires the addition of a new anon_vma_chain:

Note that the new AVC has been added to the blue list of all VMAs referencing a given anon_vma structure. The new VMA also needs its own anon_vma, though:

Now there's yet another anon_vma_chain structure linking in the new anon_vma. The new red list has been expanded to contain all of the AVCs which reference relevant anon_vma structures. As your editor said, it gets complicated; the diagram for the 1000-child scenario which motivated this patch will be left as an exercise for the reader.

When the fork() happens, all of the anonymous pages in the area point back to the parent's anon_vma structure. Whenever the child writes to a page and causes a copy-on-write, though, the new page will map back to the child's anon_vmastructure instead. Now, reverse-mapping that page can be done immediately, with no need to scan through any other processes in the hierarchy. That makes the lock contention go away, making benchmarkers happy.

The only problem is that embarrassing oops issue. Linus, Rik, Borislav, and others chased after it, trying no end of changes. For a while, it seemed that a bug causing excessive reuse of anon_vma structures when VMAs were merged could be the problem, but fixing the bug did not fix this oops. Sometimes, changing VMA boundaries with mprotect() could cause the wrong anon_vma to be used, but fixing that one didn't help either. The reordering of chains when they were copied was also noted as a problem...but it wasn't the problem.

Linus was clearly beginning to wonder when it might all end: "Three independent bugs found and fixed, and still no joy?" He repeatedly considered just reverting the change outright, but he was reluctant to do so; the solution seemed so tantalizingly close. Eventually he developed another hypothesis which seemed plausible. An anonymous page shared between parent and child would initially point to the parent's anon_vma:

But, if both processes were to unmap the page (as could happen during system hibernation, for example), then the child referenced it first, it could end up pointing to the child's anon_vma instead:

If the parent mapped the page later, then the child unmapped it (by exiting, perhaps), the parent would be left with an anonymous page pointing to the child's anon_vma - which no longer exists:

Needless to say, that is a situation which is unlikely to lead to anything good in the near future.

The fix is straightforward; when linking an existing page to an anon_vma structure, the kernel needs to pick the one which is highest in the process hierarchy; that guarantees that the anon_vma will not go away prematurely. Early testing suggests that the problem has indeed been fixed. In the process, three other problems have been fixed and Linus has come to understand a tricky bit of code which, if he has his way, will soon gain some improved documentation. In other words, it would appear to be an outcome worth waiting for.

Comments (34 posted)

Patches and updates

Kernel trees