Operating Systems: Three Easy Pieces --- Lock-Based Concurrent Data Structures (Note)

Before moving beyond locks, we will first describe how to use locks in some common data structures. Adding locks to a data structure to make it usable by threads makes the structure thread safe. Of course, exactly how such locks are added determines both the correctness and performance of the data structure. And thus, our challenge:

          CRUX: How To Add Locks To Data Structures

When given a particular data structure, how should we add locks to it, in order to make it work correctly? Further, how do we add locks such that the data structure yields high performance, enabling many threads to access the structure at once, i.e., concurrently?

Of course, we will be hard pressed to cover all data structures or all methods for adding concurrency, as this is a topic that has been studied for years, with literally thousands of research papers published about it. Thus, we hope to provide a sufficient introduction to the type of thinking required, and refer you to some good sources of material for further inquiry on your own. We found Moir and Shavit's survey to be a great source of information.

              Concurrent Counters

One of the simplest data structures is a counter. It is a structure that is commonly used and has a simple interface. We define a simple non-concurrent counter in Figure 29.1.
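Figure 29.1 is not reproduced in this note; a minimal sketch of such a non-concurrent counter, along the lines of the book's C interface, might be:

    typedef struct __counter_t {
        int value;
    } counter_t;

    void init(counter_t *c) {
        c->value = 0;
    }

    void increment(counter_t *c) {
        c->value++;
    }

    void decrement(counter_t *c) {
        c->value--;
    }

    int get(counter_t *c) {
        return c->value;
    }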

              Simple But Not Scalable

As you can see, the non-synchronized counter is a trivial data structure, requiring a tiny amount of code to implement. We now have our next challenge: how can we make this code thread safe? Figure 29.2 shows how we do so.
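Figure 29.2 is likewise not reproduced here; a minimal sketch, assuming a single pthread mutex guards the whole structure, could be:

    #include <pthread.h>

    typedef struct __counter_t {
        int             value;
        pthread_mutex_t lock;
    } counter_t;

    void init(counter_t *c) {
        c->value = 0;
        pthread_mutex_init(&c->lock, NULL);
    }

    void increment(counter_t *c) {
        pthread_mutex_lock(&c->lock);   // acquire on entry to the routine...
        c->value++;
        pthread_mutex_unlock(&c->lock); // ...release on exit
    }

    void decrement(counter_t *c) {
        pthread_mutex_lock(&c->lock);
        c->value--;
        pthread_mutex_unlock(&c->lock);
    }

    int get(counter_t *c) {
        pthread_mutex_lock(&c->lock);
        int rc = c->value;
        pthread_mutex_unlock(&c->lock);
        return rc;
    }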

This concurrent counter is simple and works correctly. In fact, it follows a design pattern common to the simplest and most basic concurrent data structures: it simply adds a single lock, which is acquired when calling a routine that manipulates the data structure, and is released when returning from the call. In this manner, it is similar to a data structure built with monitors, where locks are acquired and released automatically as you call and return from object methods.

At this point, you have a working concurrent data structure. The problem you might have is performance. If your data structure is too slow, you will have to do more than just add a single lock; such optimizations, if needed, are thus the topic of the rest of the chapter. Note that if the data structure is not too slow, you are done! No need to do something fancy if something simple will work.

To understand the performance costs of the simple approach, we run a benchmark in which each thread updates a single shared counter a fixed number of times; we then vary the number of threads. Figure 29.3 shows the total time taken, with one to four threads active; each thread updates the counter one million times. This experiment was run upon an iMac with four Intel 2.7 GHz i5 CPUs; with more CPUs active, we hope to get more total work done per unit time.
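The chapter does not show the benchmark harness itself; a minimal sketch of such a harness, reusing the locked counter above, might look like this (the names worker, NUMTHREADS, and ITERATIONS, and the gettimeofday-based timing, are assumptions):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    // Assumes the locked counter_t (init/increment/get) sketched above.
    #define NUMTHREADS 4        // assumed: varied from 1 to 4 in the text
    #define ITERATIONS 1000000  // one million updates per thread

    static counter_t counter;

    void *worker(void *arg) {
        for (int i = 0; i < ITERATIONS; i++)
            increment(&counter);
        return NULL;
    }

    int main(int argc, char *argv[]) {
        pthread_t threads[NUMTHREADS];
        struct timeval start, end;

        init(&counter);
        gettimeofday(&start, NULL);
        for (int i = 0; i < NUMTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUMTHREADS; i++)
            pthread_join(threads[i], NULL);
        gettimeofday(&end, NULL);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_usec - start.tv_usec) / 1e6;
        printf("final count: %d, time: %.2f seconds\n", get(&counter), secs);
        return 0;
    }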

From the top line in the figure (labeled precise), you can see that the performance of the synchronized counter scales poorly. Whereas a single thread can complete the million counter updates in a tiny amount of time (roughly 0.03 seconds), having two threads each update the counter one million times concurrently leads to a massive slowdown (taking over 5 seconds!). It only gets worse with more threads.

Ideally, you would like to see the threads complete just as quickly on multiple processors as the single thread does on one. Achieving this end is called perfect scaling; even though more work is done, it is done in parallel, and hence the time taken to complete the task is not increased.

                    Scalable Counting

Amazingly, researchers have studied how to build more scalable counters for years. Even more amazing is the fact that scalable counters matter, as recent work in operating system performance analysis has shown; without scalable counting, some workloads running on Linux suffer from serious scalability problems on multicore machines.

Though many techniques have been developed to attack this problem, we will now describe one particular approach. The idea, introduced in recent research, is known as a sloppy counter.

The sloppy counter works by representing a single logical counter with numerous local physical counters, one per CPU core, as well as a single global counter. Specifically, on a machine with four CPUs, there are four local counters and one global one. In addition to these counters, there are also locks: one for each local counter, and one for the global counter.
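In code, this layout might be declared roughly as follows (a sketch in the spirit of the chapter's Figure 29.5; the NUMCPUS constant is an assumption for a four-CPU machine):

    #include <pthread.h>

    #define NUMCPUS 4

    typedef struct __counter_t {
        int             global;          // global count
        pthread_mutex_t glock;           // global lock
        int             local[NUMCPUS];  // per-CPU local counts...
        pthread_mutex_t llock[NUMCPUS];  // ...and per-CPU local locks
        int             threshold;       // update frequency (the sloppiness S)
    } counter_t;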

The basic idea of sloppy counting is as follows. When a thread running on a given core wishes to increment the counter, it increments its local counter; access to this local counter is synchronized via the corresponding local lock. Because each CPU has its own local counter, threads across CPUs can update local counters without contention, and thus counter updates are scalable.

However, to keep the global counter up to date (in case a thread wishes to read its value), the local values are periodically transferred to the global counter, by acquiring the global lock and incrementing it by the local counter's value; the local counter is then reset to zero.

How often this local-to-global transfer occurs is determined by a threshold, which we call S here (for sloppiness). The smaller S is, the more the counter behaves like the non-scalable counter above; the bigger S is, the more scalable the counter, but the further off the global value might be from the actual count. One could simply acquire all the local locks and the global lock (in a specified order, to avoid deadlock) to get an exact value, but that is not scalable.
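Putting the pieces together, the update and read paths can be sketched roughly as follows (again modeled on the chapter's Figure 29.5, using the struct above; mapping a thread to a local counter via threadID % NUMCPUS is a simplification):

    // init: zero all counts and initialize all locks;
    // threshold is the sloppiness S
    void init(counter_t *c, int threshold) {
        c->threshold = threshold;
        c->global = 0;
        pthread_mutex_init(&c->glock, NULL);
        for (int i = 0; i < NUMCPUS; i++) {
            c->local[i] = 0;
            pthread_mutex_init(&c->llock[i], NULL);
        }
    }

    // update: usually just grab the local lock and bump the local
    // count; once it reaches S, transfer it to the global count
    // (under the global lock) and reset the local count to zero
    void update(counter_t *c, int threadID, int amt) {
        int cpu = threadID % NUMCPUS;
        pthread_mutex_lock(&c->llock[cpu]);
        c->local[cpu] += amt;
        if (c->local[cpu] >= c->threshold) {
            pthread_mutex_lock(&c->glock);
            c->global += c->local[cpu];
            pthread_mutex_unlock(&c->glock);
            c->local[cpu] = 0;
        }
        pthread_mutex_unlock(&c->llock[cpu]);
    }

    // get: return the global count, which may lag the true
    // value by up to NUMCPUS * S
    int get(counter_t *c) {
        pthread_mutex_lock(&c->glock);
        int val = c->global;
        pthread_mutex_unlock(&c->glock);
        return val;
    }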

To make this clear, let's look at an example. In this example, the threshold S is set to 5, and there are threads on each of four CPUs updating their local counters L1, L2, L3, and L4. The global counter value (G) is also shown in the trace, with time increasing downward. At each time step, a local counter may be incremented; if the local value reaches the threshold S, the local value is transferred to the global counter and the local counter is reset.
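The trace figure itself is not reproduced in this note; an illustrative trace of this kind (the particular sequence of increments is invented for illustration) might look as follows, with S = 5:

    Time    L1     L2     L3     L4     G
     0       0      0      0      0     0
     1       0      0      1      1     0
     2       1      0      2      1     0
     3       2      0      3      1     0
     4       3      0      3      2     0
     5       4      1      3      3     0
     6       5->0   1      3      4     5   (L1 hits S; transfer 5 to G)
     7       0      2      4      5->0  10  (L4 hits S; transfer 5 to G)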

The lower line in Figure 29.3 (labeled sloppy) shows the performance of sloppy counters with a threshold S of 1024. Performance is excellent; the time taken to update the counter four million times on four processors is hardly higher than the time taken to update it one million times on one processor.

Figure 29.6 shows the importance of the threshold value S, with four threads each incrementing the counter one million times on four CPUs. If S is low, performance is poor (but the global count is always quite accurate); if S is high, performance is excellent, but the global count lags (by at most the number of CPUs multiplied by S). This accuracy/performance trade-off is what sloppy counters enable.

A rough version of such a sloppy counter is found in Figure 29.5. Read it, or better yet, run it yourself in some experiments to better understand how it works.

 

Reposted from: https://www.cnblogs.com/miaoyong/p/4932054.html
