Problem: How to coordinate activities of thousands of cores

A research problem on 1000-core architectures and operating systems
Ying Li
This problem is very broad in scope and challenging to solve, even with careful analysis and sufficient insight into both parallel architectures and system software. I try to analyze the problem from several angles, but there are so many uncertain factors affecting the direction of the trend that some of my predictions may fail in the future. As this essay grew larger and larger, I had to divide it into a series of passages so that potential readers can plan their reading.
Outline
Passage 1 introduces the topic from the perspective of synchronization and communication methods;
Passage 2 introduces the hardware architecture revolution and the problems it brings for operating systems;
Passage 3 analyzes Problem 1 (P1) philosophically and introduces a philosophical framework that aims to address it;
Passage 4 analyzes traditional time-sharing OSes under the philosophical framework and introduces the 2OCN principle;
Passage 5 analyzes two CC-convergent monolithic-family OS instances from UC Berkeley: ROS and Tessellation;
Passage 6 analyzes a Non-CC-convergent microkernel-family OS from ETH Zurich: Barrelfish; discusses coherence;
Passage 7 analyzes another Non-CC-convergent microkernel-family OS from MIT: fos; discusses the 1000-core chip;
Passage 8 (to come) analyzes an experimental hardware platform from Intel: the SCC (Single-chip Cloud Computer);
Passage 9 (to come) analyzes another experimental hardware platform from MIT: ATAC;
Passage A (to come) analyzes a hardware platform product from Tilera: the TILE-Gx100;
Passage B (to come) analyzes a parallel programming model, MPI, and investigates message passing on a 1000-core chip;
Passage C (to come) analyzes a parallel programming model, OpenMP, and investigates shared memory on a 1000-core chip;
Passage D (to come) analyzes a parallel programming model, PVM; I regard this model highly as a potential target;
Passage E (to come) discusses what system software researchers can do in the architecture revolution in general;
Passage F (to come) opens a gate to a DYNAMIC series of fine-grained problems to RESEARCH in detail;
Passage G: the END of this series of passages.
Passage 1
Classic concurrent and parallel programs involve two basic forms of processor/core synchronization: mutual
exclusion and producer-consumer. Both are conventionally implemented with locks that spin on a value stored in
shared memory and accessed through private caches. As we know, caches in front of DRAM reduce access latency
and mitigate bandwidth contention, but they introduce the coherence problem. The access path described above,
together with the spinning itself, inherently causes coherence traffic and wastes power; traditional locks therefore
depend on cache coherence. As the number of cores increases, coherence traffic makes memory bandwidth
contention more severe. Other methods such as transactional memory and message passing can be used to solve
mutual exclusion and producer-consumer synchronization separately. But for traditional OS kernels, which rely
heavily on shared memory (a shared address space), scaling to harness many cores with a more complicated and
structured cache/memory hierarchy without major redevelopment will be difficult, perhaps impossible, although
researchers at MIT found they could remove some kernel bottlenecks by modifying the kernel and applications
slightly, scaling to 48 cores in a shared address space.
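To make the coherence cost concrete, below is a minimal sketch (my own illustration, not taken from any of the systems discussed here) of a test-and-test-and-set spinlock in C11. While waiting, each core spins on its locally cached copy of the flag, but every release and every acquisition attempt forces the cache line to bounce between the spinning cores, which is exactly the coherence traffic described above.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Minimal test-and-test-and-set spinlock sketch. Each spinner
     * polls its cached copy of `locked`; the atomic exchange on
     * acquire and the store on release move the cache line between
     * cores, so coherence traffic grows with the number of spinners. */
    typedef struct {
        atomic_bool locked;
    } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        for (;;) {
            /* Spin on the cached value first; this read hits the
             * private cache and generates no traffic by itself. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* Try to acquire; this write invalidates the line in
             * every other spinner's cache. */
            if (!atomic_exchange_explicit(&l->locked, true,
                                          memory_order_acquire))
                return;
        }
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }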
Consider the two basic communication methods between cores/processors, shared memory and message passing,
provided by parallel hardware architectures including SMPs, ccNUMAs, and custom clusters. According to John L.
Hennessy and David A. Patterson, implementing message passing on top of shared memory is classic, but the
reverse is hard. Yet hardware designers now tend to abandon support for a cache-coherent shared address space,
arguing that they hit the coherence wall in OCN-based many-core architectures. We could implement
software-managed cache coherence protocols to support a globally coherent shared address space, just as was done
in DSM architectures, but limited hardware support makes this challenging. Some hardware researchers even
propose giving up coherence, or avoiding coherence-reliant programming paradigms, through message passing or
other forms of sharing that are not strictly coherent. For traditional general-purpose application programmers, the
lack of cache-coherent shared memory makes life hard (or at least reduces their programming productivity) if they
are not ready to write parallel programs using an MPI-like message-passing library, which originally targeted
high-performance computing.
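As a concrete illustration of the "classic" direction, here is a minimal single-producer/single-consumer message channel built on shared memory (my own sketch, not taken from Hennessy and Patterson): the sender and receiver share a ring buffer, and the coherence hardware carries each message from the producer's cache to the consumer's.

    #include <stdatomic.h>
    #include <stdint.h>

    #define SLOTS 256  /* power of two, so indices wrap cheaply */

    /* Single-producer/single-consumer channel on shared memory:
     * send spins while the ring is full, recv while it is empty. */
    typedef struct {
        uint64_t buf[SLOTS];
        atomic_uint head;  /* next slot to read  (consumer-owned) */
        atomic_uint tail;  /* next slot to write (producer-owned) */
    } channel_t;

    static void channel_send(channel_t *c, uint64_t msg) {
        unsigned t = atomic_load_explicit(&c->tail, memory_order_relaxed);
        while (t - atomic_load_explicit(&c->head, memory_order_acquire)
               == SLOTS)
            ;  /* ring full: wait for the consumer */
        c->buf[t % SLOTS] = msg;
        atomic_store_explicit(&c->tail, t + 1, memory_order_release);
    }

    static uint64_t channel_recv(channel_t *c) {
        unsigned h = atomic_load_explicit(&c->head, memory_order_relaxed);
        while (atomic_load_explicit(&c->tail, memory_order_acquire) == h)
            ;  /* ring empty: wait for the producer */
        uint64_t msg = c->buf[h % SLOTS];
        atomic_store_explicit(&c->head, h + 1, memory_order_release);
        return msg;
    }

The reverse direction, emulating coherent shared memory on top of message passing, requires a software DSM-style protocol and is far harder to make fast, which matches Hennessy and Patterson's observation.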
Passage 2
There are many reasons why traditional OSes cannot scale up to a 1000-core architecture, even one that keeps a
coherent cache/memory access path; such an architecture is a revolution in hardware. This revolution may impact
the whole software stack up to the top-layer applications that interact directly with end users, such as graphical
financial analytics, 3D games, and tools like IDEs. Even basic programming paradigms may switch from concurrency
(sometimes time-shared) to truly parallel styles that are common in HPC and scientific computing.
System software plays a very important role in this architecture and programming-paradigm revolution, because it is
the bridge between low-level hardware and high-level applications. At the same time it is challenging because of this
bi-directional nature: researchers need to harness the more complex, inherently parallel hardware while providing
simple, easy-to-program interfaces and a stable run-time environment to applications. The problem is severe
because neither of the two ends is mature or clear. The two tasks naturally fall to operating systems. I restate them
below:
P1: Harness the new parallel architecture to coordinate 1000 cores, which is in effect an MPP embedded in an OCN.
P2: Provide an efficient run-time environment and effective programming support.
Notes: MPP: Massively Parallel Processor; OCN: On-Chip Network; efficient: high performance per core; effective: easy to use.
Passage 3
A THOUSAND is a relatively large number in human terms, not in dollars but in management, from a psychological
and philosophical perspective. To harness that many activities or behaviors, the method I can think of is
divide-and-conquer. But dividing is a process that runs from at most 1000 down to at least 1, so what is critical is
when to stop and how to conquer the resulting dividing tree. The philosophical question is whether we need a central
coordinator, or government, to control the behavior of the whole tree. In traditional OSes with a monolithic kernel,
the governing role is played by the scheduler, which coordinates all application processes by distributing time slices
to them according to clever strategies that balance different kinds of needs. This scenario is like a centralized
republic; the second form, granting freedom to fine-grained individuals, is like a federation of states. In the
computing world, distributed systems built on networks of different sizes share this nature; for example, the
Internet is really a self-governed system. Research pioneers took advantage of this idea by introducing the concept
of the microkernel into the OS design space: the services a traditional OS provides as system calls (which rely on
context switches) are separated into standalone servers/daemons, which are inherently parallel concepts. In the
single-processor/multi-processor era, servers inside one OS box are naturally different from servers on a network,
because they execute concurrently, which is not true parallelism. Maybe this is also why microkernel OSes did not
prevail in the industrial OS market. But the prime time for microkernels seems to be coming, thanks to the trend
toward 1000-core architectures. I summarize the track of analysis here in order to go further. The philosophical
framework of OS design:
The philosophical method of coordinating the behavior of 1000 cores is divide-and-conquer; the critical techniques
are H2D (how to divide) and H2C (how to conquer). How-to-conquer depends on how-to-divide, and the critical
decision is whether a CC (central coordinator) is used. A CC is the key tool of traditional monolithic OSes;
Non-CC/Loose-CC is one of the key philosophies of microkernel OSes.
Notes: the concepts here are the personal view of the author. "Non-CC" for microkernels may not be accurate; a deeper investigation is needed.
Passage 4
We can view this philosophy from a historical perspective. In the coming many-core era, the most basic computing
resource, processors/cores, will no longer be scarce; the ratio of processors to users (P/U) will be increased greatly
by the arrival of 1000-core chips. In the past single-processor era, the processor was unique while the demands of
users and applications were many, and the problem of how to divide (H2D) seemed hard: how do you divide one
processor? Of course we cannot divide one processor in quantity, but we can divide the computing resource in time,
because the processor is effectively driven by the program counter, which mathematically is a discrete sequence of
numbers, here the memory addresses in the F architecture. Time-sharing opened the era of concurrent
programming / multiprogramming; a novel module called the scheduler in UNIX played the role of the CC and also
accomplished the hard task of H2D at that time. In the current multi-processor era, UNIX-like and other traditional
OSes based on monolithic kernels scale to several cores easily, because the change from one core to four or six is
not severe in essence. In addition, the networking revolution did not upset this pattern, because interconnection
happens outside the chip, and there is no reason for it to destroy the working order of OS kernels. Actually what we
did was combine OS boxes, not divide them. What we should notice is the resemblance between traditional networks
(off-chip networks, whose abbreviation is also OCN) and on-chip networks (OCN). A phenomenon has emerged: this
resemblance gives OS researchers strong motivation to carry results from off-chip-network distributed systems into
their on-chip-network OS designs. I actually think the two OCNs are the same problem at some level. To emphasize
this resemblance and the phenomenon, I will use "the 2OCN resemblance", "the 2OCN principle", or just "2OCN" in
later discussions. Note that the microkernel was also motivated by 2OCN.
Passage 5
Even before we enter the many-core era, OS researchers have already begun considering an OS revolution. As we
can predict from the philosophy stated above, two kinds of views exist, though perhaps not converging closely on
the base scenarios in the philosophy. Still, that is sufficient to help us draw the big picture of current research
progress and figure out the underlying philosophy of different design decisions.
In the CC-convergent genre, excluding Linux and other traditional OSes (including attempts to scale them to 1000
cores by modifying them slightly), ROS (akaros) and Tessellation are representative. Both come from UC Berkeley,
and Tessellation was first implemented on top of ROS. Their ideas resemble each other, but Tessellation may be
more sophisticated in design, so I discuss Tessellation below, since I think it covers the other. To answer H2D,
Tessellation introduced a novel concept called the Cell, which is its unit of division; its H2D methodology is STP,
Space-Time Partitioning. To answer H2C, it introduced a method called Two-Level Scheduling, meaning scheduling
outside and inside the Cells. At this point, I am happy to have abstracted their design motivations under my
philosophical framework successfully (this sentence will be removed in serious publications).
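To show the shape of the idea (a minimal sketch under my own hypothetical names, not Tessellation's actual API): the first level partitions cores among Cells in space and time, and each Cell then runs its own second-level scheduler over the cores it was granted.

    /* Toy two-level scheduling sketch, inspired by (but not copied
     * from) Tessellation's Cell and Space-Time Partitioning. */
    #define MAX_CORES 1024

    typedef struct cell {
        int cores[MAX_CORES];  /* cores granted to this cell now */
        int ncores;
        void (*inner_schedule)(struct cell *self);  /* level two */
    } cell_t;

    /* Level one: divide cores among cells; here a trivial static
     * space partition stands in for a real space-time partitioner. */
    static void outer_schedule(cell_t *cells, int ncells, int total_cores) {
        int per_cell = total_cores / ncells;
        for (int i = 0; i < ncells; i++) {
            cells[i].ncores = per_cell;
            for (int j = 0; j < per_cell; j++)
                cells[i].cores[j] = i * per_cell + j;
            /* Level two: the cell schedules its own threads onto
             * its granted cores, invisible to the outer level. */
            if (cells[i].inner_schedule)
                cells[i].inner_schedule(&cells[i]);
        }
    }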
Passage 6
In the Non-CC/Loose-CC-convergent genre, Barrelfish and fos are representative. Barrelfish comes from ETH
Zurich and fos from MIT. Both employ the microkernel idea, and naturally they try to exploit the 2OCN principle.
Barrelfish introduces the concept of the multikernel to address possibly diverse (or, in my own words, asymmetric)
architectures, including memory hierarchies, interconnects, and even instruction sets. I think this probability is
really small, because the complexity of integrating that many asymmetric components on a small chip would go
beyond human control. Hardware researchers tend to give up coherence support because of the complexity of
connecting too many coherent caches and cores together. A controlled diversity may occur, but not that severe, in
real OCN platforms.
Barrelfish also over-emphasizes the 2OCN principle: it gives up shared memory at the lowest level and uses pure
message passing instead. There is no doubt that hardware features shape the design space of system software,
especially the OS. The Barrelfish designers seem to try their best to escape the limitations of hardware, especially its
most variable part, cache-coherent shared memory. They chose the ultimate method to escape the coming loss or
variance of coherence, believing that coherence is neither necessary nor sufficient. But I have to say this kind of
design is not friendly to application programmers; it just leaves the problem to upper-layer programs or other
programmers.
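To make the contrast concrete, here is a toy multikernel-style update (my own simplification with hypothetical names; this is not Barrelfish's code or its actual protocols): every core runs its own kernel with a private replica of a piece of OS state, and a change is propagated only by depositing messages in the other kernels' mailboxes, never by coherent loads and stores on shared kernel data.

    #include <stdatomic.h>

    #define NCORES 4

    /* Per-core kernel with a private replica of one piece of state. */
    typedef struct {
        int next_pid;     /* replicated value (hypothetical example) */
        atomic_int mbox;  /* one-slot mailbox; -1 means empty        */
    } kernel_t;

    static kernel_t kernels[NCORES];

    static void kernels_init(void) {
        for (int c = 0; c < NCORES; c++)
            atomic_init(&kernels[c].mbox, -1);
    }

    /* Update the replicated state: change the local replica, then
     * tell every other kernel by message; no core ever reads another
     * core's kernel memory directly. */
    static void set_next_pid(int mycore, int value) {
        kernels[mycore].next_pid = value;
        for (int c = 0; c < NCORES; c++) {
            if (c == mycore)
                continue;
            int empty = -1;
            while (!atomic_compare_exchange_weak(&kernels[c].mbox,
                                                 &empty, value))
                empty = -1;  /* mailbox busy: retry until it drains */
        }
    }

    /* Each kernel polls its own mailbox and applies updates locally. */
    static void poll_mailbox(int mycore) {
        int v = atomic_exchange(&kernels[mycore].mbox, -1);
        if (v != -1)
            kernels[mycore].next_pid = v;
    }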
As for cache coherence, we may have to give it up in the future, whether hardware-managed or software-managed,
unless we can find an efficient communication channel with sufficient bandwidth, or a clever method to separate and
balance bandwidth contention. The nature of the problem does not change in the software-managed form: it is still
contention for channel bandwidth between cache/buffered storage and main memory. No method escapes this
obstacle as cores grow greatly in number. The Barrelfish multikernel seems to deliver more portability than the
scalability its researchers expected.
Passage 7
However, Barrelfish is a little more sophisticated than fos. According to the fos authors' prediction, this year (2012)
we would have 2000+ cores on a single chip; that is not the situation at all. The problem of scaling a chip to 1000
cores is not just the problem of putting more transistors on smaller dies under Moore's Law. The problem is how to
connect cores to each other and to memory through high-bandwidth channels. Hardware designers see this as an
optimization problem: how best to connect cores. Maybe this is why they give up coherence in order to interconnect
more and more cores; it is a trade-off! High-bandwidth channels and connection methods may not be implementable
in silicon, and maybe this is the time for physics to perform: researchers at MIT use optics as the communication
medium to address the problem in a project called ATAC, and the results seem promising.
Back to the design of fos, which stands for factored operating system: its researchers use space-sharing in place of
time-sharing in order to scale to many cores. Fleets of servers, and applications bound to cores, communicate
through message passing across fleets and through shared memory within fleets, respectively. This design is neutral
about communication mechanisms, which is a little impractical once hardware implementations are taken into
account; naturally, they simulate their prototype in QEMU, a virtualized environment.
For designs in the microkernel family featuring Non-CC/Loose-CC, the H2D strategy, or criterion, is dividing or
grouping by function, that is, by service type. Autonomous servers, dynamically or statically bound to cores and
interacting little with the microkernel, provide different classes of services to applications through message passing.
For H2C, the case is truly Non-CC/Loose-CC in style: service providers do not have much to communicate with each
other, and seldom with the microkernel.
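A minimal sketch of this space-sharing shape (hypothetical names and stubbed message primitives, not fos's real interfaces): each server is pinned to its own core and spends its life in a receive-handle-reply loop, so applications never time-share a core with the service they call.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    /* Toy space-shared service loop in the spirit of fos. The two
     * message primitives are stubs standing in for a real layer. */
    typedef struct { int op; int arg; } request_t;

    static request_t recv_request(void) {       /* stub */
        request_t r = { 0, 0 };
        return r;
    }
    static void reply(request_t *r, int v) {    /* stub */
        (void)r; (void)v;
    }

    static void *fs_server(void *corearg) {
        int core = *(int *)corearg;

        /* Space-sharing: bind this server to one core for good,
         * instead of letting a time-sharing scheduler migrate it. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (;;) {                    /* receive -> handle -> reply */
            request_t r = recv_request();
            reply(&r, r.op + r.arg);  /* stand-in for real handling */
        }
        return NULL;
    }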
The end
More passages will come soon. Thank you for reading this far! For any suggestions, please contact me at
guaguashu@gmail.com or guaguashu@fastmail.fm.
All rights reserved.