[AMD vs. Intel] Breaking Performance Bottlenecks of SMP Systems with Opteron
Van's Hardware | July 29, 2002 | Spencer Kittelson

Posted on 07/30/2002 9:46:22 AM PDT by JameRetief


One of the most wonderful performance-enhancing features of the forthcoming [AMD] Opteron (Hammer) is the multiple independent memory channels built directly into each CPU.  This is a huge, huge difference from the shared-memory (actually, shared-everything) approach of Intel symmetric multi-processing (SMP) systems.  Given the limitations of today's memory technology, this was an exceedingly smart move on AMD's part, and it will likely change the way we design some of our applications in the very near future.

Intel SMP systems exhibit terrible scalability beyond as few as three CPUs if programs and data don't fit into a CPU's local cache (unless the programs are designed for parallelism from the start). This is largely why Intel CPUs are being equipped with larger and larger caches.  Even big cache increases don't buy much scalability on shared-memory SMP systems, because maintaining cache coherency between multiple CPUs remains a serious limitation.  The bus snooping and the resulting cache flush/fill activity can stall a given CPU for many cycles.  In addition, modern OSes use globally visible data structures to coordinate activity among independent CPUs.  The atomic test-and-set operations needed to support spinlocks and semaphores can chew up lots of memory bandwidth and suffer potentially severe latency when dirty cache lines must be flushed (forcing the other processors to resynchronize their cached view of shared memory).  That latency just kills these systems for many, many applications.  That is why scalability is so poor, as can be seen in tests running everyday applications on everyday OSes such as Unix and Windows NT/2K (when the fourth CPU delivers only a 10% performance kick, the term "scalable" is a complete misnomer). Generally, the applications themselves are simply not designed to run efficiently multi-threaded across multiple CPUs; they are architecturally limited.  It's not just the applications that have problems with SMP.  The OS kernels are often a nightmare, with critical code segments heavily dependent on a given hardware design (memory controller, cache subsystem, etc.), and with full SMP there is LOTS of activity competing for the memory system's available bandwidth while the latency piles up.
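To make the spinlock point concrete, here is a minimal sketch of the kind of atomic test-and-set spinlock alluded to above.  It is illustrative only: the names are invented, and it uses GCC atomic builtins for brevity where a kernel of that era would use inline assembly (e.g. x86 XCHG).  Notice that every contended acquire forces the lock's cache line to bounce between CPUs, which is exactly the coherency traffic described above.

typedef struct { volatile int locked; } demo_spinlock_t;

static void demo_spin_lock(demo_spinlock_t *l)
{
    /* Atomically write 1 and fetch the old value; retry while it was already 1. */
    while (__sync_lock_test_and_set(&l->locked, 1)) {
        /* Spin on a plain read so we hammer the local cache copy,
           not the shared bus, until the lock looks free again. */
        while (l->locked)
            ;
    }
}

static void demo_spin_unlock(demo_spinlock_t *l)
{
    __sync_lock_release(&l->locked);   /* store 0 with release semantics */
}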


Of course, a multi-threaded OS running on a multi-CPU system (it's multi-process as well) still needs some form of shared-resource control, and in the multi-threaded case some way of implementing those spinlocks and semaphores.  Most of the software we use today can run multi-threaded across multiple CPUs, but it is essentially designed for the shared-memory model.  How can we get past the performance bottleneck?  Thread affinity and other techniques have been developed to help, but the shared-memory model itself is the primary problem.  There is only so much bandwidth to share, and it gets used up fast. It's clear that memory bandwidth is generally THE limiting factor on SMP systems.  Thus, to get away from the problem we will need to get away from SMP.  It's that simple.

You could see this problem coming years ago as CPU speeds zoomed past memory performance, which has plodded along at a sub-Moore rate of improvement.  There simply isn't enough memory bandwidth in an SMP system to keep up with the CPUs.  For high-performance, high-volume applications (transaction processing, rendering, searching, etc.) the workload MUST be divided among multiple separate memory systems.  There are several ways to do this, and currently they tend to be rather expensive.  One way to get a lot of performance out of a multi-CPU configuration is to put the CPUs into separate systems.


When data centers are designed to handle massive loads, we often use separate systems that operate together in "clusters".  Perhaps the company best at doing this was the now-absorbed Digital Equipment Corporation (good old DEC).  Its VMS operating system had a built-in clustering capability that allowed multiple separate systems to share their resources in nearly complete harmony (with a significant amount of overhead for the inter-system management).  Each system of course had its own OS image and attached resources (which could be shared as desired), plus various OS enhancements that allowed such things as distributed lock management, failover of applications, and so on. Any given system could "see" the resources of another and, by passing an internally generated request to that system (or virtualizing it), gain access to those resources.  The point here is that each system ran its own applications while also using the storage and I/O systems of the others.  It is pretty amazing when done properly (and you do have to have some skill at application development and cluster management to make it work well).  We can do this with other operating systems, typically Linux (IBM mainframes can too), and Microsoft is trying very hard to do the same (it ain't easy, and their OSes are already such a kluge it may take them years yet to get it right!).  The main benefit of clusters is that we don't have shared-memory issues; the main drawback is that we still have distributed I/O performance issues, usually due to the bandwidth limits of the system interconnect.


Of course, clusters tend to be somewhat expensive and power hungry, since we need to build completely separate systems and then link them with Gigabit Ethernet, Fibre Channel, or some other high-performance LAN that acts as a "backplane" for the systems to communicate over. (DEC used a thingy called CI, or "cluster interconnect", that was very fast.) Clusters work great, but the cost of each box and the expense of hooking everything together with state-of-the-art networking gear is quite high.  There is also the power and cooling expense to consider.  If we could dream up a "perfect" system, what would we really need?

What we really want are the best features of a cluster without all the expense in hardware and energy.  We also don't want just another SMP system with its bandwidth and latency limitations. What will really help eliminate the SMP bottlenecks is to let each CPU do its own thing (just like any given system in a cluster) and communicate with another CPU only when it must (again, just like in a cluster).  With AMD's high-end Opteron CPUs a lot of the cost can be eliminated, and the whole idea of running applications on such a platform may change the way we "do" computing forever.

If you've looked at some functional block diagrams of AMD's intended 4-way high-end Opteron systems (see this .pdf file for reference), you will note that all four CPUs are interconnected via HyperTransport (HT) links.  One of the CPUs is intended to serve as the primary display and I/O resource manager while the other three are dedicated to process or thread execution.  One potential problem is that in most current OS designs process/thread scheduling is controlled by a single CPU, which can create further inefficiencies by requiring significant amounts of inter-process and thus inter-processor communication (IPC).  There simply isn't any free lunch. But what if we didn't need lunch so often?  If we just turned each CPU loose on its own set of problems and let it manage them by itself, it would only have to "talk" to another CPU when it needed I/O or access to some mutually agreed shared resource. Hmmm, this is starting to sound a bit like a cluster....
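To make the "talk only when you must" idea concrete, here is a minimal, purely hypothetical sketch of a one-slot request mailbox that a compute CPU might use to hand I/O work to the dedicated I/O-manager CPU.  Nothing here is AMD's or any real OS's interface; the structure and field names are invented for illustration, and a real system would sleep or take an interrupt rather than spin.

#include <stdint.h>

/* Hypothetical one-slot request mailbox shared by a compute CPU and the
   I/O-manager CPU.  The compute CPU fills in a request and sets 'ready';
   the I/O CPU polls 'ready', does the work, and sets 'done'.  Only these
   few cache lines ever move between the two CPUs. */
struct io_request {
    uint64_t      device;   /* which device to touch              */
    uint64_t      block;    /* block number to read or write      */
    void         *buffer;   /* where the data should end up       */
    volatile int  ready;    /* 0 = slot empty, 1 = request posted */
    volatile int  done;     /* set by the I/O CPU when finished   */
};

/* Compute-CPU side: post one request and wait for completion. */
static void submit_io(struct io_request *slot,
                      uint64_t device, uint64_t block, void *buffer)
{
    slot->device = device;
    slot->block  = block;
    slot->buffer = buffer;
    slot->done   = 0;
    __sync_synchronize();      /* make the fields visible before publishing */
    slot->ready  = 1;
    while (!slot->done)        /* a real system would sleep here, not spin */
        ;
}

/* I/O-manager-CPU side: called in a loop, service whatever shows up. */
static void service_io(struct io_request *slot)
{
    if (slot->ready) {
        /* ... perform the actual device transfer into slot->buffer ... */
        slot->ready = 0;
        __sync_synchronize();
        slot->done  = 1;
    }
}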

If we extend the concept a bit, why not just provide a full mini-kernel to each CPU and let that CPU schedule and manage its own list of processes?  That can cut IPC requirements significantly and, for certain (properly written) applications, eliminate the cache coherency and shared-memory latency/bandwidth issues.  There is still shared I/O to deal with, but at least that is handed off to a completely separate CPU.  In the Opteron architecture, the HT links allow extremely efficient I/O transfers into and out of the buffer memory of the CPU that wants to move data.  As Nils has pointed out, this is almost a mainframe-class "channel" architecture (and it is very, very efficient).  Each HT channel can handle over a giga-BYTE per second of data transfer, enough to gobble up the output of over a dozen striped disk drives simultaneously (and if that isn't enough, just wait for the next rev of HT!).   Each CPU's cache does its own thing with its own list of processes, which cuts inter-CPU IPC dramatically.
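As a quick back-of-the-envelope check of that claim, here is a trivial sketch of the arithmetic.  The per-drive figure is an assumed sustained rate for a fast drive of that era, not a number from the article, and the link figure simply uses the article's "over a gigabyte per second".

#include <stdio.h>

int main(void)
{
    /* The article's figure: "over a giga-BYTE per second" per HT link. */
    const double ht_link_mb_per_s = 1000.0;
    /* Assumed sustained rate for one fast drive of that era (not from the article). */
    const double drive_mb_per_s   = 60.0;
    const int    drives_striped   = 12;   /* "over a dozen striped disk drives" */

    double array_mb_per_s = drive_mb_per_s * drives_striped;   /* 720 MB/s */

    printf("striped array: %.0f MB/s  HT link: %.0f MB/s  headroom: %.0f MB/s\n",
           array_mb_per_s, ht_link_mb_per_s, ht_link_mb_per_s - array_mb_per_s);
    return 0;
}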


Applications would generally run entirely within a given CPU and should be designed for CPU thread affinity.  This keeps garden-variety application design similar to how we do things today and keeps things simple, which isn't much of a sacrifice since these CPUs are smokin' anyway.  However, when there is a compelling need to wring out performance (object and relational DBMSes, rendering, etc.), we can design threads that run more or less independently on separate CPUs and then coordinate their intermediate results over those HT channels (Linux Beowulf clusters often do this already, so the techniques are well known).
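Here is a minimal sketch (not anything from the article) of that affinity-plus-partial-results pattern, using the Linux-specific pthread_setaffinity_np(): each worker is pinned to one CPU, works only on its own slice of the data, and shares only a single partial result at the end.  Compile with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NCPUS 4
#define CHUNK (1 << 20)

/* Each worker touches only its own slice, so the slice stays warm in the
   cache (and, on an Opteron-style box, in the local memory) of the CPU
   the worker is pinned to. */
static double data[NCPUS][CHUNK];
static double partial[NCPUS];

static void *worker(void *arg)
{
    long cpu = (long)arg;

    /* Pin this thread to a single CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    double sum = 0.0;
    for (long i = 0; i < CHUNK; i++)
        sum += data[cpu][i];

    partial[cpu] = sum;        /* the only shared write this thread makes */
    return NULL;
}

int main(void)
{
    pthread_t tid[NCPUS];

    for (long c = 0; c < NCPUS; c++)
        pthread_create(&tid[c], NULL, worker, (void *)c);

    double total = 0.0;
    for (long c = 0; c < NCPUS; c++) {
        pthread_join(tid[c], NULL);
        total += partial[c];   /* coordinate the intermediate results once */
    }
    printf("total = %f\n", total);
    return 0;
}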

These are not new ideas, but the imminent arrival of AMD's Opteron server-class CPUs begs for a more advanced OS architecture than the one we currently use on shared-memory SMP systems.  We will not get the best from these systems without a superior OS design.  Is one available?

If you've followed so far, you'll be interested in this heavily footnoted paper by Karim Yaghmour of Opersys.  He suggests and analyzes what it will take to get a separate Linux kernel running on each CPU in a multi-CPU environment.  There is no mention of Hammer or Opteron, but the concepts apply readily, and without doubt more efficiently on the Opteron architecture than on Intel SMP.

Of course, the above doesn't even begin to cover the plethora of issues involved, but it at least highlights where some of the biggest problems currently lie and how the AMD Opteron architecture can provide a significant advance.  Think on, y'all!


TOPICS: Business/Economy; Culture/Society; Editorial; News/Current Events; Technical
KEYWORDS: amd; bottlenecks; clusters; hyperthreading; intel; memory; multiprocessors; sharing; techindex

1 posted on 07/30/2002 9:46:22 AM PDT by JameRetief

To: Ernest_at_the_Beach; rdb3
Thought this might be of interest to you.
2 posted on 07/30/2002 9:47:28 AM PDT by JameRetief

To: JameRetief
Yes, I was going to cluster a few mini-kernels related to Relational DBMS (interconnecting them via HT links, of course).

But then I realized that due to low scalability, my system was suffering potentially severe latency delays from dirty caches.

This of course thus forced the MMU's to resynchronize!

Conclusion: it's all Hillary's fault.

3 posted on 07/30/2002 10:03:24 AM PDT by governsleastgovernsbest

To: JameRetief
Interesting how this article lauds VMS and slams Microsoft's current operating systems, when in truth they were both designed by the very same individual, Dave Cutler.
4 posted on 07/30/2002 10:39:56 AM PDT by BuddhaBoy

To: BuddhaBoy; All
How does the proposed "Digital 'Rights'" encryption overhead impact the throughput, and hence the WASTE of Electricity inherent in crippling the computer??
5 posted on 07/30/2002 11:52:35 AM PDT by Lael

To: JameRetief; *tech_index; Mathlete; Apple Pan Dowdy; grundle; beckett; billorites; One More Time; ...
Thanks for the ping!

Shades of the old IBM mainframe VM (Virtual Machine) operating system!

To find all articles tagged or indexed using tech_index, click here: tech_index

6 posted on 07/30/2002 12:07:03 PM PDT by Ernest_at_the_Beach

To: JameRetief
Found a nice description of what Karim Yaghmour of Opersys gets into heavily in the paper referenced above!

Yaghmour: A practical approach to Linux clusters on SMP hardware
Jul. 24, 2002

Karim Yaghmour writes . . .

In continuing my work on the Adeos nanokernel and following many discussions at the OLS, I concentrated on the idea of running multiple Linux kernels in parallel on SMP hardware in order to obtain SMP clusters. What started as a high-level investigation eventually turned out to be an in-depth search for a viable architecture.

At this point, I think I have figured out the exact components required (including how to boot multiple independent Linux kernels on SMP hardware), their interactions, and the additions to be developed. I wrote a paper (see link at the bottom) detailing these issues in order to encourage discussion over the ideas and techniques.

Instead of a high-level overview or an explanation of the virtues of SMP clusters, this paper presents many low-level implementation details and outlines the exact steps required to obtain a fully functional system.

To get there, I looked at a lot of previous work. Running multiple OS instances on the same hardware to enhance scalability is an idea that has actually been investigated before. The most documented example is Disco (which was implemented by the very folks who eventually brought us VMware). Disco is a virtual machine monitor that enables multiple IRIX kernels to coexist on the same hardware.

The main "flaw" of the approach is that the authors started with assumption that modifying the OS to run multiple copies of it wasn't permitted. The architecture described in my paper is based on the assumption that minor kernel modifications are permitted as long as they are very isolated and easily removable. Hence, no virtualization whatsoever.

These are the main features of the architecture being presented:
  • No changes to the kernel's virtual memory code.
  • No changes to the kernel's scheduler.
  • No changes to the kernel's lock granularity.
  • Minimal low-level changes to kernel code.
  • Reuse of many existing software components.
  • Short-term accessibility.
Paper title: A Practical Approach to Linux Clusters on SMP Hardware

Paper outline:
  1. Introduction
  2. Previous work
  3. Overall system architecture
  4. Adeos nanokernel
  5. Kernel-mode bootloader
  6. Linux kernel changes
  7. Virtual devices
  8. Clustering and single system image components
  9. Work ahead
  10. Caveats and future work
  11. Conclusion
The paper is 12 pages long, including references, and should make for an interesting read even if you aren't interested in clustering or scalability. I could go on and provide more details, but if you're still reading this, you'd better grab the paper and have a look for yourself. Here's the paper in various formats: Postscript / PDF / HTML

Feedback and suggestions on the architecture being presented are encouraged.

Best regards,
Karim




7 posted on 07/30/2002 12:18:15 PM PDT by Ernest_at_the_Beach

To: JameRetief
I guess this means that Beowulf builders should use AMD powered boxes instead of Intel?
8 posted on 07/30/2002 1:57:19 PM PDT by anymouse

To: JameRetief
With memory bandwidth becoming the limiting factor, I'm wondering if this will shift the advantage from RISC architectures back to CISC, on the theory that an architecture that produces "denser" code will run faster.

Given the fair density of Java bytecodes, I'm wondering how fast Java will run on these high-end machines once the cache grows big enough that the JVM mostly resides in the cache.

9 posted on 07/30/2002 4:53:19 PM PDT by SauronOfMordor

To: anymouse
I guess this means that Beowulf builders should use AMD powered boxes instead of Intel?

They already are. Here is one that I know of, installed in January of last year: University of Delaware installs AMD-based Beowulf cluster. The cluster contains 128 AMD Athlon processors.

10 posted on 07/30/2002 8:54:38 PM PDT by JameRetief

To: SauronOfMordor
Actually, RISC never really panned out the way people thought it would. Apple is the only major player still using it. However, both the Intel P4 and AMD's Athlon and Hammer chips convert most instructions into RISC-like operations internally and then execute those.
11 posted on 07/30/2002 9:38:42 PM PDT by ImphClinton

