Colossus, Cray and Blue Gene: The History of Supercomputers
Posted on 06/27/2009 3:26:30 PM PDT by texas booster
Rocky Marciano, Muhammad Ali, Joe Frazier and Mike Tyson; Colossus, Cray, ASCI Red and Blue Gene. The names of boxing's heavyweights are never forgotten - and it's the same with the champs of the supercomputing world.
These machines truly are like no others. Each is computationally more muscular than its predecessor; and for a while, each has claimed the title of the fastest computer in the world. But, as the calamitous fall of 'Iron' Mike Tyson showed us, champions are built to be felled. And so we've seen supercomputers come and go, growing from single-processor machines capable of a few thousand operations per second to systems like IBM's Roadrunner and the Cray XT Jaguar, the latter boasting a massive array of 45,000 AMD Opteron processors.
As you read on, we'll explore the pantheon of computing's biggest and hardest hitters, and answer many of supercomputing's most tantalising and fundamental questions. Exactly why do we need all this power? How much electricity does it take to power a supercomputer? What types of technology will tomorrow's supercomputers use? And how did it all begin? Round one begins now.
The rise of fast machines
For some time, the history of the fledgling supercomputer was the history of computing itself. At the dawn of the digital age, devices like the Colossus Mark 1 and 2, and ENIAC filled entire rooms. They existed simply to crunch numbers far beyond human abilities. The term 'supercomputer' didn't enter common parlance until the 1960s, and it's often associated with just one famous individual - Seymour Cray. Cray's name is virtually synonymous with the supercomputer. He started designing machines while working for Control Data Corporation (CDC), a company that had produced the fastest computers in the world for nearly a decade. Cray set himself the goal of creating a computer 50 times faster than the quickest system being sold by CDC at the time, the 48-bit 1604. It took him years, causing some consternation among CDC's management, but in 1964 the CDC 6600 came on the market.
Until the 1960s, computer processing power was measured by how many thousands of operations per second (OPS) a computer could perform. Colossus sported 5,000 OPS, ENIAC 100,000 OPS, and the fastest machine of the 1950s - IBM's catchily named AN/FSQ-7 - still only offered 400,000 OPS. By the time the CDC 6600 arrived, IBM had tripled the speed of its fastest system - the infamous 7030 Stretch - thanks to its adoption of transistors. But the CDC 6600 upped the ante. While Stretch could manage 1.2 MFLOPS (1,200,000 FLOPS), the CDC 6600 was 2.5 times faster, giving 3 MFLOPS. Note also that machines had switched from integer OPS to floating point FLOPS at the turn of the decade.
The leap in processing power given by the CDC 6600 has defined the concept of the supercomputer. Five years later, CDC made an even bigger step forward. The 7600 provided more than 10 times the performance of the 6600, giving 36 MFLOPS, and the trend continued, with the STAR-100 tripling the score in another five years to 100 MFLOPS. Within two years, Seymour Cray had broken away from CDC to form his own company. Its first product, the Cray-1, hit 250 MFLOPS in 1976.
Since then, supercomputer performance has increased by orders of magnitude every decade. The first GFLOP supercomputers (a thousand MFLOPS) arrived in the early 1980s, and the TFLOP level (a thousand GFLOPS) was exceeded by Intel's ASCI Red in 1997. In 2008, IBM's Roadrunner became the first PFLOP supercomputer, achieving another thousand-fold increase in speed. At the same time, the fastest desktop quad-core processors contained in personal computers are achieving over 50 GFLOPS - the same as supercomputers of the early 1990s.
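To put the "orders of magnitude every decade" claim into numbers, here is a minimal Python sketch that works out the implied yearly growth between three of the milestones quoted above. The helper name annual_growth and the MILESTONES table are illustrative, and the ASCI Red and Roadrunner entries are rounded to the TFLOP and PFLOP thresholds the article mentions rather than to measured benchmark scores.

```python
# Figures are the ones quoted in the article; ASCI Red and Roadrunner are
# rounded to the 1 TFLOPS and 1 PFLOPS thresholds (an assumption).
MILESTONES = [
    # (machine, year, FLOPS)
    ("Cray-1", 1976, 250e6),
    ("ASCI Red", 1997, 1e12),
    ("Roadrunner", 2008, 1e15),
]


def annual_growth(start_flops, end_flops, years):
    """Constant yearly multiplier that turns start_flops into end_flops."""
    return (end_flops / start_flops) ** (1 / years)


if __name__ == "__main__":
    for (name_a, year_a, flops_a), (name_b, year_b, flops_b) in zip(
        MILESTONES, MILESTONES[1:]
    ):
        years = year_b - year_a
        factor = flops_b / flops_a
        rate = annual_growth(flops_a, flops_b, years)
        print(f"{name_a} -> {name_b}: x{factor:,.0f} in {years} years, "
              f"about x{rate:.2f} per year")
```

On these rounded figures the yearly multiplier comes out at a little under 1.5 for 1976 to 1997 and a little under 1.9 for 1997 to 2008, so the pace of improvement was itself increasing.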
What makes a computer super?
What made the first true supercomputers so much faster than the previous systems? The answer really is quite simple: parallelism. The CDC 6600 was still what would be called a single-processor system, with just one central processor (CP). However, this was also assisted by a series of 10 slower peripheral processors (PPs), which ran in parallel. The CP itself only handled mathematical and logic operations, while the PPs performed all of the memory and input/output tasks. Since the CP was handling a much smaller subset of operations, it could be run faster. The other important element was the switch from thermionic valves (vacuum tubes) to transistors, which offered faster switching speeds. These factors taken together meant that the CDC 6600's CPU could run at 10MHz while other supercomputers of the day were operating at around 1MHz.
Since memory at that time was around 10 times faster than most supercomputer CPUs, the CDC 6600's architecture ensured that operations took full advantage of the bandwidth. The CDC 6600's PPs were each allowed access to the CP for one tenth of the time. So, although these were running slower than the CP, they were able to keep data flowing. The CDC 6600's CP also contained 10 function units internally, which enabled it to work on instructions in parallel. This was the first implementation of a superscalar processor design.
The idea of parallelism has continued to dominate the structure of supercomputers since the CDC 6600. It requires careful programming, mostly because the code has to be split up so that it can run in simultaneous chunks. The next Cray design introduced pipelining, a technique where an instruction unit is broken up into stages so that it can begin work on a new instruction before it has finished the last one. Superscalar designs with multistage pipelines are now de rigueur in modern desktop processors. VIA's C7 and Intel's Atom are notable non-superscalar exceptions.
Following the vector
The next CDC development introduced another important element that has defined the expansion of supercomputers ever since: vector processing. This technique sees a single operation being performed on multiple data sets at once. The first system from Seymour Cray's own company - the Cray-1 - used vector processing with the addition of registers. These additions allowed it to apply multiple operations on the same data at once, and necessitated separate vector hardware - something that has been added to desktop CPUs in the form of secondary Single Instruction Multiple Data (SIMD) logic for the last decade.
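The "single operation on multiple data" idea can be sketched in a few lines of Python. This is only a model: plain Python cannot issue real vector instructions, so each "vector instruction" is represented by one pass over a whole chunk. The function names and the VECTOR_LENGTH of 64 are assumptions made for illustration, not details taken from the article.

```python
# Illustrative model only: each "vector instruction" is counted once per chunk,
# whereas real vector hardware would process the chunk in a single instruction.

VECTOR_LENGTH = 64  # assumed register length, chosen for illustration


def scalar_add(a, b):
    """One add instruction per element pair; returns (result, instruction count)."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def vector_add(a, b, vector_length=VECTOR_LENGTH):
    """One modelled vector instruction per chunk of up to vector_length elements.

    Assumes a and b have the same length.
    """
    result = []
    instructions = 0
    for start in range(0, len(a), vector_length):
        chunk_a = a[start:start + vector_length]
        chunk_b = b[start:start + vector_length]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    a = list(range(1000))
    b = list(range(1000, 2000))
    scalar_result, scalar_count = scalar_add(a, b)
    vector_result, vector_count = vector_add(a, b)
    print("same answer:", scalar_result == vector_result)
    print("instructions issued:", scalar_count, "scalar vs", vector_count, "vector")
```

Both routines produce the same sums. The difference is in how many instructions have to be fetched and decoded to get there, which is where vector hardware saves its time.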
Vector processing has remained the core structure of supercomputer CPUs. The only major additions have been multiprocessing and clustering, which are different levels of essentially the same thing. Multiprocessing groups multiple CPUs into a single computer (also known as a 'node'), while clustering groups together multiple nodes. The multiprocessor computers can work on multiple streams of data using vector subsystems, so they are called Multiple Instruction Multiple Data systems. So while different supercomputer companies put a varying number and type of CPUs in each node and use a varying number of nodes in their clusters, the overall approach is almost universal.
The upshot of this is that the CPU design itself is no longer the focus of attention. Instead, manufacturers concentrate on how the CPUs are connected together. For example, non-uniform memory access (NUMA) has become a mainstay in supercomputing, particularly with processor designs that include on-die memory controllers.
In the first few decades of the supercomputer, memory was faster than processors, which was one of the main reasons behind the new design created for the CDC 6600. But nowadays CPUs are faster than memory, and this is even more of a problem if memory is shared across lots of processors.
NUMA alleviates this problem by giving each processor its own local memory. But rather than making this entirely discrete, processors can access each other's local memory. The memory and cache controllers associated with each processor must also communicate to maintain coherency. Otherwise, changes in data held locally would not be recognised when the same data is worked on by another processor. Fast connectivity between processors is therefore a necessity.
The need for speed
Now that we've covered all of the developments in supercomputing over the last five decades, it's probably time to mention why we even needed to build supercomputers in the first place. Put simply, we need them to perform calculations that are beyond our capabilities. The first computers were developed during World War II to execute complex code-breaking calculations that would have taken any human being an incredibly long time to perform.
This has remained the core function of supercomputers: performing complex and usually repetitive algorithms on huge data sets. Supercomputers have found a home in weather centres worldwide, and although their predictions might not always be as accurate as we might like, they do a far better job than we would be able to do without them! They're also key in more general climate research: without supercomputers, we would probably not have known about global warming. NEC's Earth Simulator was created for precisely this purpose. The amount of data that needs to be processed when considering this global phenomenon is enormous.
Likewise, military problems also often require supercomputers. The current fastest supercomputer in the world - an IBM BlueGene/L, nicknamed Roadrunner, and installed at the Lawrence Livermore National Laboratory in California - works for the US military. Most of its workload is classified, but it is known that much of it involves work on nuclear weapons.
Physics simulation in general is another important application. The ASCI Red was primarily created to provide the level of processing power required for 'full-physics' numerical modelling, where all of the data and physical equations of a system can be used in full. Other scientific applications include chemical and biological molecular analysis, with the latter the particular focus of the Folding@Home project, which turns everyday Internet-connected computers into a supercomputer distributed around the globe. Semiconductor design now also requires the use of supercomputers, so the systems are in effect designing their own future.
In order to benefit from supercomputing, problems must contain a considerable amount of parallelism. If a problem can't be split up in this way, it will be a waste to run it on this kind of exotic hardware. Fortunately, some tasks lend themselves naturally to parallelism. These tasks are nicknamed 'embarrassingly parallel', and examples include graphics rendering where each pixel can be calculated separately and brute-force code cracking.
Speed limits
Big computers have big problems, however. Thanks to the laws of physics, there is a limit on how fast data can travel: nothing can go faster than light. For a spread-out system, data will take a fairly large amount of time to move from one processing subsystem to another, placing a ceiling over how fast calculations can occur. The continuing reduction in the size of transistors helps to pack more of them into the same space, so the distances between them will become smaller.
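A rough calculation shows why the speed of light matters at machine-room scale. The sketch below computes the minimum one-way delay over a given distance and how many clock cycles pass in that time. The 30-metre distance and 3GHz clock in the usage example are assumed values for illustration only, and real signals in copper or fibre travel slower than light in a vacuum, which the optional velocity_fraction parameter is meant to represent.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in a vacuum


def one_way_delay_ns(distance_m, velocity_fraction=1.0):
    """Minimum one-way signal delay in nanoseconds over distance_m metres.

    velocity_fraction scales the signal speed relative to light in a vacuum.
    """
    return distance_m / (SPEED_OF_LIGHT_M_PER_S * velocity_fraction) * 1e9


def cycles_elapsed(distance_m, clock_hz, velocity_fraction=1.0):
    """Clock cycles that pass while a signal covers distance_m."""
    delay_s = one_way_delay_ns(distance_m, velocity_fraction) * 1e-9
    return delay_s * clock_hz


if __name__ == "__main__":
    distance = 30.0   # metres across a machine room (assumed)
    clock = 3e9       # 3GHz processor clock (assumed)
    print(f"delay: {one_way_delay_ns(distance):.1f} ns")
    print(f"cycles lost: {cycles_elapsed(distance, clock):.0f}")
```

Under those assumptions a processor sits through roughly 300 clock cycles before a signal from the far side of the room can even arrive, before any switching or protocol overhead is counted.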
But since supercomputers are now made up of clusters of multiprocessor computers, the communications paths between all the different elements have become the most significant bottleneck. Although the processors in today's supercomputers aren't far off what you find in a desktop, the networking fabric connecting them together remains highly specialised. A key difference between AMD's Opteron processors, which are aimed at high-performance computing (HPC) usage, and its Athlon 64s (which are aimed at the desktop) is the number of HyperTransport buses available. These buses allow the processors in a node to collaborate more quickly. Intel's Quick Path Interconnect will perform a similar function when Core i7's higher-end Xeon siblings appear in early 2009.
Then there's the need to network nodes together as fast as possible to make the cluster. This requirement has led to the development of HPC-specific networking technologies. Sun used the Scalable Coherent Interface - which is capable of 20Gbps - for many of its supercomputers in the late 1990s, and saw its share of the TOP500 list grow rapidly as a consequence. But the hunger for ever-increasing network bandwidth is never satiated, leading to the introduction of Infiniband, which can operate at speeds of up to 96Gbps - nearly a hundred times faster than the Gigabit Ethernet used for more general networking. The IBM Roadrunner uses Infiniband to connect its clusters. A 100Gbps version of Ethernet called 100Gbase-X is also under development. Some supercomputer manufacturers have developed their own proprietary interconnect technology. NEC's IXS Super-Switch technology offers a staggering 256Gbps.
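To get a feel for what those link speeds mean in practice, the following sketch estimates how long it would take to move a fixed payload over each interconnect at the headline rates quoted above. It is deliberately simplistic: it ignores latency, protocol overhead and contention, uses decimal units (1 gigabyte = 8 gigabits), and the 1GB payload is an arbitrary choice.

```python
# Headline link rates in Gbps as quoted in the article.
LINK_SPEEDS_GBPS = {
    "Gigabit Ethernet": 1,
    "Scalable Coherent Interface": 20,
    "Infiniband": 96,
    "NEC IXS": 256,
}


def transfer_seconds(payload_gigabytes, link_gbps):
    """Idealised transfer time: payload size in gigabits divided by link rate."""
    payload_gigabits = payload_gigabytes * 8
    return payload_gigabits / link_gbps


if __name__ == "__main__":
    payload = 1  # gigabytes, chosen arbitrarily for the comparison
    for name, gbps in LINK_SPEEDS_GBPS.items():
        print(f"{name:30s} {transfer_seconds(payload, gbps) * 1000:8.1f} ms")
```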
Another perennial problem of performance computing is that processing power also requires electrical power. This means that the more of the former you want, the more of the latter you're going to need. IBM's valve-based AN/FSQ-7 of the 1950s required as much as 3 MegaWatts - enough to illuminate a small town. The headline figures haven't diminished much over the years, either, with IBM's Roadrunner requiring 2.35MW at peak - although Roadrunner packs in thousands of processors while its predecessor powered just one.
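The two power figures look similar, but the work done per watt is not. A minimal sketch using only numbers quoted in this article (400,000 OPS and 3MW for the AN/FSQ-7, 1 PFLOPS and 2.35MW for Roadrunner) makes the gap explicit. Note the caveats: OPS and FLOPS are not the same unit, Roadrunner is rounded to exactly 1 PFLOPS, and both power figures are treated as if they applied at full speed, so the result is an order-of-magnitude comparison at best.

```python
def ops_per_watt(operations_per_second, power_watts):
    """Operations per second delivered for each watt of electrical power."""
    return operations_per_second / power_watts


if __name__ == "__main__":
    # Figures quoted in the article; Roadrunner rounded to 1 PFLOPS (assumption).
    fsq7 = ops_per_watt(400_000, 3e6)
    roadrunner = ops_per_watt(1e15, 2.35e6)
    print(f"AN/FSQ-7:   {fsq7:.3f} OPS per watt")
    print(f"Roadrunner: {roadrunner / 1e6:.1f} MFLOPS per watt")
    print(f"improvement: roughly {roadrunner / fsq7:.1e} times")
```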
Closely associated with this hunger for Watts is one of its by-products: heat. Cray tackled this situation from the outset, using liquid cooling achieved with Freon and copper cold plates. The company also developed some other novel cooling systems, such as immersing components in electrically inert but highly heat-conductive fluids. This method was used to cool the Cray-2.
But the problem of cooling supercomputers extends far beyond their main internal components. With MegaWatts of electrical power going in, getting the heat away from the circuitry is just the beginning. The cabinets must be designed with heat dissipation in mind, and the whole architecture of the supercomputer facility must transfer hot air to the outside atmosphere. This generally involves hefty amounts of air conditioning. Some designs have involved elaborate water-cooling pipes worked through the facility itself, although this has fallen out of favour for cost reasons. Either way, taming the thermal problem is likely to consume a significant amount of electrical power. For example, IBM's ASCI White required 3MW to power its computing tasks, but it required an equal amount of power to cool the system while it was running.
Meeting the challenges
Most of the fastest computers in the world now use similar processors to those found in your desktop PC and even the latest consoles. Cray's XT Jaguar came close to beating IBM's Roadrunner with a massive array of 45,000 quad-core AMD Opterons. But there is research into new designs that could again increase the power of individual processors by orders of magnitude. For example, CPUs are still resolutely two-dimensional. Since getting data around the various components is a major issue for massively parallel systems, being able to pack transistors on top of each other as well as side by side promises the kind of leap in performance caused by the integrated circuit itself.
In October 2008, the Interuniversity Microelectronics Centre (IMEC) in Belgium announced a breakthrough in 3D stacking, demonstrating working circuits using its 5µm copper through-silicon vias (Cu-TSV) process. Two 130nm wafers were sandwiched on top of each other, with copper lands bonded together using thermocompression. So, in theory, two quad-core processors could be packaged into the space of a single eight-core processor.
In Japan, electronics firm Unisantis is working on a Stacked-Surrounding Gate Transistor (S-SGT) design, which promises to enable chips with clockspeeds between 20GHz and 50GHz. S-SGT is a bit like perpendicular recording in hard disks, with the transistors arranged vertically rather than horizontally. This means that more transistors can be packed into the same space, and it reduces both the effects of some of the unwanted physical properties that are encountered when transistors reach a certain level of miniaturisation (such as gate leakage) and the speed limits caused by how far electrons have to travel from gate to gate.
Initial research is revolving around increasing the density of flash memory, which isn't surprising as Fujio Masuoka - who invented flash memory when he was working at Toshiba - is one of the chief proponents. But benefits are expected across all types of silicon products, including CPUs. Processors hit a clockspeed wall a few years ago, which forced a switch to a parallel multicore approach to boost computing speed instead. But a tenfold increase in frequency would still provide a proportional boost in computing performance.
Optical computers have also been touted as a future replacement for current silicon-based designs. However, photonic transistors would actually require more power than electronic ones. So, in reality, optical computing is not likely to be the future of supercomputing. However, an area where optics do win out is when data rates and distances rise, as less loss of data is incurred compared to electrical lines.
Optical fibre is already the main enabling technology of high-speed telecommunications, and optical Infiniband cabling has been shown to exceed its copper equivalent in performance. Now, optical connections are also starting to be considered for use inside the CPU. In particular, the Optical Shared Memory Supercomputer Interconnect System (OSMOSIS), a joint project of Corning Incorporated and IBM, aims to create a photonic-switching fabric. This would provide high-speed switching and scheduling of all the CPUs in a massive parallel cluster. The most recent results demonstrated the fastest optical packet switch in the world, with an aggregate capacity of 2.5Tbps.
Another promising possibility for the future of a supercomputer CPU comes from a much more organic source: DNA. A demonstration in 2002 by researchers from the Weizmann Institute of Science in Rehovot, Israel, showed off an example of DNA computing that gave a performance of 330 trillion OPS. Even now - six years later - this performance places it fourth in the TOP500 list, and astonishingly, this was achieved with a single DNA molecule. However, the technology is currently very limited in the kind of calculations it can perform, and it can only answer 'yes' or 'no' when asked a question. The system isn't exactly a floating-point cruncher in the manner of traditional supercomputers, and it won't be making its way to a mainframe near you in the near future, but it could well come into its own at some point.
An even more esoteric answer to the problem of building a supercomputer comes from quantum physics. This is still a very new area, but small-scale calculations have been successfully demonstrated using the curious behaviour of matter at the quantum level, in particular entanglement and superposition. With entanglement, two or more objects have linked quantum states, meaning that when one changes, the other performs an identical transformation. Superposition refers to the probabilistic way in which matter behaves at the quantum level. Taken together, these behaviours theoretically would allow quantum computers to perform calculations an order of magnitude quicker than traditional systems.
PFLOPS in your lounge
However, it's unlikely that any of these new CPU technologies will be making their way into supercomputing over the next few years. Developing a new and amazing processor design is great for the advancement of technology, but it must be realistic. If the new design is 10 times faster, but a hundred times more expensive than designs derived from mainstream consumer products, then clusters of the latter will have a much more attractive price-performance proposition. This was the main reason why supercomputing hit a brick wall in the early 1990s, a period when many of the former big names were forced into bankruptcy.
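The price-performance argument is simple arithmetic, sketched here with the paragraph's own hypothetical numbers: a custom design that is 10 times faster but 100 times more expensive than a commodity part. The function names are illustrative, and the cluster figure assumes perfect scaling across commodity processors, which real workloads rarely achieve.

```python
def performance_per_cost(relative_speed, relative_cost):
    """Speed delivered per unit of money, relative to a commodity part (1.0)."""
    return relative_speed / relative_cost


def commodity_cluster_speed(budget_in_commodity_units, scaling_efficiency=1.0):
    """Aggregate speed of a cluster bought with the same budget.

    scaling_efficiency=1.0 is the idealised case of perfect parallel scaling.
    """
    return budget_in_commodity_units * scaling_efficiency


if __name__ == "__main__":
    custom_speed, custom_cost = 10, 100  # hypothetical figures from the text
    print("custom design, speed per unit cost:",
          performance_per_cost(custom_speed, custom_cost))
    print("same budget spent on commodity parts, ideal aggregate speed:",
          commodity_cluster_speed(custom_cost))
```

Even if the cluster only reached a fraction of its ideal aggregate speed, it would still need to scale very badly indeed before the custom design came out ahead on cost.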
Once upon a time, computer technology innovation flowed from the specialised high-end to the generalised consumer. But nowadays volume is a key requirement in order to provide the income necessary for the research and development of a new processor core design. CPUs are designed with mass appeal first to make them financially viable, but with the ability for HPC derivatives of the processor to be made. For example, of the top 10 fastest supercomputers in the world as of November 2008, none used processors that were custom-designed for the purpose. Instead, AMD Opterons, Intel Xeons and IBM PowerPCs dominate, all of which have closely related consumer equivalents.
The benefits of consumer volume for supercomputing don't stop with CPUs. Since vector performance is so important to floating-point computation, the burgeoning speed of graphics cards also promises further massive leaps in supercomputing power, particularly when harnessed by distributed computing. The latter is racking up some rather impressive processing scores. The Folding@Home project had reached a whopping 4.27 PFLOPS by 14 November 2008, making it the fastest supercomputer in the world by a country mile. Most tellingly, over half of this total was contributed by ATI and Nvidia GPUs. But it's also very significant that 1.7 PFLOPS of that total came from PlayStation 3 games consoles. In fact, IBM's Roadrunner, which is currently the fastest standalone supercomputer in the world, uses nearly 13,000 Cell processors that are closely related in design to the CPU in a PlayStation 3.
So the future of supercomputing could be sitting on your desk right now. As the Folding@Home project shows, distributed computing is already capable of achieving greater performance than the fastest standalone machines. Now that more than half the households in the developed world are online, the fabric of the Internet itself may be the future of the fastest computing on the planet. Google certainly seems to think so. Its search engines are estimated to have over 300 TFLOPS at their disposal, and with the company getting into the application outsourcing business, maybe it won't be too long before anyone can have their very own slice of a supercomputer.
I believe that a similar AN/FSQ-7 unit was at Griffiss AFB in Rome, NY in the 50's - my father got his start as an electronic tech replacing vacuum tubes here. He described it as "one floor of computers, one floor of air conditioning. Repeat."
ASCI White, circa 2001
NEC EarthSimulator in Japan
Just a correction: Roadrunner is not a BlueGene and it is installed at LANL not LLNL. The BGL at LLNL is far from the fastest computer in the world now. NNSA is part of DOE not DOD, as the US nuclear weapons program is under civilian control, as it has been for many decades.
Here is a primer on Folding@home, and how the combined 350,000 computers work to make Folding@home the largest supercomputer, albeit a distributed system.
Folding@Home FAQ for new users:
What is Folding@Home? A Stanford University project to find out how proteins fold.
Why it's important: Proteins folding wrong causes all kinds of diseases, like Alzheimer's, Parkinson's, and forms of cancer. Folding@Home uses novel computational methods and large scale distributed computing, to simulate timescales thousands to millions of times longer than previously achieved. Through Folding@home, scientists now have the horsepower to study the mechanics of protein folding. With its ability to share the workload among hundreds of thousands of computers economically, Folding@home can help scientists understand how proteins snap, or don't, into their predestined shapes - and may help to explain the origins of diseases such as Alzheimer's and apparently unrelated diseases. We're fueling research that could end all that.
How does it work?: You download a safe, tested program (see link below) that is certified by Stanford University. It gets work from Stanford, runs calculations using your spare computer power, and sends the results back to the University.
Is it safe? Yes! Folding@Home rarely affects computer performance and won't compromise your privacy in any way. It only uses the computing power you aren't using so it doesn't slow down other programs.
How do I get started folding for Team FreeRepublic?:
1.) Download the folding program from Stanford University's folding download page (Folding@home Client Download). Type in your desired user-name.
2.) Type in 36120 for the team number. THIS IS VERY IMPORTANT - if you get the number wrong, you won't be folding for team FreeRepublic!
3.) The third question asks, "Launch automatically at machine startup, installing this as a service?" - We recommend you answer YES. Otherwise you will have to manually start the program after every reboot.
How can my computer help? Even if they were given exclusive access to all of the world's supercomputers, Stanford still wouldn't have as much processing power as they get from the supercluster of people's desktop systems Folding@home relies on. Modern supercomputers are essentially a cluster of hundreds of processors linked by fast networking. But Stanford needed the power of hundreds of thousands of processors, not just hundreds.
There's no reason to not get involved! It's free, easy, and you can know you're helping every minute without lifting a finger.
List of Relevant Folding Links
Why Fold - Watch This !!
Other Useful Stuff - Links
Fahmon Third Party Monitoring Software
Past FreeRepublic Folding threads
That movie, plus the Neptune at Night series on PBS in the 80's, were among my favorites.
I’ve worked on seven from Top500 supercomputers back in the day...
“I’ve worked on seven Top500 supercomputers back in the day...”
We have more capability in an iPhone than in some of those older systems, at least in terms of FLOPS, certainly in terms of usability.
If you have recently upgraded a system, please consider reinstalling one of the new F@H consoles (much improved), a F@H tray client (makes it easy to start/stop F@H), one of the Mac/Linux/GPU/SMP folding clients, or run it on your PS3!
Thanks for all of your help to keep us high in the charts.
I’m afraid he’s detected a fault in the AE35 unit.
I cut my teeth as a programmer in the late '60s, and loved the work as there was always something new to learn, almost on a monthly basis. When we talked about the Cray and its power, it was almost in whispers. Major Ju-Ju. When the home computers came on the scene I was pretty blase about their improvements until I read where the average desktop of the day, and this was some years ago, had more power than the Cray. That really blew my mind and made me aware of how far we had come.
And I think there was a similar article that said something like the old 286-12 computers had more power than the computers on the Apollo missions. If so, I really have to tip my hat to those programmers.
Oh hell, your digital watch has more processing power than those old Apollo IBM computers.
I recently added a quad core AMD machine running Vista 64 bit. I was hoping to run the GPU core, but that didn’t work out, I was only getting a few hundred points a day, so I switched to the SMP core.
After some initial problems, it has been clicking along for a few months, but every time I try to install it as a service, the folding gets lost before long and just stops or refuses to upload results, etc, etc. So for now, I’m just running the console version in a window.
I read all I could find on this, but there is so much out of date info around, it is hard to figure out what works and what doesn’t.
Do you (or anyone here) happen to know if the SMP core can successfully run as a service under Vista 64 bit? If it can, I could use a few hints on how to do it. Just answering the questions in the setup program doesn’t seem to work, there must be other steps required.
Or worse, they will yield the false answers WANTED by those who follow their Gaia-inspired AGW religion.