Posted on 08/19/2005 10:33:50 AM PDT by Ernest_at_the_Beach
The hidden currents powering Intel's next gen chips
Out of order speculation
By Nicholas Blachford: Thursday 18 August 2005, 07:20
If it were just a Pentium M variant, I don't think there'd be such a fuss about it. Intel is portraying this as the biggest change since the original P4, yet there have been several new cores introduced since then, including the Pentium M itself. No, this change is bigger.
The change is so big, in fact, that it's the reason for Apple's processor switch. Indeed, the phrase Steve Jobs used when announcing the switch, "performance per watt", is the very same phrase being used by Intel spokesmen.
All we know is that it's going to be multi-core, 64-bit, and will support hyper-threading. The problem is that trying to do all this at the same time isn't going to reduce power consumption; in fact, doing all this makes an increase in power consumption more likely.
There are ways to decrease power consumption, but many of these seem to have been used already in the Pentium M series. They can go further, but IBM has already gone beyond this in the Cell and the Xbox 360's PowerPC cores. Perhaps Intel is planning something rather more radical.
The only hint is some comments from Intel apparently saying the processor will be structurally different but will have no problems running the same apps. When has Intel ever had to say this? It can normally be assumed a new core will run the same apps - unless, of course, it's radically different.
So, what is Intel up to?
According to the Apple announcement, the reason it is switching is "performance per watt". Steve Jobs showed a graph with PowerPC projected at 15 computation units per watt and Intel projected at 70 units per watt. Intel must have figured out a way to improve performance per watt more than four-fold (70 ÷ 15 ≈ 4.7). How? Can this even be done?
Yes, it can be done, but it requires striking changes in processor design. The forthcoming Cell processor's SPEs use just two to three watts at 3.2GHz and yet are said to be as fast as any desktop processor. I think we can safely assume a future Intel device will not use SPEs instead of x86 processors, but it could use some of the same techniques to bring power consumption down.
Modern microprocessors throw millions of transistors at producing increasingly small performance boosts. The SPEs' designers didn't do this; they only used transistors if they could be shown to produce a large performance boost. The result is, in essence, the antithesis of modern microprocessor design: the SPEs are very simple, with a relatively short pipeline, strictly in-order execution and no branch prediction.
An extremely stripped-back x86 design can be done, and has been, but performance doesn't so much suffer as get tortured to death. Out-of-order execution seems to be pretty critical to x86 performance, most likely due to the small number of architectural registers. Then there is the x86 instruction decoder, which on simple processors takes up a significant amount of room and, of course, consumes power. Even the stripped-back designs can't remove this.
However, there was one company which took a more radical approach, and while its processor wasn't exactly blazing fast, it was faster than those using the stripped-back approach; what's more, it didn't include an x86 instruction decoder. That company was Transmeta, and its line of processors weren't x86 at all: they were VLIW (Very Long Instruction Word) processors which used "code morphing" software to translate x86 instructions into their own VLIW instruction set.
Transmeta, however, made mistakes. During execution, its code morphing software had to keep jumping in to translate x86 instructions into the native VLIW instruction set. The translation code had to be loaded into the CPU from memory, and this took up considerable processor time, lowering the CPU's potential performance. Transmeta could have solved this with additional cache or even a second core, but keeping costs down was evidently more important. The important thing is that Transmeta proved it could be done; the technique just needs perfecting.
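To make the mechanism concrete, here is a toy sketch in C of the translation-cache idea behind code morphing. It is my own illustration, not Transmeta's actual design: translate_block() and run_block() are invented stand-ins, and the point is simply that each x86 block pays the translation cost once and runs from the cache thereafter.

    /* translation_cache.c -- toy sketch of a code-morphing translation
     * cache: translate each x86 block once, reuse the VLIW version. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_SLOTS 4096

    typedef struct {
        uint64_t x86_addr;   /* address of the original x86 basic block */
        void    *vliw_code;  /* the translated VLIW code (stubbed here) */
    } Translation;

    static Translation cache[CACHE_SLOTS];
    static unsigned misses, runs;

    /* Stand-in for the expensive translator the code-morphing software
     * had to keep re-entering; a real one would emit VLIW code. */
    static void *translate_block(uint64_t x86_addr)
    {
        (void)x86_addr;
        misses++;
        return malloc(64);           /* pretend this holds VLIW code */
    }

    static void run_block(void *vliw_code)
    {
        (void)vliw_code;
        runs++;                      /* pretend we executed natively */
    }

    static void execute(uint64_t x86_addr)
    {
        Translation *slot = &cache[x86_addr % CACHE_SLOTS];
        if (slot->x86_addr != x86_addr || slot->vliw_code == NULL) {
            slot->x86_addr  = x86_addr;
            slot->vliw_code = translate_block(x86_addr); /* slow path */
        }
        run_block(slot->vliw_code);                      /* fast path */
    }

    int main(void)
    {
        /* A loop re-executes the same blocks, so the translator runs
         * once per block; everything after that is a cache hit. */
        for (int pass = 0; pass < 100; pass++)
            for (uint64_t addr = 0x1000; addr < 0x1100; addr += 16)
                execute(addr);
        printf("translations: %u, block runs: %u\n", misses, runs);
        return 0;
    }

Run on code with any locality, almost every execution is a cache hit; Transmeta's problem was that the slow path, plus fetching the translator itself from memory, fired far too often.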
Intel, on the other hand, can and does build multi-core processors, and has no hesitation in throwing on huge dollops of cache. The Itanium line, also VLIW, includes processors with a whopping 9MB of cache. Intel can solve the performance problems Transmeta had because this new processor is designed to have multiple cores, and while it may not have 9MB, it will certainly have several megabytes of cache.
Intel likes to call its technique "EPIC" (Explicitly Parallel Instruction Computing) rather than VLIW, but it's the same thing really.
Intel can make a VLIW processor with a large number of small, low-power cores and devote one or more of these to translating x86 into the VLIW ISA. The translation software would be held partly in the bigger cache, so it would rarely need to hit RAM. Intel could even do this with a dedicated thread per core, but that would need a big shared cache.
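On the software side, dedicating a core to translation could look something like the following sketch, which assumes a Linux/glibc threading API; the translator_loop() body is a placeholder of my own invention.

    /* dedicated_translator.c -- sketch of pinning a translator thread
     * to one core so translation never competes with the cores that
     * are executing already-translated VLIW code. Linux/glibc only. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Placeholder: a real translator would pull untranslated x86
     * blocks off a queue and emit VLIW code for the other cores. */
    static void *translator_loop(void *arg)
    {
        (void)arg;
        puts("translator running on its dedicated core");
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);            /* reserve core 0 for translation */

        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid, &attr, translator_loop, NULL);

        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }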
Intel has a lot of experience of VLIW processors from its Itanium project, which has now been going on for more than a decade. Intel also now has HP's expertise on board, as HP's entire Itanium design team was recently transferred to Intel.
Another technology Intel has access to is DEC's FX!32. This was written in the mid-1990s and allowed x86 software to run on Alpha RISC microprocessors. A lot of the Alpha people and technology were transferred to Intel, and FX!32 most likely went with them; indeed, Intel has already been developing similar technology to run x86 binaries on Itanium for quite some time now.
It gets better. Both the Itanium and the Transmeta designs were said to be inspired by VLIW designs built in Russia by a company called Elbrus. Intel did a deal with Elbrus in mid-2004, then went on to buy the company in August 2004. The exact nature of the deal is unclear, however, as another company continued the work and taped out the E2K processor earlier this year.
Most interesting, though, is the E2K's compiler technology, which allows it to run x86 software. This is exactly the sort of technology Intel needs, and since last year it has had access to it and employed many of its designers.
So, Intel has access to VLIW technology from the Itanium and HP, as well as the translation software from DEC. Most importantly, it has the highly advanced technology from Elbrus, which has been in development since the 1980s.
The New Architecture
To reduce power you need to reduce the number of transistors, especially the ones which don't provide a large performance boost. Switching to VLIW means Intel can immediately cut out the hefty x86 decoders.
The out-of-order hardware will go with them: it is huge, consumes masses of power, and in VLIW designs is completely unnecessary. The branch predictors may also go on a diet, or even be removed completely, as the Elbrus compiler can handle even complex branches.
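As a taste of how a compiler can stand in for branch-prediction hardware, here is a toy if-conversion example in C (my own illustration, not anything from Elbrus): the branch becomes straight-line code that computes both sides and selects a result, so there is nothing left to mispredict.

    /* if_conversion.c -- the branchy and branchless forms of max(). */
    #include <stdio.h>

    /* Branchy version: needs a predictor to run fast on a pipeline. */
    static int max_branchy(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

    /* If-converted version: both sides are computed and the result
     * is selected with a mask, so no branch ever executes. */
    static int max_branchless(int a, int b)
    {
        int take_a = -(a > b);           /* all-ones mask if a > b */
        return (a & take_a) | (b & ~take_a);
    }

    int main(void)
    {
        printf("%d %d\n", max_branchy(3, 7), max_branchless(3, 7));
        return 0;
    }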
With the x86 baggage gone, the hardware can be radically simplified - the limited architectural registers of the x86 will no longer be a limiting factor. Intel could use a design with a single large register file covering integer, floating point and even SSE; 128 x 64-bit registers sounds reasonable (each SSE register could map to two 64-bit registers).
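Purely as an illustration of that register-file speculation (every name and number below is my own invention), the mapping might look like this, with each 128-bit SSE register occupying two adjacent 64-bit slots in one flat file:

    /* unified_regfile.c -- sketch of a flat 128 x 64-bit register
     * file shared by integer, FP and SSE state. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t regfile[128];   /* one flat file for everything */

    /* Map architectural SSE register n onto two adjacent slots. */
    static void write_sse(int n, uint64_t lo, uint64_t hi)
    {
        regfile[2 * n]     = lo;    /* low 64 bits of XMMn  */
        regfile[2 * n + 1] = hi;    /* high 64 bits of XMMn */
    }

    int main(void)
    {
        write_sse(3, 0x1111222233334444ULL, 0x5555666677778888ULL);
        printf("XMM3 occupies slots %d and %d\n", 2 * 3, 2 * 3 + 1);
        return 0;
    }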
Rumours suggesting the cores will be four-issue wide sound perfectly reasonable for a VLIW processor. At least two (hyper)threads will almost certainly be supported, but more would require more registers, not to mention giving Intel something of a naming problem - Ultra-hyper-threading?
You can of course expect all these cores to support 64-bit processing and SSE3, and you can also expect there to be lots of them. Intel's current Dothan cores are already tiny, but a VLIW core without out-of-order execution or the large, complex x86 decoders would be very small and very low power. Intel will be able to make processors stuffed to the gills with cores like this.
One interesting aspect of an architecture like this is that it gives Intel the ability to learn from it and change it in a way x86 never could.
Changing the basic x86 design would lead to all sorts of difficulties with compatibility, so instead, over the years, more and more has been added and little if anything removed.
With x86 decoding done in software, Intel will now be free to do as it pleases and can change the hardware at will. If the processor is weak in a specific area, the next generation can be modified without worrying about backwards compatibility. Apart from the speedup, nobody will notice the difference. Intel could even use different types of cores on the same chip for different types of problems.
One thing I do not expect is for the new core to be an Itanium derivative; the Itanium was not designed for low power. Building a new ISA gives Intel a chance to learn the lessons of the Itanium's sometimes erratic performance. Not that we'll see the new ISA: it will be hidden from developers underneath the software translation layer. A variant of this device could end up badged as an Itanium, though; the software translation should have no trouble converting one VLIW variant to another.
How Fast Will It Be?
Like the Transmeta devices, software will not run at its full potential until it's been fully translated, and you can pretty much bet Intel will make sure third-party benchmarkers are made well aware of this. I suspect we may also see speculative translation running in the background so everything gets translated and saved as soon as possible. Once translated, the new binaries are saved to disc and will run as native VLIW thereafter.
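That "translate once, save to disc" flow could work roughly like the sketch below; the cache path and file layout are my own invention. On a cold start the binary is translated and the VLIW image cached; every later run loads the cached image and skips the translator entirely.

    /* persist_translation.c -- sketch of a persistent translation
     * cache: cold starts translate and save, warm starts reuse. */
    #include <stdio.h>

    /* Pretend this buffer holds the VLIW code produced by a
     * background translation pass over one x86 binary. */
    static const unsigned char vliw_image[] = { 0xde, 0xad, 0xbe, 0xef };

    int main(void)
    {
        const char *cache_path = "/var/cache/vliw/app.translated";
        FILE *f = fopen(cache_path, "rb");

        if (f) {                    /* warm start: reuse saved code */
            printf("running previously translated image\n");
            fclose(f);
        } else {                    /* cold start: translate + save */
            f = fopen(cache_path, "wb");
            if (!f) { perror("fopen"); return 1; }
            fwrite(vliw_image, 1, sizeof vliw_image, f);
            fclose(f);
            printf("translated and cached for next run\n");
        }
        return 0;
    }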
The forte of this processor will be multithreaded code and multitasking. If you are doing lots of things at once you'll be well happy; servers in particular will benefit from this approach. Multitasking will benefit because different cores will get different tasks, and a user switching between them will not cause them to halt, so the responsiveness of systems with this processor will be very good.
Single-threaded performance, on the other hand, could be relatively weak, although that's not a given. I expect AMD will hold on to its single-threaded performance crown for now.
Conclusion
Based on the various comments and actions of Intel, as well as other companies, I think Intel is preparing to announce a completely new VLIW processor which uses software to decode x86 instructions and order their execution. It might be relatively weak on single-threaded code, but it'll more than make up for it in numbers: heavily multithreaded code should run very nicely indeed.
We'll see shortly if my speculation is correct. However, multiple processor vendors are already going in the same direction, towards a large number of simple cores. x86 hardware implementations don't lend themselves to the simplicity required for large multi-core devices, while a VLIW approach has already been shown to be workable and reduces both power consumption and size.
Historically, Intel has often adopted new techniques after they have been used by other vendors. Its real strength is taking those ideas, improving them, and then mass-manufacturing them.
I expect Intel will apply its full manufacturing skills to this device - this processor could have as many as 16 cores.
To date, Apple's CPU switch to Intel has prompted a lot of speculation about the real reason because, frankly, it didn't make much sense. But if this speculation turns out to be true, the reasons behind Apple's switch are obvious. µ
© Nicholas Blachford 2005
I want to see this....Apple in our future?
*************************************************
Intel 45 nanometre process is good to go
Leakage problem solved
By: Thursday 18 August 2005, 13:54
Think happy thoughts here, people: from what several sources have told the INQ, the leakage problem is solved, and I mean solved, not lessened. This will be a massive gain for Intel, and unless AMD and IBM can match it, it will pretty much hand Intel the mobile space, not to mention anything else where power matters.
From what we have been told, the 65nm process is better than 90nm on leakage, but it is an advance, not an answer by any means. Sadly, the 45nm process breakthroughs can't be backported to 65nm in a way that would do the same there. There are some other 45nm breakthroughs, and I am not using that word lightly, in the yield area that will get sucked into 65nm. Expect improvements on this process over time, and then a huge leap at 45nm.
What does this mean? It means that until 45nm, Yonah will survive; the power draw of Merom is a little high for most low-power laptop configs. Merom will get to 9W in a ULV dual-core part, but Yonah is down at 5.5W, albeit single-core. When Penryn hits, expect the power draw on the low end to go way down, and Yonah to be laid to rest.
So, if you hear gushingly good things about 45nm coming from IDF, believe it. If you hear anyone pooh-poohing Intel and its process tech because of the debacle that was 90nm, just point and laugh. This one will be very very good. µ
(1) I completely made that up, but do read a little into it.
Intriguing. I like the way this writer thinks. Thanks for the ping.
If that is the case, good, things will get really exciting again!
Will Longhorn support it...might be the next question....
But I would have to assume that some good programmers are working hard at seeing to it that Windows runs well on multiprocessors.
And for the operating system, that's the main change here. The compiler folks get to sweat the radical change in instruction set architecture, decoding, performance and optimization details in the registers, data paths and primary caches. The operating system folks (I guess that Longhorn is in essence an operating system, in addition to its large role as a marketing, bundling and competitive tool) see the multiple processors and the complex memory, bus and secondary-cache hierarchies.
Obviously the transition to mass produced multiple processors, cores, threads and such is the main event affecting operating systems this decade.
For someone who got into computers almost 30 years ago now, primarily to work on the issues of software coping with multiple processors, it's a good time to be around.