Heads up: Fujitsu tips its hand to reveal exascale Arm supercomputer processor – the A64FX
The Register | 22 August 2018 | Chris Williams

Posted on 08/22/2018 10:51:33 AM PDT by ShadowAce

Hot Chips: Fujitsu has unfurled the blueprints for its homegrown high-performance Arm-based processors dubbed A64FX, the brains of its Post-K supercomputer.

The designs were shown on Tuesday at a gathering of semiconductor engineers in Silicon Valley. The Post-K is a 1,000 peta-FLOPS monster – an exascale machine – that will supersede Japan's SPARC64-based K supercomputer. It is due to go online in 2021, and has just completed a round of trials that demonstrated the processors work – to some degree, at least.

Post-K hopes to be the world's fastest publicly known supercomputer by the time it's fully powered up and consuming 30 to 40MW. Today, the top slot is held by the US government's Summit machine that uses IBM POWER9 and Nvidia Volta GV100 processors, along with Mellanox networking gear, to max out at 188 peta-FLOPS.

Crucially, it will be an exascale Arm-compatible supercomputer, a significant milestone for the CPU architecture that's famous for being in practically everyone's phones, hard drives, smart cards, and other embedded electronics, and has dreams of driving laptops and servers.

So what does a Fujitsu-designed supercomputer Arm processor look like? Here's what we learned from Fujitsu's Toshio Yoshida at the Hot Chips engineering conference in Santa Clara: the A64FX has 8.8 billion 7nm FinFET transistors in a package with 594 pins, and 48 CPU cores plus four management cores. Each chip has a total of 32GB of high-bandwidth memory (HBM2), 16 PCIe 3.0 lanes, and a 1024GB/s total memory bandwidth, and hits at least 2.7 tera-FLOPS in terms of performance.

The 52 CPU cores are split into four clusters, each with 12 main cores plus one management core. Each cluster has 8GB of HBM2 rated at 256GB/s, and 8MB of shared L2 cache. There is cache coherency across the clusters and the whole chip.
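The per-cluster figures can be cross-checked against the per-chip totals quoted earlier. The sketch below is only that arithmetic; the constant and function names are ours, not Fujitsu's.

```python
# Per-cluster figures as quoted in the article. Names are illustrative.
CLUSTERS = 4
MAIN_CORES_PER_CLUSTER = 12
MANAGEMENT_CORES_PER_CLUSTER = 1
HBM2_GB_PER_CLUSTER = 8
HBM2_GB_PER_S_PER_CLUSTER = 256
L2_MB_PER_CLUSTER = 8


def chip_totals():
    """Multiply each per-cluster figure by the number of clusters."""
    main = CLUSTERS * MAIN_CORES_PER_CLUSTER
    management = CLUSTERS * MANAGEMENT_CORES_PER_CLUSTER
    return {
        "main_cores": main,
        "management_cores": management,
        "total_cores": main + management,
        "hbm2_gb": CLUSTERS * HBM2_GB_PER_CLUSTER,
        "hbm2_gb_per_s": CLUSTERS * HBM2_GB_PER_S_PER_CLUSTER,
        "l2_mb": CLUSTERS * L2_MB_PER_CLUSTER,
    }


if __name__ == "__main__":
    for name, value in chip_totals().items():
        print(f"{name}: {value}")
```

The totals line up with the headline numbers: 48 main cores plus four management cores, 32GB of HBM2, and 1024GB/s of memory bandwidth.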

The chips are interconnected via Fujitsu's second-generation Tofu mesh-torus-like network. This interconnect can shift data in and out of each processor chip via 10 ports, each with two lanes that max out at 28Gbps apiece.
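As a rough guide, the port and lane counts imply a raw signalling rate per chip. The sketch below just multiplies the quoted figures. It assumes all lanes run flat out at once and ignores encoding and protocol overhead, neither of which the article specifies, so treat the result as an upper bound rather than a Fujitsu number.

```python
# Figures quoted in the article. Names are illustrative.
PORTS = 10
LANES_PER_PORT = 2
GBITS_PER_S_PER_LANE = 28


def raw_interconnect_gbits_per_s():
    """Raw signalling rate across every lane of every port, in Gbit/s."""
    return PORTS * LANES_PER_PORT * GBITS_PER_S_PER_LANE


def raw_interconnect_gbytes_per_s():
    """Same figure in GB/s, before any encoding or protocol overhead."""
    return raw_interconnect_gbits_per_s() / 8


if __name__ == "__main__":
    print(raw_interconnect_gbits_per_s(), "Gbit/s raw")
    print(raw_interconnect_gbytes_per_s(), "GB/s raw")
```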

Caches and access speeds for the Fujitsu A64FX

The A64FX's cache hierarchy and speeds, for the 12 compute cores and management core per cluster, four clusters to a chip ... Source: Fujitsu
The CPU cores are 64-bit only, with no 32-bit mode, and they use the Armv8.2-A instruction set. They support Arm's scalable vector extension (SVE), implemented here as 512-bit-wide SIMD, which we have described in detail previously. It means the chips can crunch vector and matrix calculations in hardware, a must for supercomputer and machine-learning applications. The cores also support 16 and 8-bit integer math, which is useful for AI inference code, as well as the usual floating-point precisions (FP16, 32, and 64).

The A64FX is a superscalar, out-of-order execution beast, and the first Armv8.2-A design, we're told. Folks who have done 32 and 64-bit Arm assembly programming will know the architecture has fixed-width instructions, typically one operation per instruction, as per the classic RISC school of thinking. Interestingly enough, the A64FX, by implementing SVE, has an instruction prefix for its four-operand fused-multiply-add instruction (FMA4), an incredibly useful operation, that kinda reminds this vulture of x86 instruction prefixes.

To perform the calculation r0 = r3 + r1 * r2, you use two instructions that are merged into one at the pre-decode stage, and are performed in one step despite being fetched as two instructions. These are:

movprfx r0, r3      ; prefix next instruction
fma3 r0, r0, r1, r2 ; r0 = r3 + r1 * r2, the r3 substituted in
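To make the register shuffling concrete, here is a minimal Python model of that pair, assuming plain scalar registers held in a dict. The function names are ours and this is not real SVE code. It only shows that the prefix copy followed by a destructive multiply-add gives the same result as a single four-operand operation, while leaving r3 intact.

```python
def movprfx(regs, dst, src):
    """Copy src into dst. On its own this is just a register move."""
    regs[dst] = regs[src]


def fma_destructive(regs, acc, a, b):
    """Multiply-add that overwrites its accumulator: acc = acc + a * b."""
    regs[acc] = regs[acc] + regs[a] * regs[b]


def run_pair(regs):
    """Run the two instructions from the article one after the other."""
    movprfx(regs, "r0", "r3")
    fma_destructive(regs, "r0", "r1", "r2")
    return regs


def run_fused(regs):
    """What the merged operation computes in one step: r0 = r3 + r1 * r2."""
    regs["r0"] = regs["r3"] + regs["r1"] * regs["r2"]
    return regs


if __name__ == "__main__":
    # Integers are used so rounding does not come into it; real hardware
    # FMA rounds once, which this model does not attempt to reproduce.
    start = {"r0": 0, "r1": 3, "r2": 4, "r3": 2}
    print(run_pair(dict(start)))
    print(run_fused(dict(start)))
```

The model assumes r0 is distinct from r1 and r2. If the destination doubled as a multiplicand, the copy would clobber it before the multiply.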

Each CPU core's execution unit can handle two 512-bit SIMD operations at once. The input data is packed into 512 bits and crunched in one go, like Intel's AVX-512 operations on its server parts. So you could feed in four 8-bit values and four corresponding 8-bit coefficients or weights, which are multiplied pairwise to get four products that are then added to a 32-bit offset and written out to a register.
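One way to read that 8-bit example is as a four-way dot product accumulated into a 32-bit lane. The sketch below models a single lane and then a full 512-bit register's worth of lanes. It assumes signed values and wrap-around on overflow, which the article does not spell out, and all names are ours.

```python
LANES_PER_VECTOR = 512 // 32  # sixteen 32-bit lanes in one 512-bit register


def _check_int8(x):
    """Reject anything that would not fit in a signed 8-bit element."""
    if not -128 <= x <= 127:
        raise ValueError(f"{x} does not fit in a signed 8-bit element")
    return x


def _wrap_int32(x):
    """Wrap an integer into the signed 32-bit range (assumed behaviour)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def dot4_accumulate(values, weights, offset):
    """One lane: offset + sum of four 8-bit products, kept to 32 bits."""
    if len(values) != 4 or len(weights) != 4:
        raise ValueError("each lane takes exactly four values and four weights")
    products = [_check_int8(v) * _check_int8(w) for v, w in zip(values, weights)]
    return _wrap_int32(offset + sum(products))


def vector_dot4(values, weights, offsets):
    """A whole register: sixteen lanes, so 64 8-bit multiplies per operation."""
    if len(offsets) != LANES_PER_VECTOR:
        raise ValueError("expected one 32-bit offset per lane")
    return [
        dot4_accumulate(values[4 * i:4 * i + 4], weights[4 * i:4 * i + 4], offsets[i])
        for i in range(LANES_PER_VECTOR)
    ]


if __name__ == "__main__":
    print(dot4_accumulate([1, 2, 3, 4], [5, 6, 7, 8], 10))
```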

Fujitsu reckons its A64FX can hit 21.6 TOPS (trillion or tera operations per second) when doing 8-bit integer math; 10.8 TOPS with 16-bit integers; 5.4 TOPS with 32-bit; and 2.7 TOPS with 64-bit, all when performing integer SIMD. Overall, it's said the A64FX is at least 2.5 times faster than Fujitsu's previous supercomputer processor, the SPARC64 XIfx, at HPC and AI work.
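Those four figures follow a simple pattern: halving the element width doubles the number of elements that fit in a 512-bit register, and so doubles the operations per second. A small sketch, taking the 64-bit figure as the baseline (the function name is ours):

```python
BASELINE_TOPS_AT_64_BIT = 2.7  # quoted by Fujitsu for 64-bit integer SIMD


def tops_for_width(bits):
    """Scale the 64-bit figure by how many more elements fit per register."""
    if bits not in (8, 16, 32, 64):
        raise ValueError("expected an element width of 8, 16, 32 or 64 bits")
    return BASELINE_TOPS_AT_64_BIT * (64 // bits)


if __name__ == "__main__":
    for bits in (8, 16, 32, 64):
        print(f"{bits}-bit: {tops_for_width(bits):.1f} TOPS")
```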

Nvidia's P4 and P40 accelerators for servers clock in at 22 and 47 TOPS with 8-bit integer, for what it's worth.

The L1 cache has a combined gathering mechanism that can fetch consecutive elements in arrays and copy them into a register. So, for example, you could use this to hoover up eight bytes spread over memory into one 64-bit register, each byte slotted into its own byte position in the register. The per-core four-way 64KB L1 data cache is read by the instruction engines at 230GB/s, and written back at 115GB/s. The L2 shared cache feeds data in at 115GB/s, and receives at 57GB/s.
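The eight-bytes-into-one-register example can be sketched in a few lines of Python. This models only the end result, not how the cache does it. The function name is ours, and placing the first byte in the lowest byte position is our assumption.

```python
def gather_bytes(memory, addresses):
    """Pack eight bytes from arbitrary addresses into one 64-bit value.

    The byte at addresses[i] lands in byte position i of the result,
    with position 0 being the least significant byte (an assumption).
    """
    if len(addresses) != 8:
        raise ValueError("expected exactly eight addresses")
    register = 0
    for position, address in enumerate(addresses):
        register |= memory[address] << (8 * position)
    return register


if __name__ == "__main__":
    memory = bytes(range(256))  # toy memory where each byte equals its address
    scattered = [0x10, 0x80, 0x03, 0xF0, 0x42, 0x07, 0xAA, 0x01]
    print(hex(gather_bytes(memory, scattered)))
```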

Pipeline stages of the A64FX ... Source: Fujitsu
Per-chip power usage is monitored and controlled on a per-millisecond basis, and down to the nanosecond per-core. Fujitsu claims its A64FX has mainframe-grade resilience, with ECC or duplication on all caches, parity checks within the execution units, instruction retry when an error is detected, error recovery on the Tofu interconnect links, and 128,000 error checkers in total on the chip.

The whole shebang runs Linux, with a Lustre-based distributed file system and non-volatile memory for accelerating file input-output. The toolchain includes C, C++ and Fortran compilers, MPI, OpenMP, debuggers, and other utilities and languages.

You'll note there are no third-party accelerators: it's pure Arm, Fujitsu's way. The aim is to design a chip that runs supercomputer-style applications – simulations, science experiment analysis, machine learning, and other number crunching – with a higher performance-per-watt than general-purpose CPUs.

Yoshida didn't want to talk about clock speeds and individual chip power usage just yet, sadly. The machine is still years away from being finished, and all the specifications and implementation details have yet to be nailed down or revealed. "We will continue to develop Arm processors," he told the conference, though. Despite its delays, Fujitsu hasn't been put off Arm big iron.

And, er, yes, you might be able to play Crysis on it. ®

TOPICS: Computers/Internet
KEYWORDS: arm; supercomputer

1 posted on 08/22/2018 10:51:34 AM PDT by ShadowAce

To: rdb3; Calvinist_Dark_Lord; JosephW; Only1choice____Freedom; amigatec; Ernest_at_the_Beach; ...

2 posted on 08/22/2018 10:51:52 AM PDT by ShadowAce (Linux - The Ultimate Windows Service Pack)

To: ShadowAce

This stuff left me in the dust back in the Dual Core days.

3 posted on 08/22/2018 10:53:47 AM PDT by Rebelbase (Consensus isn't science.)

To: ShadowAce

Can we link to our 3-D printers? I’m itchin’ to print me up a 12 gauge street sweeper jojo biden model. ;-)

4 posted on 08/22/2018 11:02:18 AM PDT by rktman (Enlisted in the Navy in '67 to protect folks rights to strip my rights. WTH?)

To: Rebelbase

It will never work, no punch card interfaces.

5 posted on 08/22/2018 11:02:26 AM PDT by ImJustAnotherOkie (All I know is what I read in the papers.)

To: ShadowAce

>>The toolchain supports C, C++ and Fortran compilers,<<

Fortran!!! Whoo Hoo!!! A MAN’S language.

Those who hate the printf and variants in other languages can thank Fortran for FORMAT statements.

And none of that sissy WatFor either!!

6 posted on 08/22/2018 11:05:05 AM PDT by freedumb2003 (As always IMHO)

To: ShadowAce

Jeff Bezos’ kid wants one to play the latest games on.

7 posted on 08/22/2018 11:17:03 AM PDT by 2ndDivisionVet (You cannot invade the mainland US. There'd be a rifle behind every blade of grass.)

To: ShadowAce

OOOOH!........FASTER pr0N!............

8 posted on 08/22/2018 11:17:17 AM PDT by Red Badger (July 2018 - the month the world learns the TRUTH......Q Anon.......Timelines change. Aug 16)

To: ShadowAce

You would think with all these super-fast computers, cancer would be cured by now, all crimes would be solved, and they can predict earthquakes before they happen, and predict extreme weather sooner.

The old mantra still applies....Garbage in—>into faster computer—>garbage out faster.

9 posted on 08/22/2018 11:25:35 AM PDT by entropy12 (Trump/Pence 2020)

To: freedumb2003

I used Fortran for 23 years, then came along dbase 5. If I had dbase 23 years sooner I could have developed more expert software for custom designed 1000 HP machines, in half the time.

10 posted on 08/22/2018 11:28:33 AM PDT by entropy12 (Trump/Pence 2020)

To: ShadowAce

The question is, when will someone invent the unhackable computer immune to any and all malware/virus etc.?

11 posted on 08/22/2018 11:32:26 AM PDT by 1Old Pro

To: ShadowAce

How many quarters will it take?

-PJ

12 posted on 08/22/2018 11:36:33 AM PDT by Political Junkie Too (The 1st Amendment gives the People the right to a free press, not CNN the right to the 1st question.)

To: ShadowAce

If it cannot run DOS it is of no use to me..

13 posted on 08/22/2018 11:36:44 AM PDT by SuperLuminal (Where is another agitator for republicanism like Sam Adams when we need him?)

To: entropy12

DBase was ahead of its time.

14 posted on 08/22/2018 11:39:13 AM PDT by freedumb2003 (As always IMHO)

To: freedumb2003

The main feature dBase has over Fortran is the ability to load data much more easily into data files, and the ability to massage the data manually without having to write an executable.

Every time I modified data in a Fortran program, I had to RECOMPILE the entire program! In dBase I could delete or add data without having to recompile.

15 posted on 08/22/2018 11:47:08 AM PDT by entropy12 (Trump/Pence 2020)

To: ShadowAce

Good, Battlefield V will be out soon

16 posted on 08/22/2018 11:48:07 AM PDT by AppyPappy (Don't mistake your dorm political discussions with the desires of the nation)

To: ShadowAce

Just imagine how fast it will bounce the cards in Solitaire!

Oh wait. It is apparently only designed for real operating systems...

Seriously though, this looks really freaking interesting.

17 posted on 08/22/2018 11:48:13 AM PDT by zeugma (Power without accountability is fertilizer for tyranny.)

To: 1Old Pro

“The question is, when will someone invent the unhackable computer immune to any and all malware/virus etc.?”

Linux machine that is not connected to a network. Boom! Your prayers are answered.

18 posted on 08/22/2018 11:50:22 AM PDT by AppyPappy (Don't mistake your dorm political discussions with the desires of the nation)

To: ShadowAce

If it's on Samsung's 7nm process, it's not yet ready for prime time.

19 posted on 08/22/2018 11:51:28 AM PDT by Zathras

To: entropy12

20 year COBOL developer.

I remember that. Just like storing data in binary or packed format to save storage.

Interpreted languages are great, but sadly when they don’t require strong typing developers get lazy.

I ALWAYS declare no matter what the language.

20 posted on 08/22/2018 11:53:44 AM PDT by freedumb2003 (As always IMHO)
