[SLUG] Modern x86 ISAs, Extensions and Platforms -- WAS: Overuse of SSE ...

From: Bryan J. Smith (b.j.smith@ieee.org)
Date: Mon Oct 11 2004 - 13:27:37 EDT


On Mon, 2004-10-11 at 12:03, Chad Perrin wrote:
> I'm probably opening a can of worms I shouldn't by asking this,

Yes, to someone who specialized in computer architecture (BSCpE, UCF
'97) and spent 20 months in the semiconductor industry, yes. ;-ppp

> but: What is SSE? I know what the FPU is, but you lost me with SSE.

First understand there are Instruction Set Architectures (ISA), which
are how the processor actually handles instructions, and then there are
"extensions" to the ISA. MMX, 3DNow! and SSE are "extensions." i486,
i586, i686 and RISC86 are ISAs.

SSE stands for Streaming SIMD (Single Instruction, Multiple Data)
Extensions. It's basically Intel's 2nd set of extension instructions,
after its initial Math Matrix eXtensions (MMX). Here's the x86 extension
lineage:

Intel MMX (Math Matrix eXtensions)
  Largely integer-only (useful for audio, 2D images, etc...)
P55C (Pentium MMX): i586 ISA, 2-issue ALU, 1-issue FPU
  A few improvements in ALU over massively buggy P54C ALU
P6-2 (Pentium 2): i686 ISA, 2-issue ALU, 2-issue FPU
  PPro (i686 ISA) superior ALU to Pentium, adds MMX
Nx686 (AMD K6 series): RISC86/i686 ISA, 3-issue ALU, 1-issue FPU
  Microcoded MMX to take advantage of unused ALU pipes

AMD 3DNow!
  Largely single-precision, some double-precision floating point
Nx686 (AMD K6 series): i686 ISA, 3-issue ALU, 1-issue FPU
  Addressed Pentium "de-optimizations" (like the ALU LOAD fix)
K7 (AMD Athlon): RISC86/i686 ISA, 3-issue ALU, 3-issue FPU
  Allowed more efficient use of unused FPU/registers pipes
K8 (AMD A64/Opteron): As K7, with dedicated XMM registers
  As K7 with augmented register renaming/out-of-order execution

Intel SSE and SSE2 (Streaming SIMD Extensions)
  Largely single and double-precision floating point
P6-3 (Pentium 3): 2-issue ALU, 2-issue FPU, 1-issue SSE
  Dedicated SSE pipe, designed for speed, not accuracy
P6-4 (Pentium 4): 2-issue ALU, 2-issue FPU, 2-issue SSE
  Added another pipe, SSE2 improvements, same design issue
K7 (AMD Athlon): RISC86/i686, 3-issue ALU, 3-issue FPU
  SSE/SSE2 Microcoded to use unused FPU/registers pipes
K8 (AMD A64/Opteron): As K7, with dedicated XMM registers
  As K7 with augmented register renaming/out-of-order execution
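
To make the "Single Instruction, Multiple Data" idea concrete, here is a
minimal C sketch (not part of the original post). It adds four
single-precision floats with one packed SSE add when the compiler
targets SSE, and falls back to a plain scalar loop otherwise. The
function name add4 is made up for illustration; the intrinsics come from
the standard <xmmintrin.h> header.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
void add4(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE__)
    /* One instruction operates on all four lanes at once. */
    __m128 va = _mm_loadu_ps(a);   /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    /* Scalar fallback: four separate adds. */
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Both paths produce the same four sums; the difference is only in how
many instructions it takes to get there.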

Intel Superscalar Generation 1: Pentium
Scalable from 60MHz - 300MHz

Pentium (i586), 4-issue (2 ALU, 1 FPU), released in 1992, was the first
superscalar x86 processor ever released. It had lots of bugs,
especially in its ALU. The ALU was so bad that id Software found it was
faster to load 2x 32-bit integers into the new, pipelined FPU registers,
and then issue several operations to move them into the ALU registers,
than to load them into the ALU directly. Such Pentium "hacks" or
"optimizations" were commonplace. Sadly enough, these optimizations not
only hurt clone x86 processors, but even Intel's own Pentium Pro
(i686). That's why i586 optimizations are _not_ recommended for modern
processors, but i686 instead. When MMX was added, Intel improved some
of the ALU design of the Pentium, but not

Intel Superscalar Generation 2: Pentium Pro - Pentium 3
Scalable from 200MHz - 1GHz, up to 1.5GHz with async hacks

Pentium Pro (i686), 7-issue (2 ALU, 2 FPU), released later in 1994, was
Intel's _final_and_last_ major x86 design. All _current_ Pentium
processors are based on it (see below). It fixed a lot of Pentium bugs,
and improved on the ALU significantly, bringing it closer to traditional
RISC performance (especially in the FPU). The Pentium 2 added MMX as
well. Unfortunately, the 2nd FPU pipeline really isn't one. Intel split
out the logic so that if the 1st pipeline is doing an ADD, the 2nd
pipeline can do work as well. The moment the 1st pipeline does a complex
FPU instruction, the 2nd one can't do anything.

The Pentium 3 (i686), 8-issue (2 ALU, 2 FPU, 1 SSE) was a "stop-gap"
redesign of the Pentium Pro/Pentium 2. Despite the "Coppermine"
codename, it did _not_ use copper interconnects. It added a new,
specialized FPU pipe for its new SSE instructions. The pipe is very
lossy (designed purposely to make errors for speed), and must be
specifically targeted with extensions (so Intel avoided having to
redesign the ALU+FPU core).
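
The post does not spell out what "lossy" means here. One
well-documented difference is that the x87 FPU keeps intermediates in
80-bit extended-precision registers, while SSE single-precision
operations round every result to 32 bits. The C sketch below (an
illustration added here, with made-up function names) does not model
either pipeline; it only shows how rounding to a narrower width after
every step lets error pile up in a long running sum.

```c
#include <assert.h>

/* Sum `term` n times, rounding to 32-bit float after every add. */
double sum_as_float(float term, long n)
{
    assert(n >= 0);
    volatile float acc = 0.0f;  /* volatile: store back to 32 bits each step */
    for (long i = 0; i < n; ++i)
        acc = acc + term;
    return (double)acc;
}

/* Same sum, but carried in 64-bit double. */
double sum_as_double(double term, long n)
{
    assert(n >= 0);
    double acc = 0.0;
    for (long i = 0; i < n; ++i)
        acc += term;
    return acc;
}

/* Absolute difference between a result and the expected value. */
double abs_error(double got, double expected)
{
    double d = got - expected;
    return d < 0.0 ? -d : d;
}
```

Summing 0.1 a million times and comparing both results against 100000
shows the float version drifting much further from the expected value
than the double version.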

Intel Superscalar Generation 2 Stage Redesign: Pentium 4, "Yamhill"
Scalable from 1GHz - 5GHz

The EPIC/Predication CS approach in Intel IA-64 _failed_ to work
efficiently in silicon, much as Digital Semiconductor had publicly
predicted it would. So with Itanium unable to even perform well at its
own instructions (let alone x86 compatibility), Intel had to redesign
its aging P6 core which had trouble scaling much beyond 1GHz.

The Pentium 4 (i686 ISA), 9-issue (2 ALU, 2 FPU, 2 SSE) was a simple
stage-extension redesign that took only 18 months (normally a core
redesign is 18 months plus). Its pipeline is now over 40 stages deep,
compared to the PPro-P3's sub-20.

This has introduced several issues. One is that the Pentium 4's ALU
performance is about 33% lower than the P3's, MHz for MHz, so a 1.5GHz
P4 is like a 1GHz P3. The FPU took an even greater hit, 50% slower than
the P3. To combat this, Intel introduced a 2nd SSE pipeline, with new
SSE2 extensions that make better use of the dual-SSE pipes. It also
released new, SSE-optimized compilers and added the optimizations to GCC
as well. Unfortunately, the SSE has nowhere near the accuracy of an
FPU. I've met several Intel engineers who worked on the P4 who have
openly complained that Intel would have been far better off spending the
extra 18 months redesigning the core than introducing yet another set of
half-baked extensions.
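
The "1.5GHz P4 is like a 1GHz P3" figure is just the clock scaled by
the per-MHz slowdown. A small C sketch of that arithmetic follows; the
function name is made up for illustration, and the 33% and 50% figures
are the ones quoted above, not measurements.

```c
#include <assert.h>

/* Hypothetical helper: clock (in MHz) of the reference chip that would
 * match a chip running at clock_mhz while doing `slowdown` (0.0 up to,
 * but not including, 1.0) less work per clock. */
double equivalent_mhz(double clock_mhz, double slowdown)
{
    assert(slowdown >= 0.0 && slowdown < 1.0);
    return clock_mhz * (1.0 - slowdown);
}
```

With the ALU figure, equivalent_mhz(1500.0, 0.33) comes out at roughly
1005, i.e. about a 1GHz P3. With the 50% FPU figure, the same 1.5GHz P4
is worth 750MHz of P3.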

The other is in control. There is no register renaming. The run-time
optimizations are minimal. And the branch predictor unit is one of the
worst in the x86 business. These were things that were supposed to be
eliminated in IA-64's EPIC-Predication approach, but failed (hence IA-64
is being retrofitted with newly acquired Alpha 364 designs for register
renaming, out-of-order execution and branch prediction). So when the P4
mis-predicts a branch (about 6% of the time), the entire chip must be
flushed, and it stalls.
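
A rough way to see why the deeper pipeline hurts: the C sketch below
estimates the average number of cycles lost per branch. It is a
back-of-the-envelope model added for illustration. The 6% miss rate and
the stage counts (over 40 versus sub-20) are the figures from this post;
treating the flush penalty as equal to the stage count, and using the
same miss rate for both chips, are simplifying assumptions.

```c
#include <assert.h>

/* Hypothetical model: expected cycles lost per branch, assuming every
 * mis-predict costs one full pipeline refill (pipeline_stages cycles). */
double flush_cycles_per_branch(double miss_rate, unsigned pipeline_stages)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return miss_rate * (double)pipeline_stages;
}
```

Under those assumptions a 40-stage pipe loses about 2.4 cycles per
branch on average, against about 1.2 for a 20-stage pipe.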

One workaround Intel came up with for this was HyperThreading, its name
for Simultaneous Multi-Threading (SMT). It exposes two independent
instruction streams whose scheduling is handled by the OS, not the
chip. Although it adds overhead at both the chip and OS level, it does
allow better scheduling of the pipes, and only half of the chip has to
"stall" on a branch mis-predict.

"Yamhill" was another quick redesign of the P4 to add some x86-64
instructions. It does _not_, however, feature _key_ x86-64 units like
the I/O Memory Management Unit (MMU). But such features would be
useless on the existing 32/36-bit AGTL+ "Front-Side Bottleneck"
implementation of the current P4/Xeon platform anyway.

[ SIDE NOTE: I have personally predicted that "Yamhill" is a 2 part
project. The first has been realized. The second will adopt Yamhill
for the 50-bit Infiniband/Itanium platform, and add _full_ x86-64
capability. ]

AMD Generation 0: Dead (original K5)
Scalable only to 110MHz
AMD Superscalar Generation 1: K5, K6
Scalable from 120-600MHz

The NexGen Nx586, released in 1994, was the second superscalar x86
processor. It had a 3-issue ALU, one more than the Pentium, 4-issue
total, and no FPU. It used a new ISA known as RISC86 -- which breaks
down variable 8-bit to 144-bit x86 instructions into fixed-size 32-bit
RISC that is almost totally devoid of microcode. Taking almost 8 years
to design, it only ran i386 instructions with some i486 support. NexGen
later added a non-pipelined FPU, along with full i486/TLB support, in
the Nx586-FP. That FPU actually performed better on individual
instructions than the Pentium's FPU, unless lots of instructions were
staged.

AMD bought NexGen and the Nx586-FP became later instances of the K5
(earlier K5s could not scale past 100MHz). These K5s carried a
performance rating higher than their actual clock speed. AMD then
released the revised Nx686
with full Intel Pentium Pro (i686 ISA) and MMX support -- far in advance
of Cyrix or IDT (who were still shipping an i486 ISA as late as several
years ago). The Nx686 sported probably the greatest branch predictor
unit ever designed (mega-overkill), which AMD scaled back in future
releases. It also was the first CISC processor to sport full register
renaming and out-of-order execution, things that you'd only find in
RISC. The FPU was still not pipelined, but it was very fast when given
instructions.

Later releases of the K6 series, the K6-2, K6-3, K6-2+, added 3DNow!,
AMD's extensions that added floating point rather than just integer
operations. These _pre-dated_ Intel's introduction of SSE (of which, SSE
"borrowed" some instructions, just like Intel IA-32e/EM64T now "borrows"
AMD x86-64).

AMD Superscalar Generation 2: Athlon, Athlon64/Opteron
Scalable from 500MHz - 3GHz

With the combined resources of newly acquired NexGen and Digital
Semiconductor talent, AMD set out to design an extremely flexible
instance of the RISC86 architecture. Most industry experts predicted
failure, but AMD proved otherwise.

From the get-go, Athlon (i686 ISA) was a 9-issue architecture, 3 ALUs
(same as before, why change what already kicks Intel butt) with 3 FPUs,
fully pipelined. It had staging similar to the PPro-P3 (sub-20), but
sported things like register renaming (far more registers added), far
more capable out-of-order execution and a branch prediction unit that
was more balanced (the K6's was overkill, for 5x as many transistors)
while keeping the state of half the pipes in a mis-predict. It competed
very well with the P3 ALU/FPU, and ravaged the P4 ALU/FPU once it was
released, while scaling far better than the P3 at less power usage than
the P4, MHz for MHz.

Probably the "hair puller" for Intel was the Athlon's 3-issue FPU. It
could do 2 complex _plus_ 1 ADD/MULT _simultaneously_. That meant it
could _surpass_ the capabilities of the P3's 2 FPU + 1 SSE, especially
since the Athlon could do (3) MULT simultaneously (very heavily used in
matrix operations), and the P3 could only do (1) MULT (which it
considers "complex") in the FPU and (1) "lossy" MULT in the SSE pipe.
Hence why the P4 was introduced with another SSE pipe.

But it got worse for Intel. Whenever Intel introduces more SSE
instructions, it has to modify some of the SSE pipes. Not for AMD. It
merely breaks out new microcode to do them in its unused FPU pipes! In
fact, with register renaming built into the core, it's cake for AMD!
All the while, you are getting the _true_ FPU accuracy of AMD's 3-issue
x87 FPU. Talk about driving Intel nuts!

The Athlon also sported a new platform, the EV6 from the Alpha 264.
Unlike Intel "bus" SMP, EV6 is a 16-point crossbar MP. AMD never took
it beyond dual-processor, but it could have. Alpha 264, which used the
same interconnect, was available in up to 14-way designs. Because EV6
_was_ a 64-bit interconnect, it sported _true_ 40-bit addressing.
Furthermore, because the EV6 interconnect is a "crossbar," transfers
between memory and CPU and I/O and CPU could be independent. This
presented a challenge for AMD.

Because of this, the I/O management portion of the chipset was moved
onto the Athlon MP processor itself. This 40-bit unit that handled the
functionality of the "AGPGart" was really a basic I/O memory management
unit (MMU). So even the 32-bit Athlon was better as a >4GB platform
than Xeon, _if_ you A) had a BIOS that supported "Linux" as a memory
option, and B) loaded a Linux/x86 kernel that supported EV6/40-bit
addressing on the Athlon.
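
For a sense of scale, the address widths mentioned in this post
translate to maximum addressable memory as 2^bits bytes. The C sketch
below is added for illustration; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes addressable with `bits` address bits. */
uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);  /* 1 << 64 would overflow a 64-bit integer */
    return (uint64_t)1 << bits;
}

/* Same quantity expressed in GiB (2^30 bytes); needs bits >= 30. */
uint64_t addressable_gib(unsigned bits)
{
    assert(bits >= 30 && bits < 64);
    return addressable_bytes(bits) >> 30;
}
```

That gives 4 GiB for 32 bits, 64 GiB for 36 bits, 1024 GiB (1 TiB) for
40 bits, and 262144 GiB (256 TiB) for 48 bits.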

[ SIDE NOTE: The infamous 4M paging bug of the Athlon was due to the
fact that this unit only supported 4K paging. The 4M paging mode was
enabled by Microsoft because it improved performance on Pentium
systems. It does nothing for Athlon, because its 4K paging is already
superior, but it did introduce instability with AGP. ]
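
The side note does not say why 4M pages help on some processors. One
general reason larger pages can help is that far fewer page mappings
are needed to cover the same amount of memory. The C sketch below, added
for illustration with a made-up function name, just counts pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pages of page_size bytes needed to cover
 * `bytes` of memory, rounding up to a whole page. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return bytes / page_size + (bytes % page_size != 0 ? 1 : 0);
}
```

Covering 4 GiB takes 1048576 pages at 4K but only 1024 pages at 4M.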

64-bit Athlon64/Opteron is actually an extension of 32-bit Athlon. In
addition to adding 64-bit extensions to registers, and a new set of XMM
registers for renaming (largely for SSE -- even code from Intel's own
optimizing compilers seems to excel on Athlon64/Opteron thanx to XMM
renaming), the underlying platform is totally changed. The 40-bit EV6
gives way to a multi-point, _per_-processor 48-bit (maximum allowable
for i486 TLB compatibility) platform.

The major changes include another unit for a formal I/O MMU, which ties
_directly_ into its _multiple_ front-side busses -- from 2 to 5,
depending on the model. Athlon64/Opteron processors sport a "total
aggregate front-side throughput" of _at_least_ 12.8GBps/processor, and
up to 25.6GBps/processor -- compared to Intel only offering a maximum of
8.4GBps/processor with the latest DDR2 memory versions/chipsets.

The new NUMA/HyperTransport model requires the I/O MMU, offering not
only "processor affinity" for programs/memory, but I/O as well.
For more on how NUMA/HT works for servers, see the new November 2004
issue of "Sys Admin" magazine (the #1 UNIX/Linux mag in print
circulation):

"Dissecting PC Server Performance"
  http://www.sysadminmag.com/current/

-- 
Bryan J. Smith                                  b.j.smith@ieee.org 
------------------------------------------------------------------ 
"Communities don't have rights. Only individuals in the community
 have rights. ... That idea of community rights is firmly rooted
 in the 'Communist Manifesto.'" -- Michael Badnarik

----------------------------------------------------------------------- This list is provided as an unmoderated internet service by Networked Knowledge Systems (NKS). Views and opinions expressed in messages posted are those of the author and do not necessarily reflect the official policy or position of NKS or any of its employees.


