EV6, EV67, EV68C, EV68A
Although the 21264 (EV6) processor was developed by DEC and was first mentioned at the Microprocessor Forum in October 1996, the final silicon implementations were completed only in February 1998, when DEC’s liquidation was already in full swing. The processor was a significant step forward compared to EV5, and revolutionary in many respects. One of the most important innovations was out-of-order execution, which required a fundamental core redesign and reduced the dependence of the functional units on cache and main memory bandwidth. EV6 could reorder up to 80 instructions on the fly, far more than competing products could offer (for example, the Intel P6 architecture supported out-of-order execution of up to 40 micro-operations, HP PA-8x00 up to 56, MIPS R12000 up to 48, IBM Power3 up to 32, and PowerPC G4 up to 5; Sun UltraSPARC II did not support instruction reordering at all). Out-of-order execution was accompanied by register renaming, for which 48 additional integer and 40 additional floating-point physical registers were implemented (the number of logical registers, also referred to as programmer-visible, remained unchanged).
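To illustrate why register renaming helps out-of-order execution, here is a minimal sketch in Python. It is not a model of the actual EV6 rename logic: the class name RenameMap, the sequential allocation order, and the absence of any register release at retirement are all simplifying assumptions made for illustration. Only the count of 48 additional integer registers comes from the text above; the 32 logical integer registers are the standard Alpha register count.

```python
from collections import deque

NUM_LOGICAL = 32   # integer registers visible to Alpha software
NUM_EXTRA = 48     # additional physical integer registers, per the text


class RenameMap:
    """Hypothetical rename table: maps logical to physical registers."""

    def __init__(self):
        # Assumption: logical register i initially lives in physical register i.
        self.table = list(range(NUM_LOGICAL))
        self.free = deque(range(NUM_LOGICAL, NUM_LOGICAL + NUM_EXTRA))

    def rename(self, dest, srcs):
        """Return (phys_dest, phys_srcs) for one instruction."""
        # Sources read the current mapping, so true dependencies are kept.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename would stall")
        # Every write gets a fresh physical register.
        phys_dest = self.free.popleft()
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs


rm = RenameMap()
a = rm.rename(1, [2, 3])   # r1 = r2 op r3
b = rm.rename(4, [1, 5])   # r4 = r1 op r5  (true dependency on a)
c = rm.rename(1, [6, 7])   # r1 = r6 op r7  (reuses the name r1)
print(a, b, c)
```

In this sketch the third instruction writes a different physical register than the first, so it does not have to wait for the second instruction to read the old value of r1. Only the true dependency (the second instruction reading the result of the first) constrains the order.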
The number of integer pipelines was increased to 4 (organized in 2 clusters), but they differed somewhat in function: the 2nd pipeline could multiply (7 clocks per instruction) and shift (1 clock), while the 4th could execute MVI code (3 clocks) and shift. In addition, all 4 pipelines supported elementary arithmetic and logical operations (1 clock). Each cluster had its own integer register file (80 entries, as noted above), but the two files were identical and kept synchronized. The 1st and the 3rd pipelines also handled some tasks of the A-box by calculating virtual addresses for load/store instructions. The A-box itself worked with the I-TLB and D-TLB (128 entries each), the load and store queues (32 entries each), and 8 64-byte buffers (the miss address file) for operations with the B-cache and main memory. The floating-point pipelines were also functionally different: the 1st supported addition (4 clocks), division (12 clocks for single precision and 15 clocks for double precision) and square root calculation (15 and 30 clocks, respectively), whereas the 2nd was only capable of multiplication (4 clocks). The square root unit and all corresponding instructions were new to the Alpha architecture. Just as in EV5, the decoder issued up to 4 instructions per clock, and the scheduler distributed them between 2 queues: one for the integer pipelines (I-queue, 20 entries) and one for the floating-point pipelines (F-queue, 15 entries). Besides square root calculation, EV6 also introduced prefetch instructions and instructions for transferring data between integer and floating-point registers.
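The division of labour between the pipelines described above can be written down as a small lookup table. The sketch below is only a restatement of the figures in the text in Python; the operation names ("alu", "addr", "mvi" and so on) and the helper pipes_for are hypothetical labels chosen for illustration, and None marks a latency the text does not give.

```python
# Latencies in clocks, taken from the paragraph above.
INT_PIPES = {
    1: {"alu": 1, "addr": None},            # also computes load/store addresses
    2: {"alu": 1, "mul": 7, "shift": 1},
    3: {"alu": 1, "addr": None},            # also computes load/store addresses
    4: {"alu": 1, "mvi": 3, "shift": 1},
}

FP_PIPES = {
    1: {"fadd": 4, "fdiv_single": 12, "fdiv_double": 15,
        "fsqrt_single": 15, "fsqrt_double": 30},
    2: {"fmul": 4},
}


def pipes_for(op, table):
    """Return the sorted list of pipelines able to execute `op`."""
    return sorted(pipe for pipe, ops in table.items() if op in ops)


print(pipes_for("alu", INT_PIPES))    # every integer pipeline
print(pipes_for("shift", INT_PIPES))  # only the 2nd and 4th
print(pipes_for("fmul", FP_PIPES))    # only the 2nd floating-point pipeline
```

Laid out this way, it is easy to see that simple arithmetic can go to any integer pipeline, while multiplication, MVI and floating-point multiplication each have exactly one pipeline available.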
The C-box was redesigned significantly: it now supported only 2 cache levels. The on-die L1 consisted of a 64KB I-cache and a 64KB D-cache, both 2-way set associative with 64-byte lines. The D-cache was write-back, although its contents were still duplicated in the B-cache. Because of the larger size and the more complex associativity, D-cache read/write latencies increased to 3 clocks (to/from an integer register) and 4 clocks (to/from a floating-point register). The D-cache remained dual-ported, but unlike in EV5 it was not composed of 2 identical parts; instead it was a single unit clocked at twice the core frequency. The external B-cache, from 1MB to 16MB in size, was direct-mapped and write-back, and used an independent 128-bit bidirectional data bus (with an additional 16 bits of ECC protection) as well as an independent 20-bit unidirectional address bus. It was built first from late-write (LW) SSRAM chips and later from double data rate (DDR) SSRAM. The B-cache operating frequency could be set from 2/3 to 1/8 of the core frequency, and unlike in previous generations of Alpha processors, the B-cache was no longer optional. The system data bus was only 64 bits wide (with an additional 8 bits of ECC protection) and bidirectional, but used DDR signalling. The system address bus was 44 bits wide, implemented physically as two 15-bit unidirectional channels without DDR support. The system control bus was 15 bits wide and also did not support DDR. The basic operating principle of the system bus was changed as well: the bus became dedicated instead of shared, so that every processor had its own path to the chipset.
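The L1 parameters given above (64KB, 2-way set associative, 64-byte lines) determine how an address is split into tag, set index and line offset. The following Python sketch works through that arithmetic. The function names are hypothetical, the sample address is arbitrary, and the sketch ignores whether the real hardware indexes with virtual or physical addresses. The B-cache line in the example assumes 64-byte lines, which the text does not state explicitly.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 D-cache or I-cache: 64KB, 2-way, 64-byte lines (figures from the text).
l1_sets, l1_offset_bits, l1_index_bits = cache_geometry(64 * 1024, 2, 64)
print(l1_sets, l1_offset_bits, l1_index_bits)   # 512 6 9

# Arbitrary sample address, chosen only for illustration.
print(split_address(0x12345, l1_offset_bits, l1_index_bits))   # (2, 141, 5)

# Smallest B-cache: 1MB, direct-mapped; the 64-byte line size is an assumption.
print(cache_geometry(1 << 20, 1, 64))   # (16384, 6, 14)
```

With these parameters each L1 cache has 512 sets of two lines, so 6 address bits select the byte within a line and 9 bits select the set; the remaining upper bits form the tag.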





