Well, I was not actually following any CPU, but now that I have to purchase a new PC, I need to. I hope Bulldozer is what it's meant to be. I have high expectations from it.
Thanks for the link Skud that was quite informative for a noob like me.
I had planned a short preview of AMD's new Bulldozer architecture but couldn't find sufficient details on it. The first architectural details that surfaced some months back were vague, tbh, and not enough for a proper interpretation. Now AMD has released slides which discuss the architectural design in finer detail and show what exactly Bulldozer brings to the table. I have taken help from various sources to write this.
Out of order execution –
We already know that Bulldozer pairs two tightly coupled out-of-order processing units into what AMD calls a module. Talking about out of order, this is where its competitor Intel has been superior, because in its latest Sandy Bridge microarchitecture Intel greatly improved the out-of-order execution capabilities. Out-of-order execution means executing the instructions of a thread in an order determined by when their data is ready, rather than in the original order the program specifies.
Let's say a thread has 4 instructions A, B, C & D, where A depends on C and B depends on D. If the processor's execution engines process the thread in the original order, the dependencies cause a significant delay: A stalls waiting for C's result and B stalls waiting for D's.
But let's say we change the order to C, D, A & B. Then there are no unresolved dependencies and execution proceeds smoothly without any halts or delays. This is out-of-order execution, and it is handled by the CPU's frontend, which we will discuss more. Bulldozer improves greatly on this part. The instructions found in today's apps and programs are far more complex than the above example, so out-of-order execution significantly improves a processor's efficiency and execution time.
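The reordering idea above can be sketched in a few lines of Python. This is a toy model, not AMD's actual hardware algorithm: each instruction lists the instructions whose results it needs, and every "cycle" we issue whatever is ready instead of stalling on strict program order.

```python
def schedule_out_of_order(deps):
    """deps: {instruction: set of instructions it depends on}.
    Returns the order in which instructions actually issue."""
    done, order = set(), []
    pending = dict(deps)
    while pending:
        # every instruction whose dependencies have all executed is "ready"
        ready = [i for i, d in pending.items() if d <= done]
        if not ready:
            raise RuntimeError("circular dependency")
        order.extend(ready)          # issue all ready instructions this cycle
        done.update(ready)
        for i in ready:
            del pending[i]
    return order

# The example from the text: A depends on C, B depends on D.
program = {"A": {"C"}, "B": {"D"}, "C": set(), "D": set()}
print(schedule_out_of_order(program))  # ['C', 'D', 'A', 'B']
```

In program order, A and B would sit at the head of the queue stalled; by letting C and D go first, every instruction has its inputs ready when it executes.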
Bulldozer Die –
*i55.tinypic.com/sc98wg.jpg
The standard Bulldozer die has 4 modules (the execution cores) with 128kb of L1 cache in total (32kb per module, or 16kb per core). It uses a 64-byte cacheline; each cacheline holds an index, a tag, and the data. The index is the line's location within the cache and is unique for each cacheline. The tag records which main-memory address the cached data came from. The data field holds the value fetched from main memory.
When the cpu needs to read or write a location in main memory, it first checks whether that memory location is in the cache. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a measure of the effectiveness of the cache for a given program or algorithm.
Talking about associativity: it is the number of cache locations where a copy of a given main-memory address may be placed. In an N-way set-associative cache, each memory address maps to one set and can occupy any of the N lines (ways) in that set.
Bulldozer has a 4-way associativity in its L1 cache.
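Here is a minimal sketch of a 4-way set-associative lookup, mirroring the index/tag/data description above. The set count and replacement policy are illustrative assumptions, not Bulldozer's real cache geometry.

```python
LINE_SIZE = 64      # bytes per cacheline, as in the text
NUM_SETS  = 64      # number of sets (illustrative)
WAYS      = 4       # 4-way associative: 4 lines per set

# cache[set_index] is a list of up to WAYS (tag, data) entries
cache = [[] for _ in range(NUM_SETS)]

def split_address(addr):
    offset = addr % LINE_SIZE                  # byte within the line
    index  = (addr // LINE_SIZE) % NUM_SETS    # which set the address maps to
    tag    = addr // (LINE_SIZE * NUM_SETS)    # identifies the memory block
    return tag, index, offset

def lookup(addr):
    """Compare the tag against every way in the set: hit or miss."""
    tag, index, _ = split_address(addr)
    for stored_tag, data in cache[index]:
        if stored_tag == tag:
            return True, data          # cache hit
    return False, None                 # cache miss

def fill(addr, data):
    """Bring a line into the cache after a miss."""
    tag, index, _ = split_address(addr)
    ways = cache[index]
    if len(ways) == WAYS:
        ways.pop(0)                    # evict the oldest line (FIFO, for simplicity)
    ways.append((tag, data))

fill(0x1000, "line A")
print(lookup(0x1000))   # (True, 'line A') — hit
print(lookup(0x2000))   # (False, None)    — same set, different tag: miss
```

Note how two different addresses can land in the same set; associativity is what lets both live in the cache at once instead of evicting each other.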
The L2 cache is big: 8MB in total, or 2MB dedicated per module, with the same 64-byte cacheline, but 16-way associative.
The die also has the north bridge integrated, with 8MB of L3 cache shared among the four modules, two 72-bit-wide memory channels (as seen in the diagram), and four HyperTransport links, each with 16-bit transmit and 16-bit receive lanes, for communication with other peripherals.
Bulldozer module –
*i56.tinypic.com/2zj99hz.jpg
The Bulldozer module contains the fetch and decode unit (the frontend), which we will discuss later. Each Bulldozer module has one fetch and decode unit. The actual execution hardware consists of two separate integer units, each containing a scheduler, four instruction pipelines, and an L1 data cache. Scheduling is the assignment of ready instructions to the CPU's execution resources and is a fundamental aspect of a multitasking environment. We will discuss it in more detail later. The integer units work only on integer data.
Next up is a single floating point execution unit that packs the floating point scheduler, two 128-bit FMAC pipelines, and the L2 cache. Note that the L2 cache is shared by both the integer and floating point units, but the L1 data caches are dedicated to the integer units only. More details will follow later.
Bulldozer Core ( shared frontend)-
*i53.tinypic.com/es26o0.jpg
OK, now let's delve a bit deeper, shall we?
Fetch -
The front-end has been entirely modified and is now responsible for feeding both cores within a module. Bulldozer’s front end includes branch prediction, instruction fetching, instruction decoding and macro-op dispatch. These stages are effectively multi-threaded with single cycle switching between threads.
Since Bulldozer is a high-frequency design, branch prediction is critically important. Deeper pipelines tend to have more instructions in flight at a given time, so a mispredict results in flushing more instructions. The number of instructions flushed is directly related to wasted energy and lost performance.
The branch predictor is shared by the two cores in each module and decoupled from the instruction fetching via a pair of prediction queues (one queue per core). The branch predictor can run-ahead and will continue to predict new relative instruction pointers (RIPs) unless the queues are full.
The first step in branch prediction is determining the direction – whether a branch is taken or not. Once a branch is predicted as taken, the next step is to determine the target. For branches with a single target address, the branch target buffer (BTB) has been substantially expanded and now uses a two level hierarchy, similar to Nehalem. The L1 BTB is a 512 entry, 4-way associative structure that resolves predictions with a single cycle penalty to the pipeline. The L2 BTB is much larger with 5120 entries and 5-way associativity, but the extra capacity will cost additional latency for a L2 BTB hit. The BTBs in Bulldozer are competitively shared by both cores, but provide greater coverage than the 2K entry BTB in Istanbul.
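The two-level BTB behaves much like a small fast cache backed by a larger slow one. The sketch below uses the entry counts from the text, but the promotion policy and the 5-cycle L2 figure are simplifying assumptions of mine, not confirmed Bulldozer behaviour.

```python
L1_PENALTY, L2_PENALTY = 1, 5   # cycles; the L2 latency here is an assumed figure

l1_btb = {}   # branch address -> predicted target (512 entries in hardware)
l2_btb = {}   # larger, slower backing structure (5120 entries in hardware)

def predict_target(branch_addr):
    """Check the small fast BTB first, then fall back to the large slow one."""
    if branch_addr in l1_btb:
        return l1_btb[branch_addr], L1_PENALTY
    if branch_addr in l2_btb:
        target = l2_btb[branch_addr]
        l1_btb[branch_addr] = target   # promote to L1 so the next hit is fast
        return target, L2_PENALTY
    return None, 0                     # no prediction; resolved later in the pipe

l2_btb[0x400100] = 0x400800            # a branch only the L2 BTB knows about
print(predict_target(0x400100))        # (0x400800, 5) — L2 hit, promoted
print(predict_target(0x400100))        # (0x400800, 1) — now an L1 hit
```

The payoff is the same as with data caches: the common case (a hot branch) pays one cycle, while the big structure provides coverage for many more branches than 512 entries could.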
Once a RIP is placed into the prediction queue and passed to the next stage, the fetch unit accesses the dynamically shared ITLBs (Instruction TLB) and L1 I-cache. The L1 ITLB is fully associative with 72 entries across the various page sizes. A TLB, or translation lookaside buffer, caches the mapping between virtual and physical address spaces. So here, the instruction pointers from the prediction queue are translated through the TLB and used to access the instruction cache.
Decode –
*i53.tinypic.com/2qjvbt0.jpg
The decode phase of Bulldozer has been modified and now consists of a microcode engine and 4 decoders. After taking the fetched instructions from the 32-byte buffer queue, decode handles up to 4 instructions per cycle, owing to the 4 physical decode engines. After examining the instruction window, the decoders translate each x86 instruction into 1 or 2 macro-operations and place them into a queue for dispatching. Microcoded instructions (i.e. those requiring more than 2 macro-operations) are handled by the microcode ROM. In addition to having an extra decoder, Bulldozer is the first AMD CPU with branch fusion, whereby an adjacent arithmetic test or comparison and a jump instruction are decoded into a single macro-op.
The decoded macro-ops are then dispatched and handled by the execution units present in the cores.
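Branch fusion is easy to picture as a decode-time pattern match: when a compare/test is immediately followed by a conditional jump, the pair is emitted as one macro-op. The mnemonics and pairing rule below are illustrative, not AMD's exact fusion conditions.

```python
FUSABLE_FIRST = {"cmp", "test"}   # instructions that can lead a fused pair

def decode_with_fusion(instrs):
    """instrs: list of (mnemonic, operands). Returns the macro-op stream."""
    macro_ops, i = [], 0
    while i < len(instrs):
        op, args = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if op in FUSABLE_FIRST and nxt and nxt[0].startswith("j"):
            # fuse compare + conditional jump into a single macro-op
            macro_ops.append((op + "+" + nxt[0], args + nxt[1]))
            i += 2
        else:
            macro_ops.append((op, args))
            i += 1
    return macro_ops

code = [("mov", ["eax", "[mem]"]),
        ("cmp", ["eax", "0"]),
        ("jne", ["loop_top"])]
print(decode_with_fusion(code))
# [('mov', ['eax', '[mem]']), ('cmp+jne', ['eax', '0', 'loop_top'])]
```

Three x86 instructions become two macro-ops, so the fused pair occupies one slot everywhere downstream: dispatch, scheduler, and retirement.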
Before delving further into decoding, it is useful to discuss some terminology. Intel refers to the variable length x86 instructions as macro-operations. These can be quite complex, with multiple memory and arithmetic operations. Intel refers to their own simpler internal, fixed length operations as micro-ops or uops. AMD’s terminology is subtly (and confusingly) different. In AMD parlance, x86 instructions are referred to as AMD64 instructions – again, variable length and potentially quite complex. An AMD macro-operation is an internal, fixed length operation that may include both an arithmetic operation and a memory operation (e.g. a single macro-op may be a read-modify-write). In some cases, AMD also refers to these macro-ops as complex ops or cops. An AMD uop is an even simpler, fixed length operation that performs a single operation: arithmetic, load or store – but only one, and not a combination. So in theory, an AMD macro-op could translate into 3 AMD uops. The best way to think about AMD’s arrangement is that macro-ops are how the out-of-order microarchitecture tracks work, while uops are executed in execution units.
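A read-modify-write makes the terminology concrete. One variable-length AMD64 instruction becomes one fixed-length macro-op for out-of-order tracking, which can expand into up to three uops at the execution units. The structures below are purely illustrative notation for the hierarchy described above.

```python
x86_instruction = "add [mem], eax"   # one variable-length AMD64 instruction

macro_op = {                         # one fixed-length macro-op tracks it all:
    "op":  "add",                    # the arithmetic part
    "mem": "[mem]",                  # plus the memory part (read-modify-write)
}

uops = [                             # what the execution units actually run:
    ("load",  "tmp <- [mem]"),       # read the operand
    ("add",   "tmp <- tmp + eax"),   # modify it
    ("store", "[mem] <- tmp"),       # write it back
]

print(len(uops))  # 3 — one macro-op can translate into up to 3 uops
```

So the retirement queue and scheduler bookkeeping operate on the single macro-op, while the load, ALU, and store pipelines each see one simple uop.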
Coming down to the execution cores, let's first look at the integer cores. There are two integer cores in a single Bulldozer module. The decoded instructions from the dispatch queue, now in the form of macro-ops, are placed in the retirement queue; up to 4 macro-ops can be accepted per integer core per cycle. The retirement queue also maintains the bookkeeping logic for each macro-op.
Memory operations must also allocate an entry in the appropriate load or store queue, to maintain x86 consistency. However, any FP or SIMD macro-ops will be sent to the floating point cluster to continue execution, although the retirement status is tracked in the integer core. Note that a FP or SIMD memory access will dispatch both a memory macro-operation to the integer core and an execution macro-operation to the FP cluster.
Bulldozer’s physical register file (PRF) based renaming is a fundamental change to the out-of-order microarchitecture. In microarchitectures using a physical register file, there also are two structures to hold the state, but they are divided based on function. One structure is the physical register file, which holds all the data values – both for speculative renamed registers and architectural registers. The other structure is what AMD calls the retirement queue, which holds a pointer to the appropriate entry in the PRF and also status information about each in-flight macro-op and the associated register (e.g. is the register speculative or architectural). This microarchitecture enables lower power retirement and rollback, by manipulating the retirement queue pointers that map into the PRF, rather than copying register values. In Bulldozer, up to 4 macro-ops can be retired each cycle, matching the throughput in the rest of the CPU. Branch mispredicts are handled by a flash clear, which likely reverts the retirement queue to a prior known good state.
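The key property of PRF renaming, that values stay put while only pointers move, can be sketched briefly. The table sizes and register names below are illustrative, not Bulldozer's actual structures.

```python
prf = [None] * 16             # physical register file: the only place values live
free_list = list(range(16))   # physical registers not currently mapped
rename_table = {}             # architectural register -> physical register index

def rename_write(arch_reg):
    """A new write to arch_reg gets a fresh physical register."""
    phys = free_list.pop(0)
    old = rename_table.get(arch_reg)   # previous mapping; freed only at retire
    rename_table[arch_reg] = phys
    return phys, old

def read(arch_reg):
    return prf[rename_table[arch_reg]]

# rax = 5; then rax = rax + 1. The second write gets a *new* PRF entry,
# so rollback just restores the old mapping — no values are copied.
p0, _   = rename_write("rax"); prf[p0] = 5
p1, old = rename_write("rax"); prf[p1] = prf[old] + 1

print(read("rax"))   # 6 — the current speculative value
print(prf[old])      # 5 — the old value is still intact for recovery
```

This is exactly why a mispredict can be handled by a cheap "flash clear" of pointers: the architectural values never left the PRF, so nothing needs to be copied back.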
Once renamed, macro-operations are placed into the 40-entry unified scheduler, where they are held until all the necessary resources, such as source operands, are available. When uops are ready, the scheduler will issue up to four of the oldest ready uops to the appropriate execution units.
Integer execution units –
Bulldozer's integer execution units can be thought of as two mostly identical groups. The two address generation units (AGU 0 and AGU 1) are identical and perform the address calculations that feed the load-store unit and cache hierarchy. The integer execution units (EX 0 and EX 1) are each a fully featured ALU capable of executing the vast majority of integer operations; in addition, EX 0 houses the integer divider whereas EX 1 houses the integer multiplier.
Shared Floating Point Cluster –
*i53.tinypic.com/6ss4cj.jpg
*i54.tinypic.com/2nk71fo.jpg
The floating point cluster for Bulldozer has its own four-wide out-of-order execution facilities including renaming, scheduling and register files. We also know that it relies upon the integer cores for handling any loads and stores and also retiring macro-ops.
When a dispatch group is sent to a core, any FP or SIMD macro-ops are allocated an entry in the retirement queue, just like the integer macro-ops. However, the floating point or SIMD macro-ops are then sent to the FP cluster for renaming, scheduling and execution.
The incoming FP and SIMD instructions follow a path similar to the integer side. The decoded instructions, or macro-ops, are renamed through a PRF (similar to the integer PRF). It is dynamically shared between the two cores and has 160 entries, so that's a fairly large PRF.
Since the PRF is 128 bits wide, a 256-bit instruction will decode as two macro-ops and use two entries in the PRF, scheduler, and retirement queue. The macro-ops are then moved into the 60-entry scheduler, from which they are fed to the appropriate pipeline for execution.
Shared Floating Point Execution Units –
All execution units in Bulldozer are 128 bits wide, even though it can execute the new AVX instructions; thus all 256-bit instructions are divided into two macro-ops. Like the integer cores, Bulldozer does away with dedicated schedulers and lanes and prefers a flexible, unified approach: all 4 pipelines are fed from the 60-entry scheduler.
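The 256-bit split is simple to illustrate. Here a 256-bit packed add is modelled as a list of four 64-bit lanes, executed as two independent 128-bit halves; the function name and lane layout are illustrative.

```python
def vaddpd_256(a, b):
    """A 256-bit packed double add, cracked into two 128-bit macro-ops."""
    # macro-op 1: low 128 bits (lanes 0-1)
    low  = [a[0] + b[0], a[1] + b[1]]
    # macro-op 2: high 128 bits (lanes 2-3)
    high = [a[2] + b[2], a[3] + b[3]]
    return low + high

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(vaddpd_256(a, b))   # [11.0, 22.0, 33.0, 44.0]
```

Because the two halves are independent macro-ops, they occupy two scheduler and PRF entries, which is exactly the double bookkeeping cost described above.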
The heart of the cluster contains two 128-bit FMAC (floating point multiply-accumulate) units that can also perform division and square root with variable latency.
In addition, they also execute FADD & FMUL instructions, including AMD's own XOP instruction set for numeric and multimedia algorithms. The first pipeline (the first FMAC) also converts integers to floating point numbers, whereas the second pipeline handles permutations, shifts, shuffles, and packing and unpacking operations.
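What makes a *fused* multiply-accumulate different from a separate multiply then add is that it rounds only once, at the end. The sketch below emulates that single rounding with exact rational arithmetic; this is purely illustrative of the numerics, not how the hardware computes it.

```python
from fractions import Fraction

def fmac(a, b, c):
    """a*b + c computed exactly, then rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Pick values where the intermediate product is not exactly representable:
a = b = float(2**27 + 1)          # a*b = 2^54 + 2^28 + 1 needs too many bits
c = -float(2**54 + 2**28)

print(a * b + c)                  # 0.0 — the separate multiply rounded away the 1
print(fmac(a, b, c))              # 1.0 — the fused form keeps it
```

That extra bit of accuracy (plus doing two operations in one pipeline pass) is why FMAC units are valuable for numeric and multimedia workloads.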
** More info on memory and power management will be added shortly.
AMD's A-list is currently seeing what we know as Bulldozer at a secret presentation in Houston, Texas. They will be shown AMD's much delayed answer to the Core i7 and Core i5 processors, but our multiple, well-informed sources claim that it won't be able to beat the second-generation Core i7, the Sandy Bridge core. However, not everything is lost, as AMD's six- and eight-core CPUs will end up significantly cheaper and will offer pretty good value for money.
The issue is the IPC: instructions per clock don't perform as well as they should, and the chips cannot do enough calculations fast enough to beat Intel's current offering. Let's not even mention that the third generation of Core i7, Ivy Bridge at 22nm, will be even faster. And there is Sandy Bridge E coming to take the performance crown.
AMD at least got its act together and can produce this rather complicated chip in volume, as our sources claim that Bulldozer yields are at good levels, and that they are in much better shape than Llano, AMD’s A-series APU.
The good thing is that this chip is finally showing its face after a two-year delay. Bear in mind that Bulldozer is the first massive change to AMD's core since K8 and K10, and that it has to last for a very long time.
Is Bulldozer available now?
I've enquired at a Lamington Road shop and they say it's available..
So are they providing fake processors or something like that?
Need help here