|
Chapter 2, Opteron's Floating Point
Units
|
|
|
|
|
|
2.1 The Floating Point Renamed Register
File |
|
Opteron's Floating Point renamed
register file has been increased from 88 to 120 entries. It is
a renamed register file in the classical meaning of the word. It's a
single entity that must contain all architectural (non-speculative)
and speculative values for the registers defined by the instruction
set.
The Opteron restores the support
for 72 speculative instructions again. The support for speculative
instructions was decreased from 72 to 56 with the introduction of
the Athlon XP core that included the eight 128 bit XMM registers for
SSE but did not increase the size of the 88 entry renamed register
file.
Each 128 bit XMM register uses two
entries in the renamed register file. The Opteron thus uses 32
entries to hold the architectural (retired) state of the now 16 XMM
registers, which explains the increase: 88 + 32 makes 120
entries.
40 of the 120 entries are used to
hold the architectural (non-speculative) state of the registers
defined by the instruction set. 32 are used for the sixteen
XMM registers. 8 are used for the eight x87/MMX registers.
A further 8 register entries are
used for micro code scratch registers, some- times called
micro-architectural registers. These registers are not defined by
the instruction set and are not directly visible to the programmer.
They are used by micro code to calculate complex floating point
calculations like sine or log instructions.
The 48 (40+8) entries
that define the architectural state of the processor are defined by
the 48 entry Architectural Tag Array. The entries that hold the very
latest speculative values for the 48 architectural register entries are
identified with the 48 entry Future File Tag Array.
The speculative state of the
processor needs to be discarded in case of a branch-miss-prediction
or exception. This is handled by overwriting the 48 entries of
the Future File
Tag Array with those
of the Architectural
Tag Array.
Each entry of the renamed register
file is 90 bit wide. Floating Point Values are expanded to a total
of 90 bits (68 mantisse, 18 exponent, 1 sign bit and 3 class bits)
The three class bits contain extra information about the floating
point number. The class bits also identify non floating point
numbers (integers) which are not expanded when written in the
renamed register file.
|
|
|
The 120
registers |
|
8
32
8 |
non speculative
registers:
FP/MMX registers
(arch.)
SSE/SSE2 registers
(arch.)
Micro Code Scratch
registers (arch) |
|
8
32
8
24 |
speculative
registers
FP/MMX registers (
latest )
SSE/SSE2 registers (
latest )
Micro Code Scratch
reg. (latest )
Remaining
speculative |
|
|
The 90 bit registers |
|
68
18
1
3 |
subdivision of the 90 bits
for FP
Mantisse
bits
Exponent
bits
Sign bit
Class Code
bits |
|
|
Definition of the
3 bit Class Code |
|
0
1
2
3
4
5
6
7 |
Zero
Infinity
Quit NAN
(Not A Number)
Signaling
NAN (Not A Number)
Denormal
(very small FP number )
MMX / XMM (non FP
contents)
Normal ( FP
number, not very small )
Unsupported | |
|
|
|
|
|
|
|
|
|
2.2 Floating Point rename stage
1: x87 stack to absolute FP register
mapping |
|
The "stack features" of the legacy
x87 are undone in this first stage of the Floating Point
pipeline. The x87 instructions access the eight architectural
80 bit registers via a 3 bit Top Of Stack (TOS) pointer.
Instructions use the TOS as both source and destination. The second
argument can be another value on the stack relative to the TOS
register or a memory operand. The 3 bit TOS pointer is maintained in
the 16 bit x87 FP status register.
The x87 TOS register relative
references are replaced by absolute references which directly
identify the x87 registers involved in the operation. A speculative
version of the TOS pointer is used for the translations. The 3 bit
pointer can be updated by the actions of up to three instructions
per cycle. Instructions can be speculative but are still in order
at this stage. They've not yet been scheduled by the Floating Point
Out-Of-Order scheduler.
If an exception or a
branch-miss-prediction occurs then the speculative TOS pointer is
replaced with the non-speculative retired one which is
retrieved from the reorder buffer. The retired version
reflects the value of the TOS during the instruction just prior to
the one that caused the exception or branch miss prediction.
|
|
2.3 Floating Point rename stage
2: Regular Register Renaming |
|
The actual register renaming takes
place in this stage. Each instruction that needs a destination
register gets one assigned here. The destination registers must be
unique in respect to all other instructions in flight. No
instructions may write to the same
register.
Up to three free register entries
are obtained from the register free list. There are 120
registers available in total. The free-list can have a maximum of 72
free entries, equal to the maximum number of instructions in
flight.
The remaining 48 entries hold the
values of the (non-speculative) architectural registers: The eight
x87/MMX registers, The eight scratch register (accessible by micro
code only) and the sixteen 128 bit XMM registers for SSE and
SSE2, each using two entries. These registers are not at a
fixed location but may occupy any of the 120 entries. This is what
makes the free-list necessary. The 48 entries occupied by the architectural
registers mentioned above are identified by the 48 entry
Architectural
Tag Array. It has an entry for each architectural register with a
value that points to one of the 120 renamed
registers.
Up to three instructions can thus
be renamed per cycle. The data dependencies are handled with the aid
of another structure,
the 48 entry Future File Tag Array This array contains
pointers the 48 renamed registers that contain the very latest
speculative values for each of the architectural registers.
The instructions that are getting renamed access this structure to
obtain the renamed registers were they can find their source
operands. The instructions will then store the renamed register
which was allocated to them to the Future File Tag Array so that subsequent instructions know were to find the
result data.
Example: An instruction uses
architectural registers 3 and 5 as input data and writes its result
back into register 3. It will first read entries 3 and 5 to obtained
the pointers to the renamed registers that contain or will contain
the latest values for register 3 and 5.
Say renamed registers 93 and 12. The instruction now knows its
source registers, 93 and 12 and can overwrite entry 3 of the
Future File Tag
Array with the
renamed register it was assigned to store it's result, say 97. A
subsequent instruction that needs architectural register 3 will now
use renamed register 97.
If an exception or
branch-miss-prediction occurs then the 48 entries of the
Future File Tag
Array are overwritten
with the 48 entries from the Architectural Tag Array. All speculative results are
thereby discarded. The pointers in the Architectural Tag Array were
written there by the retirement logic. Up to three values can be
written per cycle for each line of instructions that retires. The
values are taken from the Reorder Buffer. The Reorder Buffer is shared by all
instructions.
Floating Point Instructions that
finish write certain information like exception status, TOS used
et-cetera into the Reorder Buffer. This information includes also
the destination register they modify, Both the number of to
the architectural register and the renamed register are stored in
the Reorder Buffer. The two of them are used to update the
Architectural Tag
Array at retirement.
One as the data and the other as the entry number of the
Architectural Tag Array.
|
|
2.4 Floating Point instruction
scheduler |
|
The Floating Point scheduler uses
the following three criteria to determine if it may dispatch an
instruction to the execution pipeline it has been assigned to (
FPMUL, FPADD, FPMISC )
1) The instructions source
registers and or memory operands will be available.
2) The instruction Pipeline
to which the instruction has been assigned will be available.
3) The result bus for that
instruction pipe will be available on the clock cycle in which the
instruction will complete.
The scheduler will always dispatch the oldest
instruction that is ready for each of the three pipelines.
When we say will be available then we mean in two cycles from the current cycle. It
takes two cycles to get an instruction into
execution, one to schedule and another to read the 120 entry renamed
register file. An instruction checks if its source registers
are available first when it is placed in the scheduler. After that
it will continuously monitor the Tag busses of the result
busses for all source data still missing.
The Tag busses run two cycles ahead of the result
busses. The scheduler can thus see two cycles in advance which
results will become ready. A dispatched instruction will arrive in
two cycles at its execution were it grabs the incoming result data
from the selected result bus. The execution pipelines are 4 stages deep.
Instructions with lower latencies may leave the pipeline earlier,
after two or three cycles. Two cycles however is the shortest
execution latency.
Instructions that need load data
from memory wait until the data arrives from the L1 Data Cache or
from further away in the Memory Hierarchy. The scheduler knows two
cycles in advance that data is coming. This is one cycle more than
for integer loads. The extra cycle stems from the Data Convert and Classify
unit that
pre-processes Floating Point data from memory.
A load miss avoids that the Instruction which
needed the load data is removed from the scheduler. The instruction
stays in the scheduler until the data arrives with a load
hit. Any instruction that was
scheduled depending on load that missed is invalidated and its
results are not written to the register file.
|
|
2.5 The 5 read and 5 write ports of the
floating point renamed register file |
|
The renamed register file register
file is accessed directly after the instructions are dispatched Out
Of Order by the Scheduler.
Up to three instructions can
access the register file simultaneously. One instruction for
each of the three functional units. The FPMUL and FPADD instructions
obtain two source operands each while instructions for the FPMISC
unit only need a single operand.
Three write ports are available to write results from
the floating point units back to the register file. The write
addresses arrive earlier then the result data. This is used to
decode the write address in the cycle before the write
occurs. All three units can have memory
data as a source operand. The reorder buffer tags that accompany the
data coming from memory are translated to renamed register locations
by the load mapper. Two 64 bit loads can be handled per
cycle.
The new 120 entry register file
shows bypass logic at both sides. The bypasses are used to pass
result and or load data directly to succeeding dependent
instructions. Thereby avoiding any extra delay that would result
from the actual writing and reading from the register file.
|
|
2.6 The Floating Point processing
units |
|
There is a range of processing
units connected to the FPMUL, FPADD and FPMISC register file ports.
The ports determine to which of the three floating point pipelines a
particular unit belongs.
The x87 and SSE2 floating point
multiplier handles 64
and 80 extended length multiplications. The large Wallace tree which
handle the 64 bit multiplications for 80 bit extended floating point
and 64 bit integer multiplications can be split into two independent
Wallace trees that handle the dual 32 bit SIMD multiplications used
for SSE and 3Dnow functions ( US Patent
6,490,607 ) This unit can also autonomously handle floating
point divide and square root functions. These instructions are not
implemented with micro code but are handled entirely by this unit
itself with a single direct path instruction. The unit contains
bi-partite lookup tables for this purpose. ( US Patent
6,256,653 ) These table contain base values and
differential values for rapid reciprocal and reciprocal square root
approximations which are then used as a start point for the divide
and the square root instructions. This unit is connected to the
FPMUL ports of the register file.
The x87 and SSE2 floating point
adder handles 64 and
extended length additions and subtractions. It is connected to the
FPADD ports of the register file.
The 3Dnow! and SSE dual 32 bit
floating point unit handles the single length SIMD floating point
instructions as introduced in 3dnow! by AMD and SSE by Intel (The
later is called 3Dnow! professional in the Athlon XP). This unit is
connected to both the FPMUL and FPADD ports and can handle one 64
bit (2x32) instruction of each group per cycle, So one MUL type and
one ADD type instruction per cycle. 128 bit instructions of either
type have a throughput of one per two cycles.
The 2x64 bit MMX/SSE ALU
unit is a dual unit
that can handle certain packed integer 128 bit SSE instructions at a
throughput of 1 per cycle. It is connected to both the FPMUL and FPADD ports. The
FPMUL ports are used even though the instructions aren't
multiplications but rather adds, subtracts and logic functions. The
idea is to double op the size of operands that can be read and
written to the register file to a full 128 bit. The 128 bit
SSE instructions are still handled by two individual 64 bit
operations. The throughput is increased to one per cycle because
they can be executed by both the FPMUL and the FPADD
pipelines.
The 1x64 bit MMX/SSE Multiplier
unit handles MMX and
SSE integer multiplies. It is connected to the FPMUL ports of the
register file. It can handle a single 64 bit MMX instruction per
cycle or 128 bit SSE instruction with a 2 cycle throughput using two
64 bit operations.
The FP Store unit, more recently called the
FP Miscellaneous unit handles not only the stores but
also a number of other single operand functions such as Integer to
Float and Float to Integer conversions. It further provides a lot of
functions used by Vector Path generated micro code to handle more
complex x87 operations. It contains the Floating Point Constant ROM
that contains a range of floating point constants such as pi, e,
log2 et-cetera.
|
|
2.7 The Convert and Classify
units |
|
Load data that arrives from the L1
Data Cache or from further on the Memory Hierarchy goes through the
Convert and Classify
unit first. The Load
data is converted, if appropriate, to the internal 87 bit floating
point format (1 sign bit, 18 exponent and 68 mantisse bits ).
The floating point values are also classified into a three bit Class
code. The 87+3=90 bits are then stored into the 90 bit register
file. The Class code can sub-sequentially be used to speed up
floating point operations. For example: Only the class code needs to
be tested to find out if a number is zero instead of all 86 mantisse
plus exponent bits.
We've seen that the Floating Point
Scheduler runs two cycles ahead of the actual execution units. One
cycle more than the Integer Scheduler. It observes at the Tag busses
that identify two cycles in advance which results will become ready
at a certain result bus. The Tag busses also indicate which data
will come from memory in advance. However, the hit/miss signal may
later indicate that the data was erroneous because of a Cache Miss. The Convert and Classify units
add an extra cycle with at least somewhat useful work in order to
give the scheduler the time to take the Hit/Miss signal into
account.
The Optimization manual has a
whole appendix (E) dedicated to SSE and SSE2 optimizations related
to the classification of the contents of the SSE registers.
Instructions that operate on another data type then expected should
be avoided. Revision C does not need these optimizations anymore. It
is likely that Revision C can perform these format translations
itself without the intervention of microcode after an
exception.
|
|
2.8 X87 Status handling: FCOMI /
FCMOV and FCOM / FSTSW pairs
US Patents
6,393,555 &
6,425,074 |
|
AMD has managed to eliminate much
of the x87 legacy overhead and did speed up some important but
problematic functions. More specifically for the x87 status
register. Early Athlons used a large area to handle the processing
of the 16 bit floating point status register. This has all gone,
some of it already in the Athlon XP.
Program code with a conditional
test on x87 floating point values used to kill Out-Of-Order
advantages because of the serializing nature of the instructions
that make the floating point status code available to the Integer
Pipeline which handles the conditional branches. The Opteron has
special hardware to avoid this serialization and to preserve Out Of
Order processing.
|
x87 Floating Point Status register
|
15 |
14 |
13 |
12 |
11 |
10 |
9 |
8 |
7 |
6 |
5 |
4 |
3 |
2 |
1 |
0 |
|
x87
FP
Busy |
Cond.
Code
3 |
Top
of Stack |
Cond.
Code
2 |
Cond.
Code
1 |
Cond.
Code
0 |
Excep
tion
Status |
Stack
Fault |
Preci-
sion
excep |
Under-
flow
excep |
Over-
flow
excep |
Zero
Divide
excep |
Denorm
Oper.
excep |
Invalid
Oper.
excep |
|
B |
C3 |
TOS |
C2 |
C1 |
C0 |
ES |
SF |
PE |
UE |
OE |
ZE |
DE |
IE |
|
Different Parts of the x87
floating point status register are handled in different ways. The
register is a bit of a mixture of different things. It contains for
example the 3 bit TOS pointer that indicates which of the eight x87
is the current top of stack. The first Rename Stage holds the
speculative version of this pointer. It is used here to translate
the TOS relative register addresses to absolute x87 register
addresses. All finishing instructions preserve their copy of this
value in the Re-Order buffer when they finish. These copies then
become the non-speculative versions of TOS at the moment that the
instructions are retired out of the Re-Order
buffer.
The Retirement Logic may detect
that an exception or branch-miss-prediction did occur. It then
replaces the speculative version of the TOS in the first rename
stage with latest retired, non-speculative version. The speculative 3 bit TOS value is
used before the instructions are scheduled Out Of Order. The only
reason that it is used later on is during Retirement which is
handled In-Order again. This means that special Out-Of-Order
hardware for the TOS can be, and is eliminated.
The execution of a during Floating
Point instruction may itself cause an exception. Most bits of the
x87 status register are dedicated flags that identify exceptions.
Exceptions are always handled In-Order at retirement time. This
again means that any special Out-Of-Order hardware for these bits
can be, and is eliminated.
The tricky part is in the CC
(Condition Code) bits. These bits contain exception data most of the
time but may contain sometimes information which is the result of a
Floating Point compare and which must be processed in a full
Out-Of-Order fashion. The Opteron has special new hardware to handle
these cases. This hardware detects combinations of instructions that
need special handling.
|
Condition Code bits after a x87 Floating Point
compare
|
Cond.
Code
3 |
Cond.
Code
2 |
Cond.
Code
1 |
Cond.
Code
0 |
Compare
Result |
|
0 |
0 |
0 |
0 |
ST(0) >
source |
|
0 |
0 |
0 |
1 |
ST(0) <
source |
|
1 |
0 |
0 |
0 |
ST(0) =
source |
|
1 |
1 |
0 |
1 |
Operands were
unordered |
|
The first combination is a FCOMI
with a FCMOV. The first does a compare and sets the CC bits
according to the result. It then moves the compare result to the
Integer status Register. The FCMOV then does a conditional floating
point move depending on the Integer Status bits. Opteron's hardware
allows full speed processing here by implementing an Out-Of-Order
bypass that avoids that the FCMOV has to wait for the actual Integer
Status Flags.
The second combination is the FCOM
and FSTSW pair. The first instruction is identical to the FCOMI
instruction with the exception that it does not copies the CC bits
to the Integer Status bits. It's the FSTSW (Floating point Store
Status Word) instruction that copies the 16 floating point status
bits to the AEX register or to a Memory Location from were they can
be used for conditional operations. The later is a serializing
operation because all floating point instructions need to finish
first before the 16 status flags are known. The Opteron has special
hardware that does allow maximum speed Out-Of-Order processing
without the serializing disadvantage. It also provides a way to
recover from any (rare) miss predictions.
The result of all AMD's x87
optimizations is that the Opteron literally runs circles around the
Pentium 4 when it comes to x87 processing. It has removed large
special purpose circuits for status processing and implemented a few
small ones that handle the cases mentioned. The shift to SSE2
floating point however will make removed area overhead more
important than the speed-ups.
|
|
|
|
|
|
Chapter 3, Opteron's Data Cache and
Load / Store units
|
|
|
|
|
|
3.1 Data Cache: 64 kByte with three cycle
data load latency |
|
The Opteron's relatively large L1
Data Cache supports a three cycle Load-Use latency. Actually only the second and third cycle are used to
access the Cache memory itself. The first cycle is spend in the
Integer Pipeline for the x86 memory address calculation using one of
the three available AGU's. The address calculated by the AGU is send
to the memory array in the second cycle where it is decoded. This means that it is known at
which word
line the data can be
found at the end of the second cycle.
The right data word is activated
at the beginning of the third cycle. Data is accessed in the memory
array, selected and send forward to the Integer Pipeline or the
Floating Point pipeline. Below the more detailed timing of a typical
Integer x86 instruction
F( reg,mem ). This type of instruction
first loads data from memory and then performs an operation on
it.
We see that in the same
cycle in which the instruction is dispatched to the Scheduler it is
also dispatched to the so-called "Pre-Cache Load/Store unit" or
simply LS1. Instructions in this unit compete for cache access
together with those in LS2. The instructions in LS1 first need
to wait for their effective memory address. They monitor the result
busses of the AGU's. An instruction in LS1 knows from which AGU it
can expect its address. Instructions check the re-order buffer Tag
which identifies the address one clock-cycle in advance. In general,
an instruction in LS1 will fetch its address and wait for its turn
to probe the cache.
|
Typical timing of an F ( reg, mem ) x86
operation.
|
Cycle
|
Integer
Scheduler |
Load /
Store
Unit (LS1) |
ALU's and
AGU's |
Cache Address
Decode |
Cache Data
Access |
|
0 |
Dispatched
to
Scheduler |
Dispatched
to LS1 |
|
|
|
|
1 |
AGU
Scheduled |
|
|
|
|
|
2 |
|
Load
Scheduled |
Address
Generation |
|
|
|
3 |
|
|
|
Cache
Address
Decode |
|
|
4 |
ALU
Scheduled |
|
|
|
Cache Data
Access |
|
5 |
|
|
Dependent
Operation |
|
|
|
Instructions may route the address
immediately to the cache also if there are no other (older)
instructions waiting. This is the case in our example above. In any
case, each instruction will keep the address for possible follow-on
actions. The address is send directly from the AGU result bus to the
Data Cache's address decoders in our case here. Data comes
back from memory one cycle later and is routed to the Integer
Pipeline. LS1 places the re-order buffer Tag one cycle in
advance on the Data Cache result Tag bus so that the Integer ALU
schedulers can schedule any instruction depending on the load
data.
|
|
3.2 Two accesses per cycle, read or
write: 8 way bank interleaved, two way set
associative |
|
The Opteron's cache has two 64 bit
ports. Two accesses can occur each cycle. Any combination of loads
and stores is possible. The dual port mechanism is implemented by a
banking mechanism: The cache consist of 8 individual banks, each
with a single port. Two accesses can occur simultaneously if they
are to different banks.
|
Virtual Address bit used to access the L1 data
Cache
|
Cache Line Index |
Bank |
Byte |
|
14 |
13 |
12 |
11 |
10 |
9 |
8 |
7 |
6 |
5 |
4 |
3 |
2 |
1 |
0 |
|
A single 64 byte cache line is
subdivided in 8 independent 64 bit banks. Two accesses are to two
different banks if their addresses have a different bank-field,
address bits 3 to 5. The bits are the lowest possible address
bits that can be used for this purpose. This schem effectively maps
adjacent 64 bit words in different banks. The principle of
data
locality makes these
bits the most suitable choice.
The 64 kByte Cache is two way
set-associative. The
cache is split in two 32 kByte ways accessed with Virtual Address bits [14:0]
A hit into any of the two ways is
detected if the Physical Address Tag, bits [39:12], which is stored alongside with each
cache line, is identical to bits [39:12] of the Physical Address. Virtual to Physical
address translation is performed with the help of the TLB's
(Translation Look aside Buffers). A port accesses 2 ways and
compares 2 tags with the translated address. Each port has its own
TLB to do the address translation.
The two 64 bit ports are used
simultaneously when exchanging cache-lines with the rest of the
memory hierarchy. This means that the memory bus from the unified L2
cache to the L1 data cache is now 128 bit wide. The event where a
new cache line is needed will take first 4 cycles to evict the old
cache-line and then 4 cycles more to load the new cache-line when it
arrives.
|
|
|
|
|
|
|
|
|
|
3.3 The Data Cache Hit / Miss
Detection: The cache tags and the primairy
TLB's US Patent
6,453,387. |
|
The L1 Data Cache has room
to store 1024 cache lines out of the total of 17,179,869,184 cache
lines that fit within the 40 bit physical address space.
Accesses need to check if the stored cache line corresponds with the
actual memory location they want to access. It is for this purpose
that the Tag rams store the higher physical address bits belonging
to each of the 1024 cache-lines. There are two copies of the Tag ram
to allow the simultaneous operation of two access
ports.
The Tag rams are accessed with
bits [14:6] of the virtual address. Each Tag ram outputs 2 Tags for
both ways of the 2 way- set-associative
cache. The wanted cache-line can be in either way. The Tag rams
contain physical addresses. A physical address uniquely defines a
memory position throughout the entire distributed system
memory.
The cache is however accessed with
the virtual addresses as defined by the program. Virtual addresses
have only a meaning from within a process context. This means that a
virtual-to-physical-address translation is needed to be able to check the
physical Tags. This translation is handled by a lengthy sequence of
four table lookups in memory: The virtual address field
[47:12] is divided into four equal sub-fields that each indexes into
one of the four tables. Each table points to the start
of the next table, The last table, the page table, then finally
contains the translated address.
|
Virtual Address to Physical Address Translation:
The Table Walk.
|
virtual
address |
page
offset |
|
page map level
4
table
offset |
page
directory
pointer
offset |
page
directory
offset |
page table
offset |
|
47
39 |
38
30 |
29
21 |
20
12 |
11
0 |
|
$ |
$ |
$ |
$ |
$ |
|
|
$ |
$ |
$ |
$ |
|
9 |
|
$ |
$ |
$ |
|
|
9 |
|
$ |
$ |
|
|
|
9 |
|
$ |
|
|
|
|
$ |
$ |
|
|
|
physical
address |
page
offset |
|
39
12 |
11
0 | |
|
This so-called Table-Walk is a
very lengthy procedure indeed. The Opteron uses so-called
Translation Look aside Buffers (TLB's) to remember the 40 most
recently used address translations. 32 of these remember
4k page translations using the scheme
above. The remaining 8 are used for so-called 2M / 4M page translations which skip the
last table and define the translations for large 2 Megabyte pages. (
The 4M pages are only used for backwards compatibility
)
The virtual address bits {47:12] are
compared with all 40 entries of the TLB's in the second of the three
clock-cycle access. At the end of the second cycle we know if any
one of them matches. Each entry also contains the associated
physical address bits [39:12]. These are selected in the third
cycle and compared with the physical Tags to test if we have a cache
hit.
|
|
3.4 The 512 entry second level
TLB |
|
If the necessary translation is
not found within the 40 entries of the primary TLB's, then
there is a second chance that it is available in the level-2
TLB which is shared by both ports. This table contains 512
address translations. This larger table can be used to update the
primary TLB's with a
minor delay. It is
organized in a different way: It is 512 entry 4-way
set-associative.
This means that it has 128 sets of
4 translations each. Virtual address bits [18:12] are used to
select one of the 128 sets. We get four translations giving us
four chances that we have the translation we need. Each
translation contains the rest of the virtual address bits
[47:19]. We can check if we have the right translation by
comparing these bits with our address. The matching entry then
contains the associated physical address field [39:12] we
need.
|
|
3.5
Error Coding and Correction |
|
The L1 Data Cache is ECC protected
(Error Coding and Correction). Eight bits are used for
each 64 bits to be able to correct single bit errors and to detect
dual bit errors with the help of a 64 bit Hamming SED/DED scheme
(Single Error Detection / Double Error Detection) Six parity
bits are needed to retrieve the position of the error
bit.
|
|
E
C
C |
64 bit
Hamming SED/DED error location
retrieval
|
|
bit
63
bit 0 |
|
|
0 |
1 |
1 |
0 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
0 |
0 |
1 |
1 |
1 |
1 |
0 |
1 |
x |
0 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
1 |
0 |
1 |
1 |
0 |
0 |
0 |
1 |
0 |
1 |
1 |
0 |
1 |
1 |
0 |
1 |
0 |
1 |
0 |
1 |
1 |
0 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
|
0 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
x |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
x |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
x |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
0 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
x |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
x |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
0 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
x |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The six bits are shown in the
column at the left. A one means that a parity error was detected.
The six bits represent the parity of the 32 purple bits in each
rows. The parity errors together now represent a 6 bit index that
points to the error position. Additional parity bits are used to
detect double bit errors and errors in the parity bits themselves.
(Thanks to Collin for bringing this to my attention)
|
|
3.6 The Load / Store
Unit, LS1 and LS2 |
|
The Load Store unit handle the
accesses to the Data Cache. This type of unit plays an increasingly
important role play in modern speculative out-of-order processors.
They are expected to grow significantly in size and complexity in
newer architectures on the horizon. An extra reason to give
the Opteron's Load Store units a closer look. The split in LS1 and
LS2 is sometimes described as LS1 being for the L1 Data Cache and
LS2 for the L2 Cache. This is far to popular however and even
incorrect. We'll go into more detail here.
|
|
|
|
|
|
|
|
|
|
3.7 The "Pre Cache" Load / Store
unit: LS1 |
|
The Pre-Cache Load/Store unit
(LS1) is the place where dispatched memory accesses wait for the
addresses generated by the AGU's (Address Generator
Units) LS1 has 12 entries, whenever a memory access is
dispatched to the Integer Scheduler it is also dispatched to an
entry in LS1. The re-order tag bus belonging to the AGU indicates if
the required Address is being calculated and available on the result
bus of the AGU in the next cycle. An access waiting in LS1 knows at
which AGU to look for the address.
When an instruction has its
address coming or did already receive it may then probe the cache.
There are two access ports. The two oldest accesses in LS1 will be
allowed to probe the Cache. Both load and store instructions probe
the cache. A load will actually access the cache to obtain the load
data. The store presents its address but will never write from LS1
to the Cache. Store instructions will only write after they've
received the data to be written and when they are retired.
Stores must be retired first
because the store instruction may be speculative and is discarded
later. Imagine that MicroSoft patches a buffer overflow exploit by
adding a test for the overflow. This test becomes a conditional
branch that prevents the write to the buffer in case of an overflow.
The overflow tends to never happen so the branch predictor will
predict it as not-taken, It will do so also in the case that it
finally does happen. The write to the buffer will now be executed
speculative.
So the actual writes to the cache
must be delayed until after retirement when it's verified that the
branch predictions were correct.
These deferred stores do not
introduce any real delays however. Loads that access the cache also
check LS1 and LS2 to see if there are any pending writes to the
memory location they are about to read. If so than they catch the
data directly from LS1 or LS2 without delay.
The stores in LS1 however do
present their address to the cache hit/miss detection logic. If it
turns out that the cache-line is not present then it may be loaded
as soon as possible from the Level 2 cache or from system
memory. This can be a good policy since there is a significant
chance that following loads will need the same cache-line. Stores
may receive the data they have to write to memory while waiting in
LS1 as long as the data comes in time, otherwise they move on to LS2
to receive the data there.
|
|
3.8
Entering LS2: The Cache Probe Response |
|
All accesses in LS1 probe the
Cache and then move on to the Post-Cache Load/Store unit ( LS2
)
An access can either be a Load, a
Store or a Load-Store (The latter reads first and then writes
the result back to the same location)
All accesses which came from LS1
first wait to see the results from the cache probe. If it was a
cache hit or a miss, If there was a cache parity error. They also
receive the physical address which was translated from virtual into
physical by the TLB's. Together with the physical address come the
page attribute bits which determine for instance if the memory is
cacheable or not.
Then in the following cycle, in
case there was a cache miss, the instructions receive a so-called
MAB tag ( Missed Address Buffer Tag
) This tag will later be used to see if a missed cache-line
arrives from the L2 cache or from system memory. The MAB tag
needs to be used instead of the generally used re-order buffer tags.
Multiple Loads and Stores may depend on the same cache line and thus
on the same MAB tag. All these accesses miss and they'll all
receive the same MAB tag.
The Bus Interface Unit (BUI)
will load missed cache-lines from the unified L2 cache or system
memory to fill the data-cache. It also presents the so-called
Fill-tag to LS2. This fill-tag is compared
to the MAB-tag of all accesses that missed. The accesses that match
the fill-tag are changed from miss to hit.
|
|
3.9 The
"Post Cache" Load Store unit: LS2 |
|
The so-called Post-Cache Load
Store unit ( LS2 ) has 32 entries. It is organized in a somewhat
"shift register" like way so that the oldest outstanding access ends
up in entry 0. Each of the 32 entries has numerous
fields. Many of these fields are accompanied with a comparator and
other logic to see if the fields matches a certain condition.
All accesses stay in LS2 at least until retirement, Accesses that
missed the cache will wait in LS2 until the cache-line arrives from
the memory hierarchy. All Stores wait in LS2 for their
retirement first before actually writing data to memory.
|
Various fields in an LS2 buffer entry
|
|
|
Type |
Address &
Data |
Tags |
Status
Flags |
Action
Flags |
.... |
|
|
|
|
|
|
|
|
Valid
Flags
|
Acc.
Type
|
Store
Data
64 bit
|
Virtual
Address
|
Physical
Address
|
mem
Type
|
Instr-
uction
Tag |
Write
Data
Tag |
Missed
Address
Buffer
Tag |
Cache
Hit /
Miss
|
Retired
access
|
Last
Store
in Buff
(LIB) |
Self
Mod.
Code
Flag |
Snoop
Re-
Sync
Flag |
Store
Load
For-
ward |
....
|
|
|
|
Retired Stores in LS2 that have
the hit/miss flag set to hit may use a cache port simultaneously with a probing store in LS1. The
retired store from LS2 writes to the data cache itself but does not
use the cache hit/miss logic. The probing store from LS1 only uses
the hit/miss logic but doesn't access the data cache itself. This
shared use is important performance wise because each store would
occupy a cache port twice otherwise, first while probing from LS1
and secondly when writing from LS2 after retirement. This would
halve the store bandwidth of the L1 Data Cache.
|
|
3.10 Retiring instructions in the Load
Store unit and Exception Handling |
|
All access instructions, Loads as
well as Stores stay in LS2 until they are retired. Loads may be
removed directly from LS2 when they are retired to make place for
new instructions. Stores must still write their data to memory. They
wait to do so until retirement when it is determined that no
exception or branch miss-prediction occurred. Writes are
removed from LS2 after they have committed their data to
memory.
LS2 has a retirement interface
with the re-order buffer. The re-order buffer presents the Tag
of the line that is being retired to LS2.
It only needs to present a single
Tag for up to three instructions in a line since these all have the
same tag except for the 'sub- index' which identifies the lane (0, 1
or 2). LS2 compares all-instruction tags with the
retirement-tag and set the Retired flag of those who match. Retired loads may be deleted
directly from LS2.
If the retirement logic of the
re-order buffer has detected a branch-miss prediction or exception
then all instructions matching the retirement tag and all those with
succeeding tags are discarded from LS2. The only ones left in LS2
are the retired stores that are waiting to commit their data to
memory.
|
|
3.11 Store to Load forwarding, The
Dependency Link File
US Patent 6,266,744. |
|
A Load probing the data cache will
also check the Load Store units to see if there are any outstanding
stores to the same address as the load. If it finds such a store (
and the store is before the load in program order ) then there are
two possibilities. If the store has already obtained the write data
from one of the result busses then these can be directly forwarded
to the load. If the store has not yet obtained it data then the load
misses and moves to LS2.
An entry is created in a unit
called the Dependency
Link File. This unit
now registers both the tags of the write data, ( which tells the
data-to-be-stored is coming in the next cycle ) as well as the Load
tag which is the be used to tell a following instruction that the
load data will be available. The Dependency Link File keeps
monitoring the write data tag, and then, as soon as it detects it,
puts the load instruction tag on one of the Cache Load tag
busses.
It does the same with the actual
data when it comes one cycle later. The result data from instruction
1 can be directly forwarded to the consuming instruction 4 in the
example below. Instructions 2 and 3 (the store and the load) are
bypassed in this case.
|
1) F(
regA,regD ); // register A is a
function of register A and register D
2) store (
mem, regA ); // store register A to memory
3)
load ( regB, mem ); // load register B from the same memory
location
4) F(
regD, regB ); // uses register B and
register D to calculate new value of register D
|
Miss-matched store to
loads: Stores that only modify part of the load data are not
supported. The load must first wait unit the store is retired and
stored to memory. The load may then access the cache to get it's
data which is a combination of the stored data and the original
contents of the cache. The optimization manual describes all
possible miss-match cases since they can lead to a considerable
performance penalty.
Multiple Stores to the same
address are handled with the so-called LIB flag ( Last In Buffer )
This flag identifies the most recent store to a certain address. A
newer load accessing the same address will choose this one. Multiple
partial stores to the same word were each modifies only a part of
the word are not supported by the Load Store buffer. They are not
merged in the Load Store buffer. They will be merged later on in the
cache after all stores are retired and written.
|
|
3.12 Self Modifying Code checks:
Mutual exclusive L1 DCache and L1 ICache
US Patent
6,415,360. |
|
Self Modifying Code (SMC) checks
must in principle be performed for each store. It must be tested if
the store does not modify any of the instructions in the Instruction
Cache or any following Instruction in flight in any stage of
execution. A significant simplification is made by making the L1
Data Cache and L1 Instruction Cache exclusive to each other: A cache-line can only exist in either
one, not in both at the same time. When a cache line is loaded in
the L1 Data cache then it will be evicted from the L1 Instruction
cache.
The first advantage is that the
contents of the Instruction Cache does not need to be tested any
further for SMC. The second advantage is that SMC checks may be
limited to Data Cache misses. Stores to un-cacheable memory must be
checked always.
( They always "miss" ) The store's
write-address is send from LS2 to the SMC test unit which is close
to the Instruction Cache. This units holds the cache-line addresses
of all the Instructions in flight. If there is a conflict then it
marks the store that caused the conflict. The reorder buffer
will discard all instructions which follow the store when the store
is retired.
|
|
3.13 Handling multi processing
deadlocks: exponential back-off
US Patent
6,427,193. |
|
Deadlocks can occur when multiple
processors fight for the ownership of the same cache-line. They do
so for instance if they both want to write to the same line. A
cache-line is generally loaded as soon as possible in case of a
cache-miss. This will cause the cache-line to be invalidated in
other caches in case of a store. Two processors get in a deadlock if
they keep invalidating each others cache-lines before they are able
to finish the stores.
An example given is the case where
two processor try to complete a store which is to an unaligned
address so that part of the store data goes to cache line A1 and
part of the store data goes to cache line A2. Unaligned stores of
this type are typically split into two stores by the
hardware. An exponential back-off mechanism is provided to handle
this kind of deadlock situations. A back-off time is introduced when the memory access remains
unsuccessful before retrying to become owner of the cache-line
again. This time grows exponentially after each unsuccessful try until one of the
processors finally succeeds.
|
|
3.14 Improvements for multi processing and
multi threading |
|
|
The
Opteron's micro architecture has a large number of
improvements related to multi processing and multi threading.
Very important improvements also for the desktop market.
Multi-processor on a chip solutions are just around the corner
and hyper- threading may take a significant step forward in
the near future with Intel's Prescott.
The ability
to perform multi processing and multi threaded applications
efficiently becomes essential. Switching contexts, starting
and ending of processes and threads as well as inter-process
and inter-thread communication is traditionally associated
with large overheads. Significant improvements have been made
to reduce these overheads to a minimum. |
|
|
|
|
3.15 Address Space Number (ASN) and Global
flag
US Patent 6,604,187. |
|
Different processes can have
different contexts That is: different translations from virtual to
physical addresses. A process switch will cause the Translation Look
Aside buffers to be invalidated ( flushed ). Large Translation
buffers won't help you a lot if they are frequently flushed which
then can lead to significant performance degradation. The Opteron
introduces a new mechanism to avoid flushing of the TLB's. An
Address Space
Number (ASN) register
is added together with an enable bit (ASNE).
The Address Space Number
is used to
uniquely identify a process. Each entry in the TLB now
includes the ASN of the process. An address can be successfully
translated if the address matches the Virtual Address Tag in the TLB
and the ASN register matches the ASN
field in the TLB. The ASN field can be seen as an "extension" of the
Virtual Address. This now means that different translations of
different processes can coexist in the TLB, avoiding the need to
flush the TLB's for context switches.
A global flag is available for data and code that is preferably
accessible for all processes, typically operating system related.
Global translations do not require the ASN fields to match. This
means that many processes can share a single entry in the TLB to
access global data. Another advantage of the ASN and global flag is that
flushing can be limited to specific entries whenever an invalidation
of the TLB is needed. Only the entries which have a certain
ASN or have the global bit set are flushed.
|
|
3.16 The TLB
Flush Filter
CAM
US Patent 6,510,508. |
|
The TLB's can be seen as caches
containing the translation information stored in the address
translation tables in memory. The actual translation requires
several levels of indirections through the tables stored in main
memory. This is the so-called "table walk"
A very time consuming process
which may take many hundreds of cycles for a single TLB entry. The
Opteron attempts to speed up the table walk with a 24 entry Page
Descriptor Cache.
Even so, it remains important to
avoid the table walk whenever possible in a multi-tasking
multi-threaded environment. A table walk becomes necessary whenever
entries in the TLB do not correspond to the memory resident
translations anymore because some- body has modified the
latter.
Until now there was only one way
to guarantee TLB coherency: Flush the TLB's if it may be
possible that any of the entries is not identical anymore to the
memory resident tables. Many actions in the x86 architecture result
in an automatic flush of the TLB's, often unnecessary. A new feature
in the Opteron: The TLB flush filter can avoid these costly flushing
in many occasions.
The TLB Flush filter is
implemented as a 32 entry, Content Addressable Memory ( CAM ). It
remembers the addresses of regions in memory that were accessed when
the TLB's were loaded. These regions thus belong to the Page
Translation Tables. The Filter then keeps monitoring all accesses to
memory to see if any of these regions are accessed again. If not
then it may disable the flushing of the TLB's because coherency is
guaranteed.
|
|
3.17 Data Cache
Snoop Interface |
|
The Snoop interface is used for a
wide variety of purposes. It's used to maintain Cache Coherency in a
multiprocessor system. It is used for conserving strict memory
ordering in shared memory, for Self Modifying Code detection, for
TLB coherency et-cetera.
The snoop interface uses the
physical addresses from other processor accesses as well as from
accesses issued on behalf of the instruction cache to probe various
memories and buffers for data that has somehow, something to do with
that particular address.
|
|
3.18 Snooping the Data Cache for Cache
Coherency, The MOESI protocol |
|
The Opteron can maintain cache
coherency in systems of up to 8 processors. It uses the so-called
MOESI protocol for this purpose. The
snoop interface plays a central role in the effectuation of the
protocol.
If a cache line is read from
system memory ( which may be connected to any of the eight
processors ), then the read has to snoop all the caches of all
processors. Snoop accesses are much smaller then normal memory
accesses because they do not carry the 64 byte cache line data. Many
snoops may therefore be active without overloading the distributed
memory system throughput. A snoop may find the cache-line in one of
the caches of another processor.
If a processor does not find the
cache-line in someone else's cache then it loads it from system
memory into its cache and marks it as Exclusive. Now whenever it writes something
in the cache-line then it becomes Modified. It does in general not write the modified cache-line
back to memory. It only does so if a special memory-page-attribute
tells it to do so (write through). The cache line will be evicted
only later on if another cache-line comes in which competes for the
same place in the cache.
If a processor needs to read from
memory and it finds the cache line in someone else's cache then it
will mark the cache line as Shared.
If the cache-line it finds in the other processors cache is
Modified then it will load it directly from
there instead of reading it from the memory which may be not up to
date. Cache to cache transfers are generally faster then memory
accesses.
The status of the cache-line in
the other cache goes from Modified to Owner.
This cache-line still isn't written back to memory. Any other
(third) processor that needs this cache-line from memory will find a
Shared version and a Owner version in the caches of the
first two processors. It will obtain the Owner version instead of reading it
from system memory. The owner is the latest who modified the
cache-line and stays responsible to update the system memory later
on.
A cache-line stays shared as long
as nobody modifies the cache-line again. If one of the processors
modifies it then it must let this know to the other processors by
sending an invalidate
probe throughout the
system. The state becomes Modified in this processor and Invalid in the other ones. If it
continues to write to the cache line then it does not have to send
anymore invalidate probes because the cache line isn't shared
anymore. It has taken over the responsibility to update the system
memory with the modified cache line whenever it must evict the
cache-line later on.
|
|
3.19 Snooping the Data Cache for Cache
Coherency, The Snoop Tag RAM |
|
Other processors that access
system memory need to snoop the Data Cache to maintain cache
coherency using the MOESI protocol. We saw that there were two kinds
of snoops. Read and Invalidate snoops. The basic task of a snoop is
first to establish if the Data Cache contains the cache-line in
question. There is a third set of Tags available specially for the
snoop interface. ( The other two are used for the two regular ports
of the data cache ). The snoop-Tag ram has 1024 entries, one for
each cache line. It holds the Physical address bits [39:12]
belonging to each cache line.
|
Virtual Address bit used to access the L1 data
Cache
|
virtual page address |
offset in page |
offset in cache line |
|
W |
14 |
13 |
12 |
11 |
10 |
9 |
8 |
7 |
6 |
5 |
4 |
3 |
2 |
1 |
0 |
Physical Address used to snoop the L1 data
Cache
|
physical page address |
offset in page |
offset in cache line |
|
15 |
14 |
13 |
12 |
11 |
10 |
9 |
8 |
7 |
6 |
5 |
4 |
3 |
2 |
1 |
0 |
|
The regular Tag rams are accessed
with the virtual address. The Snoop Tag ram however must deal with
the physical address ! Fortunately many of the virtual address bits
needed are identical to the physical address bits. Only bits [15:12]
are different and thus useless. This means that we must read out the
Tags of all 16 possible cache-lines in parallel and then test if
anyone of them matches. Luckily enough this doesn't present to much
of a burden. The total bus width (in bit-lines) of for instance the
cache rams is 512 bit. Sixteen times a 28 bit Tag is less
(448) so there's space left for some extra bits like the state info
for each cache-line.
Once we know which of the 16
possible cache-lines hits then we know also the remaining virtual
address bits needed to access the cache plus the Way (0 or 1) which
holds the cache-line. The position itself, ( 1 out of 16 ) directly
provides the 3 extra address bits plus the Way bit. This
means we can now access the cache if needed in case of a Read Snoop
hit.
|
|
3.20 Snooping the L1 Data Cache and
outstanding stores in LS2 |
|
It is not necessary for
snoop
reads from other
processors that want to read a cache-line from the L1 data cache to
check for retired stores in LS2 that will write to the cache-line
they are about to read. This even though the data these stores
will write is already considered to be part of the memory by the
processor who issued the writes. It's is OK for other
processors to see these writes occur at a later stage. The only
effect externally is that it looks as if the processor is slightly
slower.
An external processor that writes
to a shared cache line must send snoop invalidates around. The snoop interface will invalidate the
local cache-line if it receives such a snoop invalidate that hits
the cache. The snoop interface must also set the hit/miss flag
to miss for all stores in the Load Store unit that want to write to
the cache-line that was hit. The later is not a specific
snoop operation however. It is needed in all cases in which a
cache-line is evicted or invalidated. These stores that originally
did hit but who are set back to miss will need to probe the cache
again.
|
|
3.21 Snooping LS2 for loads to recover
strict memory ordering in shared memory US Patent 6,473,837. |
|
An interesting trick allows the
Opterons to handle speculative out of order loads from shared memory
and still observe the strict memory access ordering required for
shared memory multiprocessing. The hardware detects violations and
can restore strict memory ordering when needed. A communicating processor may for
instance first write a new command for another processor to A1 in
memory and then increment a value A2 to notify that it has issued
the next command. The processor which is supposed to handle
the commands may find the value A2 incremented but still reads the
old command from A1 if it executes loads out of order.
The ability to handle loads
out of
order can
significantly speed up processing. Most notable is the example where
a first load misses the cache. An out of order processor may issue
another load which may hit the cache without waiting for the result
of the first load. It would be beneficial to maintain
out-of-order loads in a multiprocessing environment.
Another important speed
improvement is speculative processing. The first load that missed may have been the counter
A2 in our example. The new command must be fetched if A2 has been
increased. A conditional call is made based on a test of the value
of A2. A speculative processor attempts to predict the outcome of
the branch at the beginning of the pipeline. It may predict that the
counter has been incremented if it generally takes more time to
execute the command than it takes to provide a new command.
That is: The new command is
generally sitting waiting to be executed by the time the previous
command has been executed.
The speculative out of order
processor may first attempt to load the counter A2, It may miss but
the branch predictor has predicted that it was increased and the
command from A1 will be loaded for execution. The load from A1 may
hit the cache. We actually do not know if this is a new commando or
not. Let say it is the old one. The counter A2 still has to be
loaded from memory. If A2 is increased in the mean time then the
load that missed will cause the modified cache-line to be loaded in
the local data cache with the incremented counter included. The
processor will conclude that the branch prediction was correct and
erroneously carry on with the old command.
The Opteron has a snoop mechanism
that allows this kind of fully speculative out-of-order processing
for high performance multi-processing. The mechanism detects cases
which may go wrong and consequently restores memory ordering. We'll
illustrate the mechanism with the use of our example.
When the first processor
writes a new commando into A1 then it will send a snoop- invalidate around to invalidate the
cache-line in all other caches. This snoop invalidate will also
reach the snoop interface of the Load Store unit:
The snoop interface first checks
the entries for a load that did hit the
cache-line-to-be-invalidated. This load would then be the "old
command" from A1 in our example. When it finds a load hit then
it continues by checking all older loads to see any of them is
marked as a miss. This would then be the load of the A2
counter value in our example. It marks the Snoop ReSync flag of all the load misses it
finds. This flag will cause any succeeding instructions to be
canceled when the load is retired including the instruction that
loads A1. The load of A1 will be re-executed and will now
correctly read the new command from memory.
|
|
3.22 Snooping the TLB Flush Filter
CAM |
|
Snooping is used to preserve
memory coherency. The function of the TLB flush filter is to prevent
unnecessary flushes of the TLB's. It does so by monitoring up
to 32 areas in memory that are known to contain page table
translation information which is cached in the TLB's. These
entries must be snooped also by snoop invalidates from other
processors that may write to the page tables of our processor.
If any of the snoops hits a TLB flush filter entry then we know that
a TLB may have invalid entries and that the TLB flush filter may not
prevent the flushing of the TLB's anymore.
The snoop-invalidates are not send
if a processor is sure that a cache-line is not shared with other
processors . This suggests that the TLB's (being caches in their own
right) participate in the MOESI protocol for cache coherency
via the TLB flush
filter. The memory
page translation tables ( PML4, PDP, PDE and PTE entries) may be in
cacheable memory. A special flag has to be set in the Opteron
if the Operating System decides to put the tables in un-cacheable
memory. (
TLBCACHEDIS in HWCR )
|
|
Chapter 4, Opteron's Instruction Cache
and Decoding
|
|
|
|
|
|
4.1 Instruction Cache: More then
instructions alone |
|
|
Access to the Instruction
cache is 128 bit wide. 16 bytes of instructions can be loaded
from the cache each cycle. The instruction bytes are
accompanied with an extra 76 bits of extra information. This
extends the total width of the Instruc- ion cache
port to 204 bits. We're still counting only the bits that cover
the full Instruction Cache. That is: Each of the 1024 cache
lines has its own set of these extra bits. There are several
more fields that have less then 1024 entries and are valid
only for a subset of the cache lines. |
|
|
|
Instruction only |
Total
size |
|
Instruction
Cache size: |
64
kByte |
102
kByte |
|
Cache Line
size |
64
Byte |
102
Byte |
|
One Read
Port |
128
bit |
204
bit |
|
One Write
Port |
128
bit |
204
bit | |
Well known are the three so-called
pre-decode
bits attached to each byte.
They mark the start and end points of the complex variable length
x86 instructions and provide some functional information. The other
two fields are the parity bits, 1 parity bit for each 16 data bits, and the so-called
branch
selectors. ( eight
times 2 bit for each 16 byte line of instruction code
).
|
|
|
Ram
Size |
Bus
Size |
Comments |
|
Instruction
Code: |
64
kByte |
128
bit |
16 bytes
instruction code |
|
Parity
bits |
4
kByte |
8
bit |
One parity bit
for each 16 bit |
|
Pre-decode |
26
kByte |
52
bit |
3 bits per byte
(start, end, function) + 4 bit per 16 byte
line |
|
Branch
Selectors |
8
kByte |
16
bit |
2 bits for each
2 bytes of instruction code |
|
TOTAL |
102
kByte |
204
bit |
|
|
The Opteron's branch selectors are
different from those of the Athlon (32) and they now cover all
1024 cache-lines of the Instruction Cache. The branch selectors
contain local branch
prediction
information which can not be retrieved as readily as for instance
the pre-decode information. A piece of code has to be executed
multiple times before the branch-selectors become
meaningful.
This is the reason that the branch
selector bits are saved together with the instruction data in the
unified level 2 cache whenever a cache-line is evicted from the
instruction cache. The branch selectors represent one bit extra for
each byte. The level 2 cache has this bit already for ECC ( Error
Coding and Correction ) information. ECC is only used for data cache
lines and not for instruction cache lines. The latter do not
need ECC, a few parity bits per cache line is sufficient.
Instruction cache lines that are corrupted can always be retrieved
from external DRAM memory. |
|
|
|
|
|
|
|
|
|
4.2 The General Instruction
Format |
|
A short overview of the 64 bit
instruction format:
A series of prefixes can precede
the actual instructions. At the start we have the legacy
prefixes. The most important legacy prefixes are the operand
size override prefix (hex 66) and the address size override prefix
(hex 67). These prefixes can change the length of the entire
instruction because they change the length of the displacement and
immediate fields which can be 1, 2 or 4 bytes long.
The REX prefix (hex 4X) is the new 64 bit prefix which brings us 64 bit
processing. The value of X is
used to extend the number of General Purpose registers and SSE
registers from 8 to 16. Three bits are used for this purpose because
x86 can specify up to three registers per instruction for data and
address calculations. The fourth bit is used as operand size
override (64 bit or default size) |
|
AMD64 Instruction Format

|
|
The Escape prefix (hex 0F) is used
to identify SSE instructions. The Opcode is the actual start of the
instruction after the prefixes. It can be either one or two bytes
and may have an optional MODRM byte and SIB byte. The optional
displacement and immediate fields can contain constants used for
address and data calculations and can be 1, 2 or 4 bytes. The total
length of the instruction is limited to 15 bytes.
|
|
4.3 The
Pre-decode bits |
|
Each byte in the instruction cache
is accompanied with 3 pre-decode bits generated by the pre-decoder.
These bits accelerate the decoding of the variable length
instructions. Each instruction byte has a start bit that is set when the byte is the
start of a variable length instruction and a similar end bit. Both bits are set in case of a
single byte instruction. More information is given with the third
bit, the function
bit. The decoders
look first at the function bit at the last byte of the variable
length instruction. If the function bit is 0 then the instruction is
a so-called direct
path instruction
which can be handled directly by the functional units. Otherwise if
the function bit is 1 at the end byte then the instruction is a
so-called vector path
instruction. A more
complex operation that needs to be handled by a microcode
program.
|
|
Definition of
the Instruction Pre-decode
bits |
|
START
bit |
1 indicates first byte of an
instruction |
|
END
bit |
1 indicates last byte of an instruction |
|
FUNCTION bit |
rule
1: Direct Path instruction
if 0 on the last byte
Vector Path instruction if 1 on the last
byte
rule
2: 1 indicates Prefix byte of Direct
Path (except last
byte)
0 indicates Prefix byte of Vector
Path (except last byte)
rule
3: For vector-path instructions
only:
if the function bit of the MODRM byte is set then
the instruction contains a SIB
byte. |
|
Then, secondly, the function bits
identify the prefix bytes. Ones identify prefix bytes of
direct path instructions and zeroes define the prefix bytes of
vector-path instructions. Then, finally, in case of vector-path
instructions only: if the function bit of the MODRM byte is set then
the instruction also contains a SIB byte.
|
|
4.4 Massively Parallel
Pre-decoding
US Patents
6,460,132 &
6,260,134 |
|
|
We find a very large block
of logic with fourfold symmetry directly near the position
were the 16 byte blocks of data are read and written from and
to the instruction cache.
We'll discuss the most
likely candidate here, A fourfold incarnation of an earlier
pre-decoder described in gate level detail in US
Patent 6,260,134
This fourfold version can,
according to the patent which describes it, pre- decode an
entire line of 16 bytes in only two cycles by means of what it
calls: massively parallel pre-decoding. This circumferences a
basic problem in variable length pre-decoding and decoding in
general, being:
A second instruction can not
be decoded until the length of the first instruction is known.
The start position of the second instruction depends on the
length of the first instruction.
The massively parallel
pre-decoder avoids this problem by first pre-decoding the 16
possible instructions in parallel. Each instruction starts at
one of the 16 byte locations of the 16 byte line. It then
filters out the real instructions with the help of the
program-counter which points to the start byte of the next
instruction, depending on where we jump into the 16 byte
line.
16 bytes of instructions can
be fetched per cycle from the instruction cache to be fed to
the decoders. It may be that the line is not yet pre-decoded
or wrongly pre-decoded. (Data bytes between instructions can
mislead the pre- decoder).
If a branch is made to an
address which does not have its pre-decode start bit set then
we know that something is wrong. The instruction pipeline may
invoke the pre- decoding hardware in this case to initialize
or correct the pre-decoding bits within only two
cycles. |
|
 |
The massively parallel pre-decoder
uses four blocks, these blocks are an adapted version of an earlier
pre-decoder. A single block pre-decodes four possible instructions
in parallel. Each instruction starting at one of four subsequent
byte positions. The old single block was capable of stepping through
a 16 byte line in four cycles. The massively parallel pre-decoder
combines four of them and uses a second stage to resolve the
relations between the four: The start / end fixer / sorter.
|
|
4.5
Large Workload Branch Prediction |
|
|
Branch
Prediction is the technique that makes it possible to design
pipelined processors. The outcome of a conditional branch is
generally only known at the very end of the pipeline while we
need to have this information at the very beginning of the
pipeline. We need the branch outcome to know which line of
instructions to load next.
The loading
of a line of instructions already takes two cycles. If we
don't want to loose anymore cycles then we must have decided
on a new instruction pointer at the end of the cycle when 16
instruction byte line arrives from the instruction
cache.
This means
that there is no time at all to even look at the instruction
bytes, to try to identify conditional branches, and then to
look up what the behavior was of these branches in recent
history in order to make a prediction. Doing this alone would
cost us several cycles. |
|
|
|
|
4.6 Improved Branch Prediction
|
|
The Branch prediction hardware
does not make any attempt to look at the fetched instruction bytes
at all. It uses several data structures instead to rapidly select a
new address. It has a 2048 entry Branch Target Buffer and a 12 entry Return Stack to select a next Program Counter
address. It further uses two branch history structures, one for
local and one for global history, It uses these branch history
structures to predict the outcome of the branches. The so-called
branch
selectors are used
for local history while the global history counters are used for global
history.
|
| 4.7 The Branch Selectors |
|
The branch selectors embody the local history. Local
means that the prediction is based on the history of the branch
itself alone. Conditional branches that are taken about always in
the same way can be predicted with the branch selectors.
unconditional branches are also handled by the branch selectors.
Remember that there is no time to look at the actual code. What a
branch selector says is that history has shown that a branch will be
encountered that is almost certainly taken, conditional or
unconditional.
Now if it's not so certain that a
branch will be taken? The branch selectors may leave the
prediction in this case to the global branch prediction. The branch selectors will
predict the branch as taken to identify the branch but leave the
final decision to the global history counters by setting the global flag.
|
| 16
byte line of instruction code |
| 0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
|
BS0 |
BS1 |
BS2 |
BS3 |
BS4 |
BS5 |
BS6 |
BS7 |
|
Branch Selection |
K7 Athlon 32 |
K8 Athlon
64 |
|
3 |
take branch 2 |
take branch 3 (or return) |
|
2 |
take branch 1 |
take branch 2 (or return) |
|
1 |
return |
take branch 1 (or return) |
|
0 |
continue to next
line |
continue to next
line |
|
Each 16 byte line of instruction
code is accompanied with eight 2 bit branch-selectors (some patents
talk about 9) The branch selector within the line is selected with
bits [3:1] of the Instruction Fetch address. The branch selector
answers the question: I did enter this 16 byte line on this
particular address, now what 16 byte line should I load in the next
cycle? A line
can have multiple jumps, calls, returns. They can be conditional or
unconditional. We may have jumped anywhere in the middle of all
these branches. The branch selectors tell us what to do depending on
where we entered the line.
The K7 can predict two branches
per line plus one return. The new 64 bit core can predict up to
three branches per line and anyone of them may be a return according
to Opteron's optimization manual. ( There are no patents yet
so the table above is our own extrapolation ). The branch selectors
are saved together with the instruction code in the large One
Megabyte L2 cache whenever a cache-line is evicted from the
instruction cache. The most useful data to save there is the
information which can't be easily retrieved from the instruction
code: the branch history. Information like the actual branch target address or the fact that the branch is a
return is retrieved relatively fast in
most cases by the processor.
|
|
4.8 The
Branch Target Buffer |
|
The BTB ( Branch Target Buffer )
contains 2048 addresses from which the branch selectors can choose
the next cycle's Instruction Fetch address. Fred Weber's
MPF2001 Hammer presentation shows us that each 16 byte line can now
have up to four branch target addresses to choose from ( Up from two
in the case of the Athlon 32 ). Each branch target entry is
shared between eight lines. From the branch selectors we know that
any single line may not use more then 3 of these. We assume that
when a branch selector says: Select the 2nd branch, This means the
second branch available for the current line.
|
|
Most
important Branch Target Buffer
fields |
|
Line Tag
( 3 bit )
|
15 bit
cache index |
Cache Way
Select, 0 or
1 |
Return
Instruction |
Use Global
Prediction |
Offset
in
Instr.
Code |
|
Field
|
Description
|
|
Line
Tag ( 3 bit
) |
A branch target
buffer entry is shared between 8 lines.
The line tag
tells us if this entry belongs to the current
line. |
|
15 bit cache index |
These 15 bits
are sufficient to access the two 32 kB ways
of the 64
kB 2 way set-associative Instruction
Cache |
|
cache way
select ( 0 or 1
) |
Used to check
the way of the cache.. |
|
Return
Instruction |
This bit tells
us to use the address from the return stack
instead to
access the next line in the instruction
cache. |
|
Use
Global Prediction |
The Global Flag
leaves the final Branch Prediction
to the Global
History Counters. |
|
Offset in Instruction
code |
This tells us
where the end of the branch is located in the
16 byte line of
instruction
code. |
|
Each branch target entry needs a 3
bit tag to identify to which of the 8 possible lines of instructions
it belongs. Sharing branch target entries strongly reduces the
amount of branch target addresses needed. 2048 entries still
would represent 12 kByte if the full 48 bit addresses were stored in
the BTB. This would be a relatively large memory which you won't
find on Opteron's die. The trick used here is to store only the 16
bits which are actually needed to access the 64 kByte instruction
cache. The higher address bits are retrieved later on. The Opteron
has a new unit called the BTAC ( Branch Target Address
Calculator ) to support this.
|
|
4.9 The Global History Bimodal
Counters |
|
The Athlon 64 has 16,384 branch
history counters. Four times as much as its 32 bit predecessor. The
counters describe the likelihood that a branch is taken. They count
up to a maximum of 3 when branches are taken and count down to a
minimum of 0 when not taken. Counter values 3 and 2 predict a
branch as taken, see the table.
|
|
Definition of
the 2 bit Branch History
Counters |
|
Counter
Value |
Branch Prediction |
|
counter = 3
|
Strongly
Taken |
|
counter = 2
|
Weakly
Taken |
|
counter = 1
|
Weakly not
Taken |
|
counter = 0
|
Strongly not
Taken |
|
The BHBC is accessed by using four
bits of the Program Counter and the outcome (taken or not taken)
from the last eight branches. This is basically the same as in the
Athlon 32. The fact that we now have four times as many counters
means that we have four branch predictors per 16 byte instruction
line. This corresponds with the four branch target addresses per
line. This would be an improvement over the Athlon 32 were the two
branches per line could interfere which each others branch
predictions.
|
|
Addressing
the Branch History
Counters |
|
Instruction
Address bits 7:4
| |
|
Branch outcome
of the
eight previous
branches
| |
|
$$$$ |
$$$$$$$$
|
|
16,384 Branch History Counters
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Branch
prediction 0 |
Branch
prediction 1 |
Branch
prediction 2 |
Branch
prediction 3 |
|
Another improvement is that only branches whose global
bit was set participate in the global branch prediction. This
prevents branches
with a static behavior from polluting the global branch history. (
US Patent 6,502,188 describes this
in the context of the Athlon 32 ) The global bit is set
whenever a branch has a variable outcome. The GHBC table allows the
processors to predict global branch patterns of up to eight
branches.
|
|
4.10 Combined Local and Global Branch
Prediction with three branches per line |
|
A single 16 byte line with up to
three conditional branches represents a complex situation. If we
predict a first branch as not taken then we encounter the next
conditional branch which must be predicted also et-cetera. Does the
opteron handle this in multiple steps? or does it handle the whole
multiple branch prediction at once?
|
|
Local and Global
Branch Prediction with three
Branches per Line
|
|
IF
|
AND
|
THEN |
|
Branch
Selector
Selects Branch
1 |
Branch 1 is local,
or global and predicted taken |
TAKE BRANCH 0
|
|
Branch
Selector
Selects Branch
1 |
Branch 1 is global and predicted not taken and
Branch 2 is local, or global and predicted taken
|
TAKE BRANCH 1
|
|
Branch
Selector
Selects Branch
1 |
Branch 1 is global and predicted not taken and
Branch 2 is global and predicted not taken and
Branch 3 is local, or global and predicted taken
|
TAKE BRANCH 2
|
|
Branch
Selector
Selects Branch
1 |
Branch 1 is global and predicted not taken and
Branch 2 is global and predicted not taken and
Branch 3 is global and predicted not
taken |
GO
TO NEXT LINE |
|
Branch
Selector
Selects Branch
2 |
Branch 2 is local, or global and predicted taken
|
TAKE BRANCH 0
|
|
Branch
Selector
Selects Branch
2 |
Branch 2 is global and predicted not taken and
Branch 3 is local, or global and predicted taken
|
TAKE BRANCH 1
|
|
Branch
Selector
Selects Branch
2 |
Branch 2 is global and predicted not taken and
Branch 3 is global and predicted not taken
|
GO
TO NEXT LINE
|
|
Branch
Selector
Selects Branch
3 |
Branch 3 is local, or global and predicted taken
|
TAKE BRANCH 2
|
|
Branch
Selector
Selects Branch
3 |
Branch 3 is global and predicted not taken
|
GO
TO NEXT LINE
|
|
If we may take Fred Weber's
MPF2001 presentation as an indication here then we guess that it
takes the branches one step at a time. (The presentation shows a
single GHBC prediction per cycle ). A potential bottleneck may
indeed be the GHBC. A second and a third branch need a different "8
bit branch outcome" index into the table. The 8 bit value should be
shifted 1 and 2 positions further for the 2nd and the 3rd branch
with zeroes inserted to indicate "not taken" in order to operate
according the rules.
|
|
4.11
The Branch Target Address Calculator, Backup for the Branch
Target Buffer |
|
Another new improvement is the
BTAC, The Branch Target Address Generator, This unit is very useful
for several purposes. It can generate full (48 bit) branch addresses
two cycles after the 16 byte line of code has been loaded from the
cache. It works for most branches which typically use an 8 or 32 bit
displacement in the instruction to jump or call to code relative to
the program counter. The BTAC can probably identify return
instructions as well.
One task of the BTAC is a backup
function of the BTB (Branch Target Buffer). The BTB shares
each branch address with eight lines. We may find that the branch
selectors are OK but the branch target they select has been
overwritten by another branch. The branch selectors are maintained
for all cache-lines in the 64 kByte I Cache. They are also preserved
together with instruction cache-lines which are evicted from L1 to
the large 1 MegaByte L2 cache. It is unlikely that branch-selectors
which are reloaded from L2 into L1 still find their branch target
addresses in the BTB. On the contrary, the BTB entries should be
cleared whenever a cache-line is evicted from L1 to L2.
A cache-line that returns from L2
to L1 can restore the pre-decode bits rapidly (In two cycles with a
massively parallel pre-decoder) It has to restore the BTB entries as
well but this can take much more time. The Athlon 32 fills the BTB
with instruction-addresses that come back from the re-order buffer
when the branch is retired. This procedure would be repeated for
each branch in the 16 byte line when it is taken. It may well be
that the Athlon 64 still works this way. The BTAC can take over the
functionality of the BTB until the BTB entries are
restored.
The BTAC can use the lowest
Instruction Fetch address bits to see were we enter the 16 byte
line. It can then scan from that position to the first branch and
calculate the full 48 bit address by adding the 8 or 32 bit
displacement from the code. Now we have a calculated value which can
be used to index the cache. It is still a guessed address. The
certain address only comes when the branch instruction retires. The
BTAC may have picked the wrong branch for example.
We believe that the BTAC
calculates the full 48 bit address. We believe so because it can be
made to maintain the full 48 bit which has several advantages. The
48 bits would be lost whenever the BTB is used to predict an address
because it stores only a small portion of the address. The BTAC can
be used to maintain 48 bit because the BTB identifies the location
in the 16 byte line of the branch it uses. The BTAC can use this to
find the right branch and subsequently add the displacement to keep
the address at 48 bits.
There are two important tasks that
need the full 48 bit address. First: The branch-miss prediction test
hardware has to compare the full 48 bit "guess" address with the
actual 48 bit address as calculated by the branch instruction.
Secondly: The cache hit/miss test hardware needs the full 48 bit
"guess" address (virtual) to translate and compare it with the
(physical) address tag stored together with each
cache-line.
There are some patents without
BTAC that use a scheme of reversed TLB lookup to recover the full 48
bit (virtual) "guess" address from the (physical) cache tag and use
this for the branch miss prediction test. However such an address is
not useful for the cache hit-miss test ( It hits always!
).
|
|
4.12
Instruction Cache Hit / Miss detection, The Current Page
and BTAC |
|
The basic components for the
Instruction Cache hit/miss detection are basically the same as those
for the data cache. See section-3.3: "The Data
Cache Hit / Miss Detection: The cache tags and the primairy
TLB's" The single
port Instruction cache only needs a single tag ram and a single TLB.
The instruction cache also has a second level TLB ( see section-3.4) and it has its snoop tag ram (section-3.19). All these
structures are relatively simple to recognize on the
die-photo.
The current page register holds address bits [47:15] of the
"guessed" Instruction Fetch address. The BTB only stores the lower
15 Instruction Fetch address bits. The Fetch logic speculates that
the next 16 byte instruction line will be fetched from the same 32
kB page and that the upper address bits [47:15] remain the same.
Jumps and calls that cross the 32 kB border are miss predicted. The
higher bits of the fetch address [47:12] are needed for the cache
hit/miss logic. The virtual page address [47:12] is translated to a
physical page address [39:12] . This page address is then compared
to the two physical address tags read from the two way set
associative instruction cache to see if there is a hit in either
way.
The new BTAC ( Branch Target
Address Calculator) can recover the full 48 bit address from the
displacement field in the instruction code two cycles after the code
is fetched from the cache. This address can then be compared with
the current page register to check if the assumption that the branch
would not cross the 32 kB bounder was right. The cache hit/miss
logic in the mean time has translated and compared the guessed
address with the two instruction cache tags and produced the
hit/miss result.
|
|
Cache Hit / Miss and
Current Page Test
|
|
Cache Hit
|
Current Page OK
|
Continue
with the Instruction Line
Fetched
from the Instruction Cache |
|
Cache Hit
|
Current Page not
OK |
Re-access
the cache / TLB with
the
corrected Current Page |
|
Cache Miss
|
Current Page OK
|
Real
Cache Miss. Reload Cache-line
from L2
or memory. |
|
Cache Miss
|
Current Page not
OK |
Re-access the
cache / TLB with
the
corrected Current Page
|
|
The processor continues with the
16 instruction bytes fetched from the cache if there was a cache hit
and the 32 kB border was not crossed. The Fetch logic will re-access
the cache if the 32 kB border was crossed and will ignore the
hit/miss result in this case. If the 32 kB border was not crossed
and the TLB thus translated the right fetch address and there was a
cache miss then we may conclude that the cache miss was real and
that we have to reload the line from memory or L2. The BTAC
does not help in case of indirect branches. These still have to wait
until the correct address becomes available from the retired branch
instruction.
|
|
4.13 Instruction Cache Snooping
|
|
The Snoop interface of the
Instruction Cache is used to maintain Cache Coherency in a
multiprocessor environment and for Self Modifying Code
detection. Another processor that shares a cache-line with a
cache line in the Instruction cache sends snoop-invalidates
throughout the system when it writes into the shared cache-line. The
snoop interface checks if the snoop invalidate hits with a cache
line in the instruction cache. It will invalidate the line upon a
hit. The snoop interface works with physical addresses as described
in section 3.19
The instruction cache can share
cache-lines with other processors. It can not share a cache-line
with its own data cache however. The latter is forbidden because the
processor must correctly handle Self Modifying Code programs. The
Instruction and data cache are exclusive to each other as well as to
the unified level 2 cache. The snoop interface detects if a
cache-line load for the data cache hits a cache-line in the
instruction cache and invalidates the cache-line upon a
hit.
The instruction cache may share a
cache-line with a data-cache on another processor. This so-called
Cross Modifying Code case is less stringent. The exact moment at
which the other processor overwrites the instruction code is
uncertain. The only effect of a shared cache-line which is modified
by another processor is that we see the modification somewhat later,
as if the other processor was slightly slower.
Interesting is that the new ASN
(Address Space Number) could make it possible for the instruction
cache and data cache to share cache lines as long as they are
assigned to different processes with different ASN's. This would be
similar to the cross modifying case mentioned above. The hardware
however does not support it because the ASN's are not stored
together with the cache lines. It would not be worth the trouble
anyway from a performance point of view.
|
|
Athlon 64, Bringing
64 bits to the x86 Universe
|
|
Regards, Hans
|
|
|