====== Modern Processors: Cache, Pipeline, Superscalar, Branch Prediction, Hyperthreading ======
  
Modern processors have a highly complex design and include many units that primarily reduce software execution time.
  
===== Cache =====
  
Cache memory is a layer in the memory hierarchy that sits between main memory and processor registers. The main reason for introducing cache memory is that main memory, based on DRAM technology, is much slower than the processor, which is built with fast static (SRAM) logic. The cache exploits two properties of software: spatial locality and temporal locality. Spatial locality results from the fact that executed code is, in most cases, a sequence of instructions arranged directly one after another. Temporal locality arises because programs often run in loops, repeatedly working on a single set of data over short intervals. In both cases, a larger fragment of a program or data can be loaded into the cache and operated on without accessing main memory each time. Main memory is designed so that reading and writing data in blocks is significantly faster than accessing random addresses. These properties allow a code fragment to be read in its entirety from main memory into the cache and executed without accessing RAM for each instruction. For data, the processor performs calculations after reading a block into the cache, then stores the results in a single write sequence.
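
The effect of spatial locality can be observed directly from software. The C sketch below (a toy example; the array size and function names are illustrative, not taken from this chapter) sums the same matrix twice: row by row, using every element of each loaded cache line, and column by column, wasting most of each line:

<code c>
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static int matrix[ROWS][COLS];

/* Row-major traversal: consecutive addresses, good spatial locality.
   Each cache line fetched from main memory is fully consumed before
   the next one is needed. */
long sum_row_major(void)
{
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += matrix[r][c];
    return sum;
}

/* Column-major traversal: each access jumps COLS * sizeof(int) bytes,
   so on a large array every cache line contributes only one element
   before it is evicted - far more main-memory traffic. */
long sum_col_major(void)
{
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += matrix[r][c];
    return sum;
}

int main(void)
{
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}
</code>

Both functions compute the same result with the same number of additions; on typical hardware, the row-major version is nevertheless measurably faster, purely because of cache line reuse.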
  
In modern processors, the cache is divided into several levels, usually three. The first-level cache (L1) is the closest to the processor, the fastest, and is usually divided into separate instruction and data caches. The second-level cache (L2) is common to instructions and data, slower, and usually larger than L1. The third-level cache (L3) is the largest and the slowest; it sits closest to the computer's main memory. Typical latencies are 1-2 ns for L1, 3-5 ns for L2, and 10-20 ns for L3. An example i7 processor has, for each core, 32 KB of L1 code cache, 32 KB of L1 data cache, and 256 KB of L2 cache for both code and data. The third-level cache (L3) is 8 MB and is shared by all cores.
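
On a Linux system, this hierarchy can be inspected at runtime: the kernel publishes each cache of a core as a sysfs directory. The following C sketch (Linux-specific; it assumes the standard sysfs cacheinfo layout under /sys/devices/system/cpu) prints the level, type, and size of every cache attached to core 0:

<code c>
#include <stdio.h>

/* Walks the sysfs cacheinfo entries for cpu0. Each indexN directory
   describes one cache (L1 instruction, L1 data, L2, L3, ...). */
int main(void)
{
    char path[128], buf[64];

    for (int idx = 0; idx < 8; idx++) {
        const char *files[] = { "level", "type", "size" };
        for (int f = 0; f < 3; f++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/%s",
                     idx, files[f]);
            FILE *fp = fopen(path, "r");
            if (!fp)
                return 0;             /* no more cache indices */
            if (fgets(buf, sizeof buf, fp))
                printf("index%d %s: %s", idx, files[f], buf);
            fclose(fp);
        }
        putchar('\n');
    }
    return 0;
}
</code>

The same directories also expose coherency_line_size and ways_of_associativity, which correspond to the line length and associativity parameters discussed next.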
  
Besides size, important cache parameters are line length and associativity.
</figure>
  
Modern processors implement longer pipelines. For example, the Pentium III used a 10-stage pipeline, the Pentium 4 a 20-stage pipeline, and the Pentium 4 Prescott even a 31-stage pipeline. Does a longer pipeline mean faster program execution? Everything has benefits and drawbacks. The undoubted benefit of a longer pipeline is that more instructions can be executed simultaneously, yielding higher instruction throughput. A problem appears, however, with branch instructions. When a conditional jump appears in the instruction stream, the processor must choose which way the stream should follow: should the jump be taken or not? The answer usually depends on the result of a preceding instruction and is only known when the branch instruction is close to the end of the pipeline. In this situation, the branch prediction unit of a modern processor guesses how to handle the branch. If it guesses wrong, the pipeline contents are invalidated and the pipeline restarts from the beginning, causing stalls in program execution. The longer the pipeline, the more instructions must be invalidated. In modern microarchitectures, the pipeline length ranges from 12 to 20 stages.
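
The cost of mispredictions is easy to demonstrate: run the same loop over random and over sorted data. In the C sketch below (a classic illustration; the array size and the threshold 128 are arbitrary choices, not from this chapter), the branch inside the loop is unpredictable on random input but almost perfectly predictable once the data is sorted:

<code c>
#include <stdlib.h>
#include <stdio.h>

#define N (1 << 20)

/* Sums only the elements >= 128. On random 0..255 data the branch is
   taken with ~50% probability, so the predictor misses often and the
   pipeline is flushed repeatedly. On sorted data the branch becomes
   "always false, then always true" and predicts almost perfectly. */
long sum_large(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= 128)   /* predictable only on sorted input */
            sum += data[i];
    return sum;
}

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    static int data[N];
    for (size_t i = 0; i < N; i++)
        data[i] = rand() % 256;

    long unsorted = sum_large(data, N);   /* many mispredictions */
    qsort(data, N, sizeof(int), cmp_int);
    long sorted = sum_large(data, N);     /* near-perfect prediction */

    printf("%ld %ld\n", unsorted, sorted);
    return 0;
}
</code>

Timing the two calls (e.g. with clock()) typically shows the sorted run completing several times faster, even though both execute exactly the same instructions; the difference is the pipeline flushes caused by mispredicted branches.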
  
  
===== Superscalar =====
The superscalar processor increases program execution speed by executing more than one instruction per clock cycle. This is achieved by dispatching instructions simultaneously to different execution units of the processor. A superscalar processor may, but need not, implement two or more independent pipelines; rather, decoded instructions are sent to a chosen execution unit for further processing, as shown in Fig. {{ref>superscalar}}.
  
<figure superscalar>
</figure>
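
Software can only benefit from multiple execution units when it offers independent instructions to dispatch. The C sketch below (a toy example, not from this chapter) breaks a single dependency chain into two: in the two-accumulator loop, the additions of the two chains do not depend on each other, so a superscalar core can issue them to two execution units in the same cycle:

<code c>
#include <stdio.h>

#define N 4096

/* Single accumulator: every addition depends on the previous result,
   so at most one add of the chain can complete per cycle, even on a
   superscalar core. */
long sum_serial(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Two independent accumulators: the two addition chains have no data
   dependency between them and can be dispatched to two execution
   units simultaneously, roughly doubling throughput. */
long sum_two_chains(const long *a, int n)
{
    long s0 = 0, s1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];       /* chain 0 */
        s1 += a[i + 1];   /* chain 1 */
    }
    if (n & 1)            /* tail element when n is odd */
        s0 += a[n - 1];
    return s0 + s1;
}

int main(void)
{
    static long a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;
    printf("%ld %ld\n", sum_serial(a, N), sum_two_chains(a, N));
    return 0;
}
</code>
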
<todo @ktokarz>i7 -> Intel i7?</todo>
In the x86 family, the first processor with two execution paths was the Pentium, which had two execution units called U and V. Modern x64 processors like the i7 implement six execution units. Not all execution units have the same functionality. For example, in the i7 processor, each execution unit has different capabilities, as shown in table {{ref>executionunits}}.
  
<table executionunits>