| en:multiasm:papc:chapter_6_11 [2026/02/19 10:20] – [SSE4] ktokarz | en:multiasm:papc:chapter_6_11 [2026/02/27 02:40] (current) – jtokarz | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== MMX, SSE and AVX Extensions ====== | ||
| + | At some point of personal computers' | ||
| + | |||
| + | ===== MMX ===== | ||
| + | The MMX instruction set operates on 64-bit packed data types. Packed means that the 64-bit value can hold 8 bytes, 4 words, or 2 doublewords. New data types were defined on this basis. Packed data types are also called vectors. Please refer to the section " | ||
| + | ==== Data transfer ==== | ||
| + | To copy data from memory or between registers, two new data transfer instructions were introduced. | ||
| + | The **movd** instruction allows copying 32 bits of data between MMX registers and memory or between MMX registers and general-purpose registers of the main processor. The **movq** instruction allows copying 64 bits of data between MMX registers and memory or between two MMX registers. In all MMX instructions except data transfer, the first operand, which is a destination operand, is always an MMX register. | ||
| + | < | ||
| + | In modern 64-bit processors, the **movq** instruction is extended to copy 64-bit data between MMX registers and general-purpose registers of the main processor. | ||
| + | </ | ||
| + | ==== Basic vector calculations ==== | ||
| + | The main idea of vector data processing is shown in figure {{ref> | ||
| + | <figure mmxprocessing> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | When performing arithmetic operations, the main processor stores additional information in flags in the FLAGS register. The MMX unit does not have flags for each calculated result, so a different approach is needed. The key information for arithmetic operations is the carry when adding and the borrow when subtracting. The simplest solution is to omit the carry; if the maximum value is exceeded, the most significant bits are truncated and the result wraps around to a smaller value. In the case of subtraction, | ||
| + | |||
| + | <figure mmxpaddw> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | <figure mmxpaddsw> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | <figure mmxpaddusw> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The last letter in the instruction specifies the size of arguments and results. MMX addition and subtraction instructions are shown in the table {{ref> | ||
| + | <table mmxaddsub> | ||
| + | < | ||
| + | ^ Mnemonic ^ operation ^ argument size ^ overflow management ^ | ||
| + | | **paddb** | addition | 8 bytes | wraparound | | ||
| + | | **paddw** | addition | 4 words | wraparound | | ||
| + | | **paddd** | ||
| + | | **paddsb** | addition | 8 bytes | signed saturation | | ||
| + | | **paddsw** | addition | 4 words | signed saturation | | ||
| + | | **paddusb** | addition | 8 bytes | unsigned saturation | | ||
| + | | **paddusw** | addition | 4 words | unsigned saturation | | ||
| + | | **psubb** | subtraction | 8 bytes | wraparound | | ||
| + | | **psubw** | subtraction | 4 words | wraparound | | ||
| + | | **psubd** | ||
| + | | **psubsb** | subtraction | 8 bytes | signed saturation | | ||
| + | | **psubsw** | subtraction | 4 words | signed saturation | | ||
| + | | **psubusb** | subtraction | 8 bytes | unsigned saturation | | ||
| + | | **psubusw** | subtraction | 4 words | unsigned saturation | | ||
| + | </ | ||
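The three overflow strategies in the table can be compared with a small reference model. This is only a sketch in Python, not MMX code: the function names `to_signed16` and `padd_words` are hypothetical, and a vector is assumed to be a list of four 16-bit patterns with the lowest word first.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern (0..0xFFFF) as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def padd_words(a, b, mode):
    """Model of word-wise addition: mode is 'wrap' (paddw),
    'signed' (paddsw) or 'unsigned' (paddusw)."""
    result = []
    for x, y in zip(a, b):
        if mode == "wrap":
            r = (x + y) & 0xFFFF                      # carry is simply dropped
        elif mode == "unsigned":
            r = min(x + y, 0xFFFF)                    # clamp at the unsigned maximum
        elif mode == "signed":
            s = to_signed16(x) + to_signed16(y)
            s = max(-0x8000, min(0x7FFF, s))          # clamp to the signed range
            r = s & 0xFFFF
        else:
            raise ValueError("unknown mode: " + mode)
        result.append(r)
    return result


a = [0xFFFF, 0x7FFF, 0x0001, 0x8000]
b = [0x0001, 0x0001, 0x0002, 0x8000]
for mode in ("wrap", "signed", "unsigned"):
    print(mode, [hex(v) for v in padd_words(a, b, mode)])
```

The same input gives three different results in the lanes that overflow, which is why the choice of instruction depends on whether the data is treated as signed or unsigned.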
| + | |||
| + | Multiplication requires twice as much space for the result as the size of the arguments. The solution to this issue is to split the operation into two multiplication instructions, | ||
| + | |||
| + | <figure mmxmultiplyandunpck> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The code that calculates the presented multiplication can look as follows: | ||
| + | <code asm> | ||
| + | Numbers DW 01ACh, 2112h, 03F3h, 00A4h, | ||
| + | 0006h, 0137h, 0AB7h, 00D8h | ||
| + | LEA ESI, Numbers | ||
| + | MOVQ mm0, [ESI] ; mm0 = 00A4 03F3 2112 01AC | ||
| + | MOVQ mm1, [ESI+8] | ||
| + | MOVQ mm2, mm0 | ||
| + | PMULLW | ||
| + | PMULHW | ||
| + | MOVQ mm2, mm0 | ||
| + | PUNPCKLWD | ||
| + | PUNPCKHWD | ||
| + | </ | ||
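To check the idea of combining the low and high halves of the products, the sequence can be modelled outside the processor. The following Python sketch is an assumption-laden model, not the author's code: the helper names are hypothetical, vectors are lists with the lowest element first, and the input words are the ones from the `Numbers` table above.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def pmullw(a, b):
    """Low 16 bits of each signed word product."""
    return [(to_signed16(x) * to_signed16(y)) & 0xFFFF for x, y in zip(a, b)]


def pmulhw(a, b):
    """High 16 bits of each signed word product."""
    return [((to_signed16(x) * to_signed16(y)) >> 16) & 0xFFFF for x, y in zip(a, b)]


def punpcklwd(dst, src):
    """Interleave the two low words of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def punpckhwd(dst, src):
    """Interleave the two high words of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]


def words_to_dwords(w):
    """View four words as two doublewords (little-endian order)."""
    return [w[0] | (w[1] << 16), w[2] | (w[3] << 16)]


mm0 = [0x01AC, 0x2112, 0x03F3, 0x00A4]
mm1 = [0x0006, 0x0137, 0x0AB7, 0x00D8]
low, high = pmullw(mm0, mm1), pmulhw(mm0, mm1)
products = words_to_dwords(punpcklwd(low, high)) + words_to_dwords(punpckhwd(low, high))
print([hex(p) for p in products])
```

Interleaving the low-half vector with the high-half vector places each 16-bit pair next to each other, so every doubleword of the two results holds one complete 32-bit product.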
| + | |||
| + | ==== Advanced calculations ==== | ||
| + | The MMX set of instructions also contains the **pmaddwd** (multiply and add packed words to doublewords) instruction. It computes the products of the corresponding signed word operands. The four intermediate doubleword products are summed in pairs to produce two doubleword results. Its behaviour is shown in figure {{ref> | ||
| + | |||
| + | <figure mmxpmaddw> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
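The multiply-and-add behaviour described above can be sketched as a short Python model. The function names are hypothetical, and a vector is assumed to be a list of four 16-bit patterns with the lowest word first; the result is a list of two 32-bit patterns.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def pmaddwd(a, b):
    """Multiply corresponding signed words, then add adjacent products."""
    p = [to_signed16(x) * to_signed16(y) for x, y in zip(a, b)]
    return [(p[0] + p[1]) & 0xFFFFFFFF, (p[2] + p[3]) & 0xFFFFFFFF]


print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))  # 1*5 + 2*6 and 3*7 + 4*8
```

This pattern is the building block of dot products and digital filters, where pairs of samples are multiplied by pairs of coefficients and accumulated.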
| + | |||
| + | ==== Comparison ==== | ||
| + | The set of comparison instructions allows for comparing values in two vectors. The result is stored as a mask of bits, with all ones at the element of the vector where the comparison result is true, and all zeros in the opposite case. There are six compare instructions as shown in table {{ref> | ||
| + | <table mmxtabcompare> | ||
| + | < | ||
| + | ^ Mnemonic ^ comparison type ^ argument size ^ | ||
| + | | **pcmpeqb** | equal | 8 bytes | | ||
| + | | **pcmpeqw** | equal | 4 words | | ||
| + | | **pcmpeqd** | equal | 2 doublewords | | ||
| + | | **pcmpgtb** | greater than | 8 bytes | | ||
| + | | **pcmpgtw** | greater than | 4 words | | ||
| + | | **pcmpgtd** | greater than | 2 doublewords | | ||
| + | </ | ||
| + | An example of comparison instruction for equality of two vectors of words is shown in figure {{ref> | ||
| + | <figure mmxcompare> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
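A minimal Python model of the mask produced by word comparisons is sketched below. The names are hypothetical, vectors are lists of 16-bit patterns with the lowest word first, and the greater-than comparison is assumed to treat the words as signed, as the MMX instructions do.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def pcmpeqw(a, b):
    """All ones where the words are equal, all zeros elsewhere."""
    return [0xFFFF if x == y else 0x0000 for x, y in zip(a, b)]


def pcmpgtw(a, b):
    """All ones where the signed word in a is greater than the one in b."""
    return [0xFFFF if to_signed16(x) > to_signed16(y) else 0x0000 for x, y in zip(a, b)]


print([hex(m) for m in pcmpeqw([1, 2, 3, 4], [1, 0, 3, 0])])
```

Because the result is a bit mask rather than a flag, it can be combined directly with the logical instructions to select elements without branching.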
| + | |||
| + | ==== Data conversion ==== | ||
| + | The unpack instructions presented in figure {{ref> | ||
| + | |||
| + | <figure mmxpunpckhbw> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | <figure mmxpunpcklbw> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The pack instructions are used to shrink the size of arguments and pack them into smaller data. Only three pack instructions are implemented in the MMX extension: **packsswb** - pack words into bytes with signed saturation, **packssdw** - pack doublewords into words with signed saturation, and **packuswb** - pack words into bytes with unsigned saturation. An example of a pack instruction is shown in figure {{ref> | ||
| + | |||
| + | <figure mmxpack> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
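The effect of saturation during packing can be illustrated with a small Python model. It is a sketch under stated assumptions: the function names are hypothetical, each operand is a list of four 16-bit patterns with the lowest word first, and the result lists the eight bytes with the destination operand's words first.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def packsswb(dst, src):
    """Pack signed words into bytes with signed saturation (-128..127)."""
    return [max(-128, min(127, to_signed16(w))) & 0xFF for w in dst + src]


def packuswb(dst, src):
    """Pack signed words into bytes with unsigned saturation (0..255)."""
    return [max(0, min(255, to_signed16(w))) for w in dst + src]


d = [0x0001, 0x00FF, 0xFF80, 0x8000]
s = [0x7FFF, 0xFFFF, 0x0080, 0x007F]
print([hex(b) for b in packsswb(d, s)])
print([hex(b) for b in packuswb(d, s)])
```

Values that do not fit the smaller type are clamped to the nearest representable value instead of being truncated, which is usually the desired behaviour for pixel and audio data.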
| + | |||
| + | ==== Shift ==== | ||
| + | Packed shift instructions perform shift operations on elements of the specified size. All elements of the vector are shifted separately. In a logical shift, the vacated bits are filled with zeros; in an arithmetic right shift, the most significant bit is copied to preserve the sign of the value. | ||
| + | There are eight shift instructions, | ||
| + | <table mmxshift> | ||
| + | < | ||
| + | ^ Mnemonic ^ operation ^ argument size ^ type of shift ^ | ||
| + | | **psllw** | shift left | 4 words | logical | | ||
| + | | **pslld** | shift left | 2 doublewords | logical | | ||
| + | | **psllq** | shift left | 1 quadword | logical | | ||
| + | | **psrlw** | shift right | 4 words | logical | | ||
| + | | **psrld** | shift right | 2 doublewords | logical | | ||
| + | | **psrlq** | shift right | 1 quadword | logical | | ||
| + | | **psraw** | shift right | 4 words | arithmetic | | ||
| + | | **psrad** | shift right | 2 doublewords | arithmetic | | ||
| + | </ | ||
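The difference between logical and arithmetic shifts on words can be sketched with the following Python model. The function names are hypothetical; a vector is a list of 16-bit patterns, and the handling of shift counts above 15 (zero for logical shifts, sign fill for the arithmetic shift) follows the Intel definition of these instructions.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x


def psllw(v, count):
    """Logical left shift of each word; counts above 15 give zero."""
    return [(x << count) & 0xFFFF if count < 16 else 0 for x in v]


def psrlw(v, count):
    """Logical right shift of each word; counts above 15 give zero."""
    return [x >> count if count < 16 else 0 for x in v]


def psraw(v, count):
    """Arithmetic right shift; the sign bit is replicated."""
    count = min(count, 15)
    return [(to_signed16(x) >> count) & 0xFFFF for x in v]


v = [0x8000, 0x0010]
print([hex(x) for x in psrlw(v, 4)], [hex(x) for x in psraw(v, 4)])
```

For the negative word 0x8000, the logical shift produces a positive value, while the arithmetic shift keeps the sign, which corresponds to a signed division by a power of two rounded towards minus infinity.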
| + | |||
| + | ==== Logical ==== | ||
| + | MMX logical instructions operate on the 64-bit data as a whole. They perform bitwise operations as shown in table {{ref> | ||
| + | <table mmxlogical> | ||
| + | < | ||
| + | ^ Mnemonic ^ operation ^ | ||
| + | | **pand** | AND | | ||
| + | | **pandn** | AND NOT | | ||
| + | | **por** | OR | | ||
| + | | **pxor** | XOR | | ||
| + | </ | ||
| + | |||
| + | ==== Co-existence of FPU and MMX ==== | ||
| + | MMX instructions use the same physical registers as the FPU. As a result, mixing FPU and MMX instructions in the same fragment of the code is not possible. Switching between FPU and MMX in a code requires executing the **emms** instruction, | ||
| + | |||
| + | ===== SSE ===== | ||
| + | SSE is a large group of instructions that extends SIMD processing to floating-point calculations and increases the size and number of the registers. The abbreviation SSE stands for Streaming SIMD Extensions. As the number of instructions introduced in all SSE versions runs to a few hundred, this section presents a general overview of each SSE version and detailed information on some selected instructions. | ||
| + | The first group of SSE instructions defines a new vector data type containing four single-precision floating-point numbers. It's easy to calculate that it requires the 128-bit registers. These new registers are named XMM0 - XMM7 and are separated from any previously implemented registers, so SSE floating-point operations do not conflict with MMX and FPU. | ||
| + | ==== Data transfer ==== | ||
| + | |||
| + | In modern processors, it is very important to transfer data from and to memory effectively. The memory management unit can perform data transfer much faster if the data is aligned to a specific address. For SSE instructions, | ||
| + | |||
| + | ==== Calculations ==== | ||
| + | The SSE implements the vector and scalar calculations on single-precision floating-point numbers. No prefix for instruction names operating on floating-point numbers was added, but the mnemonic suffix describes the type. PS (packed single) - action on vectors, | ||
| + | SS (scalar single) - operation on scalars. If the instructions operate on halves of XMM registers (i.e. either refer to bits 0..63 or 64..127), the instruction mnemonics contain the letter L or H. | ||
| + | The idea of vector and scalar operations is shown in figure {{ref> | ||
| + | <figure sse1vector> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | <figure sse1scalar> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | In the SSE extension, mathematical calculations on single-precision floating-point numbers are implemented in both vector (packed) and scalar versions. These instructions are summarised in table {{ref> | ||
| + | <table ssemath> | ||
| + | < | ||
| + | ^ Mnemonic ^ operation ^ argument type ^ | ||
| + | | **addps** | addition | vector | | ||
| + | | **addss** | addition | scalar | | ||
| + | | **subps** | subtraction | vector | | ||
| + | | **subss** | subtraction | scalar | | ||
| + | | **mulps** | multiplication | vector | | ||
| + | | **mulss** | multiplication | scalar | | ||
| + | | **divps** | division | vector | | ||
| + | | **divss** | division | scalar | | ||
| + | | **rcpps** | reciprocal | vector | | ||
| + | | **rcpss** | reciprocal | scalar | | ||
| + | | **sqrtps** | square root | vector | | ||
| + | | **sqrtss** | square root | scalar | | ||
| + | | **rsqrtps** | reciprocal of square root | vector | | ||
| + | | **rsqrtss** | reciprocal of square root | scalar | | ||
| + | | **maxps** | maximum (bigger) value | vector | | ||
| + | | **maxss** | maximum (bigger) value | scalar | | ||
| + | | **minps** | minimum (smaller) value | vector | | ||
| + | | **minss** | minimum (smaller) value | scalar | | ||
| + | </ | ||
| + | |||
| + | ==== Comparison ==== | ||
| + | Besides math calculations, | ||
| + | |||
| + | <table ssecompare> | ||
| + | < | ||
| + | ^ Pseudoinstruction ^ operation ^ instruction ^ | ||
| + | | **cmpeqss** xmm1, xmm2 | equal | cmpss xmm1, xmm2, 0 | | ||
| + | | **cmpltss** xmm1, xmm2 | less than | cmpss xmm1, xmm2, 1 | | ||
| + | | **cmpless** xmm1, xmm2 | less or equal | cmpss xmm1, xmm2, 2 | | ||
| + | | **cmpunordss** xmm1, xmm2 | unordered | cmpss xmm1, xmm2, 3 | | ||
| + | | **cmpneqss** xmm1, xmm2 | not equal | cmpss xmm1, xmm2, 4 | | ||
| + | | **cmpnltss** xmm1, xmm2 | not less than | cmpss xmm1, xmm2, 5 | | ||
| + | | **cmpnless** xmm1, xmm2 | not less or equal | cmpss xmm1, xmm2, 6 | | ||
| + | | **cmpordss** xmm1, xmm2 | ordered | cmpss xmm1, xmm2, 7 | | ||
| + | </ | ||
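The eight predicates from the table can be modelled in Python to show how they treat NaN operands. This is a sketch, not processor code: the function name `cmpss_mask` is hypothetical, and the treatment of unordered operands (the negated predicates are true when either operand is NaN) is taken from the Intel definition of **cmpss**.

```python
import math


def cmpss_mask(a, b, imm8):
    """Return the 32-bit mask written by cmpss for predicate imm8 (0..7)."""
    unordered = math.isnan(a) or math.isnan(b)
    outcome = {
        0: not unordered and a == b,      # equal
        1: not unordered and a < b,       # less than
        2: not unordered and a <= b,      # less or equal
        3: unordered,                     # unordered
        4: unordered or a != b,           # not equal
        5: unordered or not (a < b),      # not less than
        6: unordered or not (a <= b),     # not less or equal
        7: not unordered,                 # ordered
    }[imm8 & 7]
    return 0xFFFFFFFF if outcome else 0x00000000


nan = float("nan")
print(hex(cmpss_mask(1.0, 2.0, 1)), hex(cmpss_mask(nan, 2.0, 1)), hex(cmpss_mask(nan, 2.0, 5)))
```

Note that "not less than" is not the same as "greater or equal": with a NaN operand the former is true, while any ordered comparison is false.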
| + | |||
| + | With the use of the **comiss** instruction, | ||
| + | |||
| + | ==== Logical instructions ==== | ||
| + | There are four logical instructions which operate on all 128 bits of the XMM register. These are **andps**, **andnps**, **orps** and **xorps**. They perform the bitwise AND, AND NOT, OR and XOR operations, respectively. | ||
| + | |||
| + | ==== Data shuffle ==== | ||
| + | Two unpack instructions, | ||
| + | |||
| + | <figure sseunpack> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | A more universal instruction is **shufps**. It selects two of the four single-precision values from the destination (first) argument and writes them to the lower half of the result, which is stored in the destination register. The upper half of the result is filled with two single-precision values selected from the source (second) argument. Which values are taken is determined by the third argument, an 8-bit immediate. Each two-bit field of the immediate holds the index of the selected packed single value: 11 selects X3 or Y3, 10 selects X2 or Y2, 01 selects X1 or Y1 and 00 selects X0 or Y0. It is presented in figure {{ref> | ||
| + | |||
| + | <figure sseshuffle> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
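The selection performed by **shufps** can be written as a few lines of Python. This is a hypothetical model, assuming vectors are lists of four elements with the lowest element first; following the Intel definition, the two low elements of the result are selected from the destination operand and the two high elements from the source operand.

```python
def shufps(dst, src, imm8):
    """Model of shufps dst, src, imm8; returns the new content of dst."""
    return [
        dst[imm8 & 3],          # bits 1:0 choose result element 0 from dst
        dst[(imm8 >> 2) & 3],   # bits 3:2 choose result element 1 from dst
        src[(imm8 >> 4) & 3],   # bits 5:4 choose result element 2 from src
        src[(imm8 >> 6) & 3],   # bits 7:6 choose result element 3 from src
    ]


x = ["x0", "x1", "x2", "x3"]
y = ["y0", "y1", "y2", "y3"]
print(shufps(x, y, 0b11100100))
print(shufps(x, x, 0b00011011))  # same register twice: reverses the element order
```

Using the same register for both operands makes it possible to produce any permutation or broadcast of its four elements.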
| + | ==== Other instructions ==== | ||
| + | Together with new data registers, an additional control register appeared in the processor. It is named MXCSR and is similar in meaning of flags to the FPU control register. New instructions are implemented, | ||
| + | Some additional MMX instructions were introduced together with the SSE extension. | ||
| + | In the SSE, the first set of data conversion instructions was implemented. The summary of all these instructions will be presented in the following chapter. Also, cache supporting instructions were added. These instructions will be described in the chapter on optimisation. | ||
| + | |||
| + | ===== SSE2 ===== | ||
| + | The SSE2 set of instructions implements integer vector operations using the XMM registers. In general, the same instruction mnemonics defined for MMX can be used with XMM registers, offering twice as long vectors. Additionally, | ||
| + | In the SSE2 extension, the denormals-are-zeros mode was introduced. In this mode, the processor automatically treats all denormalised floating-point arguments as zero. In such a case, the flag indicating a denormalised argument is not set, and no exception is raised. The denormals-are-zeros mode is not compliant with IEEE Standard 754, but it allows implementation of faster algorithms for advanced audio and video processing. | ||
| + | The arithmetic instructions are similar to SSE, but they possess the suffix of **pd** - packed double or **sd** - scalar double instead of **ps** and **ss**, respectively. | ||
| + | |||
| + | ==== Conversion ==== | ||
| + | In figure {{ref> | ||
| + | |||
| + | <figure sse2conversions> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | ===== SSE3 ===== | ||
| + | The SSE3 is a set of 13 instructions. The main innovation in SSE3 is the implementation of horizontal instructions. These instructions perform calculations on the elements of a vector within the same register. There are four such instructions. The **haddpd** performs horizontal addition of double-precision values, the **hsubpd** is a horizontal subtraction of double-precision values, the **haddps** is a horizontal addition of single-precision values, and the **hsubps** is a horizontal subtraction of single-precision values. | ||
| + | All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand' | ||
| + | <figure sse3hsubpd> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | When the source vectors have more than two elements, as in the **hsubps** instruction, | ||
| + | <figure sse3hsubps> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
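The horizontal instructions can be summarised with a short Python model. The function names mirror the mnemonics but the code is only a sketch: vectors are lists with the lowest element first, and the element order of the result follows the Intel definition (results from the destination operand fill the lower part, results from the source operand fill the upper part).

```python
def hsubpd(dst, src):
    """Two doubles per operand: [d0 - d1, s0 - s1]."""
    return [dst[0] - dst[1], src[0] - src[1]]


def haddps(dst, src):
    """Four singles per operand: adjacent pairs are added."""
    return [dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]]


def hsubps(dst, src):
    """Four singles per operand: in each adjacent pair, lower minus higher."""
    return [dst[0] - dst[1], dst[2] - dst[3], src[0] - src[1], src[2] - src[3]]


print(hsubpd([5.0, 3.0], [10.0, 4.0]))
print(haddps([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

Applying **haddps** twice to a register paired with itself leaves the sum of all four elements in every position, which is a common way to finish a vector reduction.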
| + | |||
| + | ===== SSSE3 ===== | ||
| + | This abbreviation stands for Supplemental Streaming SIMD Extension 3. It is a set of 16 instructions introduced in the Core 2 architecture. It implements integer horizontal operations on XMM registers. The principles are the same as in horizontal instructions in SSE3, but instructions can process vectors of doublewords or words. They are summarised in the table {{ref> | ||
| + | |||
| + | <table SSSE3horizontaltable> | ||
| + | < | ||
| + | ^ Instruction | ||
| + | | **phaddd** | ||
| + | | **phaddw** | ||
| + | | **phaddsw** | ||
| + | | **phsubd** | ||
| + | | **phsubw** | ||
| + | | **phsubsw** | ||
| + | </ | ||
| + | |||
| + | Two data shuffle instructions are worth mentioning. The **pshufb** instruction makes copies of bytes from the first 128-bit operand based on the control information taken from the second 128-bit operand. Each byte in the control operand determines the resulting byte in the respective position. | ||
| + | * bit 7 is 1 - byte is cleared | ||
| + | * bit 7 is 0 - byte contains a copy of the source byte | ||
| + | * bits 0-3 - a number of the source byte to be copied | ||
| + | The illustration is shown in figure {{ref> | ||
| + | <figure sse3pshufb> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
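The byte selection rules listed above can be modelled in Python as follows. The function name and the list representation (sixteen bytes, lowest byte first) are assumptions made for illustration.

```python
def pshufb(data, control):
    """Model of pshufb on 16-byte vectors: data is the first operand,
    control is the second operand."""
    result = []
    for c in control:
        if c & 0x80:
            result.append(0)               # bit 7 set: the byte is cleared
        else:
            result.append(data[c & 0x0F])  # bits 0-3: index of the source byte
    return result


data = list(range(0x10, 0x20))             # bytes 0x10 .. 0x1F
reverse = list(range(15, -1, -1))          # control vector 15, 14, ..., 0
print([hex(b) for b in pshufb(data, reverse)])
```

With a suitable control vector, a single instruction can reverse byte order, broadcast one byte to all positions, or implement a 16-entry lookup table.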
| + | |||
| + | The **palignr** instruction combines bytes from two source operands as shown in figure {{ref> | ||
| + | |||
| + | <figure sse3palignr> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
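A compact Python model may help to follow the byte alignment. It is a sketch under assumptions: the function name is hypothetical, operands are lists of sixteen bytes with the lowest byte first, and the semantics follow the Intel definition, where the destination operand forms the upper half of a 32-byte intermediate value that is shifted right by the number of bytes given in the immediate.

```python
def palignr(dst, src, imm8):
    """Concatenate dst (high) and src (low), shift right by imm8 bytes,
    and keep the lowest 16 bytes."""
    combined = src + dst + [0] * 32        # zeros are shifted in from the top
    shift = min(imm8, 32)
    return combined[shift:shift + 16]


src = list(range(0, 16))
dst = list(range(16, 32))
print(palignr(dst, src, 4))                # bytes 4 .. 19 of the combined value
```

This makes it possible to read a 16-byte window that straddles two aligned blocks without performing an unaligned memory access.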
| + | |||
| + | ===== SSE4 ===== | ||
| + | The SSE4 is composed of SSE4.1 and SSE4.2. These groups include instructions supplementing previous extensions. For example, there are eight instructions which expand support for packed integer minimum and maximum determination, | ||
| + | The **dpps** and **dppd** | ||
| + | <figure sse4dotproduct> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
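The role of the immediate operand in the dot product can be sketched in Python. This model is an assumption-based illustration of **dpps** following the Intel definition: the upper four bits of the immediate select which element products take part in the sum, and the lower four bits select which elements of the destination receive the sum (the others are set to zero). The function name is hypothetical, and the model uses Python floats, so it does not reproduce the single-precision rounding of the real instruction.

```python
def dpps(dst, src, imm8):
    """Model of dpps dst, src, imm8 on vectors of four floats."""
    total = sum(
        (d * s for i, (d, s) in enumerate(zip(dst, src)) if imm8 & (0x10 << i)),
        0.0,
    )
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(dpps(a, b, 0xF1))   # full dot product stored in element 0 only
print(dpps(a, b, 0x7F))   # three-element dot product broadcast to all elements
```

The mask 0x7F is convenient for three-component vectors stored in four-element registers, because the unused fourth element is left out of the sum.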
| + | |||
| + | There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. The type of the data is specified with the suffix of the mnemonic: b - bytes, w - words, d - doublewords, | ||
| + | |||
| + | The blending instructions copy elements of vectors, mixing two sources into the destination. The **blendps**, | ||
| + | <figure sse4blendpd> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The instructions **blendvps**, | ||
| + | <figure sse4blendvpd> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
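Both kinds of blending can be expressed as a simple per-element choice. The Python sketch below uses hypothetical function names and lists with the lowest element first. It assumes the Intel definition: for **blendps**, bit i of the immediate selects the source element i; for **blendvps**, the selection is made by the most significant bit of each element of the mask register (XMM0 in the SSE4.1 encoding), represented here as a list of 32-bit patterns.

```python
def blendps(dst, src, imm8):
    """Bit i of imm8 set: take src[i]; otherwise keep dst[i]."""
    return [s if imm8 & (1 << i) else d for i, (d, s) in enumerate(zip(dst, src))]


def blendvps(dst, src, mask):
    """Bit 31 of mask[i] set: take src[i]; otherwise keep dst[i]."""
    return [s if m & 0x80000000 else d for d, s, m in zip(dst, src, mask)]


d = [1.0, 2.0, 3.0, 4.0]
s = [10.0, 20.0, 30.0, 40.0]
print(blendps(d, s, 0b0101))
print(blendvps(d, s, [0x80000000, 0, 0xFFFFFFFF, 0x7FFFFFFF]))
```

Since comparison instructions produce all-ones or all-zeros masks, their result can be used directly as the mask of a variable blend to implement a branch-free conditional selection.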
| + | |||
| + | The set of extract instructions includes **pextrb**, **pextrw**, **pextrd**, **pextrq** and **extractps**. They take one element of the vector from the XMM register and store it in a CPU register or in memory. The offset of the element is specified with an immediate constant. The behaviour of **extractps** is shown in fig. {{ref> | ||
| + | <figure sse4extractps> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The insert instructions are **pinsrb**, **pinsrd** and **pinsrq**. They operate in an opposite way to extract instructions. They take an element from memory or a general-purpose register and insert it into the XMM register at the position specified with a constant immediate. The behaviour of **pinsrd** is shown in fig. {{ref> | ||
| + | <figure sse4pinsrd> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | |||
| + | The **insertps** instruction is one of the most complex. It inserts a scalar single-precision floating-point value with the position of the vector' | ||
| + | <figure sse4insertps> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of string instructions. There are four string compare instructions (see table {{ref> | ||
| + | <table sse4stringtable> | ||
| + | < | ||
| + | ^ Instruction ^ length ^ type of the result ^ | ||
| + | | **pcmpestri** | explicit | index | | ||
| + | | **pcmpestrm** | explicit | mask | | ||
| + | | **pcmpistri** | implicit | index | | ||
| + | | **pcmpistrm** | implicit | mask | | ||
| + | </ | ||
| + | |||
| + | The third operand, an immediate, encodes the data type, the comparison method and the format of the result. | ||
| + | |||
| + | <table SSE4stringdata> | ||
| + | < | ||
| + | ^ bits 1:0 ^ data type ^ | ||
| + | | 00 | unsigned BYTE | | ||
| + | | 01 | unsigned WORD | | ||
| + | | 10 | signed BYTE | | ||
| + | | 11 | signed WORD | | ||
| + | </ | ||
| + | |||
| + | <table SSE4stringcomparisonmethod> | ||
| + | < | ||
| + | ^ bits 3:2 ^ operation ^ comment ^ | ||
| + | | 00 | Equal Any | find any of the specified characters in the input string | | ||
| + | | 01 | Ranges | check if characters are within the specified ranges | | ||
| + | | 10 | Equal Each | check if the input strings are equal | | ||
| + | | 11 | Equal Ordered | check if the needle string is in the haystack string | | ||
| + | </ | ||
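The two bit fields from the tables above can be decoded with a few lines of Python. This is only an illustrative helper: the name `decode_string_imm8` is hypothetical, and it covers just bits 3:0 of the immediate, exactly as listed in the tables; the remaining bits are ignored here.

```python
DATA_TYPES = {
    0b00: "unsigned BYTE",
    0b01: "unsigned WORD",
    0b10: "signed BYTE",
    0b11: "signed WORD",
}

METHODS = {
    0b00: "Equal Any",
    0b01: "Ranges",
    0b10: "Equal Each",
    0b11: "Equal Ordered",
}


def decode_string_imm8(imm8):
    """Return (data type, comparison method) encoded in bits 3:0 of imm8."""
    return DATA_TYPES[imm8 & 0b11], METHODS[(imm8 >> 2) & 0b11]


print(decode_string_imm8(0x0C))   # unsigned bytes, substring search
```

For example, the value 0x0C selects unsigned bytes with the Equal Ordered method, which is the combination used to search for a substring.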
| + | The SSE4.2 string compare instructions are advanced, powerful means for processing byte or word strings. The detailed explanation of SSE4.2 string instructions behaviour together with illustrations can be found on ((https:// | ||
| + | |||
| + | ===== AVX ===== | ||
| + | AVX is the abbreviation of Advanced Vector Extensions. AVX implements larger, 256-bit YMM registers as extensions of the XMM registers. In 64-bit mode, the number of YMM registers is increased to 16. Many SSE instructions are expanded to handle operations on the new, bigger data types; their mnemonics stay the same apart from an added **v** prefix. The most important improvement in the instruction set of x64 processors is the implementation of RISC-like instructions in which the destination operand can differ from the two source operands. This three-operand SIMD instruction format is encoded with the VEX coding scheme. The AVX2 extension implements more SIMD instructions for operation with 256-bit registers. AVX-512 extends the register size to 512 bits. An interesting, | ||