--- en:multiasm:papc:chapter_6_11 [2026/02/19 10:20] – [SSE4] ktokarz
+++ en:multiasm:papc:chapter_6_11 [2026/02/27 02:40] (current) – jtokarz
@@ Line 1: / Line 1: @@
+====== MMX, SSE and AVX Extensions ======
+At some point in the evolution of personal computers, it became clear that they would be used not only professionally, for example, in companies, financial institutions, and education, but also as centres of home entertainment systems, enabling users to play games, watch videos, and listen to music. This led to equipping processors with the ability to process multimedia data. As stereo sound has the form of a series of samples, and pictures are often represented by a matrix of pixels, each with three colour components, the natural way to improve the performance of multimedia processing is to introduce parallelism. At the processor level, the answer is SIMD - Single Instruction Multiple Data, which allows the execution unit to perform the same operation on many data units at the same time. Speaking more formally, one stream of instructions performs operations on many data streams. The first SIMD instructions introduced in the x86 family that follow this idea are MMX - MultiMedia eXtension.
+===== MMX =====
+The MMX instruction set operates on 64-bit packed data types. Packed means that the 64-bit data can contain 8 bytes, 4 words, or 2 doublewords. Based on this, new data types were defined. Packed data types are also called vectors. Please refer to the section "Integer vector data types" for details. The MMX instructions operate on eight 64-bit registers named MM0 - MM7.
+==== Data transfer ====
+To copy data from memory or between registers, two new data transfer instructions were introduced.
+The **movd** instruction allows copying 32 bits of data between MMX registers and memory or between MMX registers and general-purpose registers of the main processor. The **movq** instruction allows copying 64 bits of data between MMX registers and memory or between two MMX registers. In all MMX instructions except data transfer, the first operand, which is a destination operand, is always an MMX register.
+<note>
+In modern 64-bit processors, the **movq** instruction is extended to copy 64-bit data between MMX registers and general-purpose registers of the main processor.
+</note>
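+The transfers described above can be sketched as follows. This is a minimal illustration in MASM-style syntax, assuming 32-bit code; the label Vector and the constant loaded into EAX are illustrative names and values only.
+<code asm>
+Vector  DW  0001h, 0002h, 0003h, 0004h
+LEA         ESI, Vector
+MOV         EAX, 0AABBCCDDh
+MOVD        mm0, EAX            ; mm0 = 0000 0000 AABB CCDD (upper 32 bits are cleared)
+MOVQ        mm1, [ESI]          ; mm1 = 0004 0003 0002 0001 (64 bits from memory)
+MOVQ        mm2, mm1            ; copy between two MMX registers
+MOVD        EBX, mm1            ; EBX = 00020001h (lower 32 bits of mm1)
+</code>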
+==== Basic vector calculations ====
+The main idea of vector data processing is shown in figure {{ref>mmxprocessing}}. It shows the example of an operation performed with packed word vector data.
+<figure mmxprocessing>
+{{ :en:multiasm:cs:mmxprocessing.png?550 |Illustration of the idea of vector data processing}}
+<caption>The idea of vector data processing</caption>
+</figure>
+When performing arithmetic operations, the main processor stores additional information in flags in the EFLAGS register. The MMX unit does not have flags for each calculated result, so another approach must be used. The key information in arithmetic operations is the carry when adding and the borrow when subtracting. The simplest solution is to ignore the carry: if the maximum value is exceeded, the most significant bits are truncated and the result wraps around to a smaller value. In the case of subtraction, the situation is reversed, and the resulting value will be larger than expected. For multimedia operations, a better solution is to limit the result to a maximum or minimum value. This approach is called saturation and comes in signed and unsigned versions. This means that, for example, when a pixel reaches its highest brightness, it will no longer be brightened. This way, information about brightness differences is lost, but the resulting image looks natural. Let's consider the addition of four word-sized elements in three versions. In figure {{ref>mmxpaddw}}, the packed word addition with wraparound **paddw** is shown. In figure {{ref>mmxpaddsw}}, the packed word addition with signed saturation **paddsw** is presented, and finally, the packed word addition with unsigned saturation **paddusw** is shown in figure {{ref>mmxpaddusw}}.
+<figure mmxpaddw>
+{{ :en:multiasm:cs:mmxpaddw.png?500 |Illustration of packed word addition with wraparound}}
+<caption>The illustration of packed word addition with wraparound</caption>
+</figure>
+<figure mmxpaddsw>
+{{ :en:multiasm:cs:mmxpaddsw.png?450 |Illustration of packed word addition with signed saturation}}
+<caption>The illustration of packed word addition with signed saturation</caption>
+</figure>
+<figure mmxpaddusw>
+{{ :en:multiasm:cs:mmxpaddusw.png?450 |Illustration of packed word addition with unsigned saturation}}
+<caption>The illustration of packed word addition with unsigned saturation</caption>
+</figure>
+The last letter in the instruction specifies the size of arguments and results. MMX addition and subtraction instructions are shown in table {{ref>mmxaddsub}}.
+<table mmxaddsub>
+<caption>MMX addition and subtraction instructions</caption>
+^ Mnemonic ^ operation ^ argument size ^ overflow management ^
+| **paddb** | addition | 8 bytes | wraparound |
+| **paddw** | addition | 4 words | wraparound |
+| **paddd**  | addition | 2 doublewords | wraparound |
+| **paddsb** | addition | 8 bytes | signed saturation |
+| **paddsw** | addition | 4 words | signed saturation |
+| **paddusb** | addition | 8 bytes | unsigned saturation |
+| **paddusw** | addition | 4 words | unsigned saturation |
+| **psubb** | subtraction | 8 bytes | wraparound |
+| **psubw** | subtraction | 4 words | wraparound |
+| **psubd**  | subtraction | 2 doublewords | wraparound |
+| **psubsb** | subtraction | 8 bytes | signed saturation |
+| **psubsw** | subtraction | 4 words | signed saturation |
+| **psubusb** | subtraction | 8 bytes | unsigned saturation |
+| **psubusw** | subtraction | 4 words | unsigned saturation |
+</table>
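+To compare the three overflow strategies on the same data, the following sketch adds two vectors of words with **paddw**, **paddsw** and **paddusw**. The labels VecA and VecB and the chosen values are illustrative assumptions; registers are written from the most significant word to the least significant one.
+<code asm>
+VecA    DW  7FF0h, 0FFF0h, 8000h, 0010h
+VecB    DW  0020h, 0020h, 0FFFFh, 0020h
+LEA         ESI, VecA
+LEA         EDI, VecB
+MOVQ        mm0, [ESI]          ; mm0 = 0010 8000 FFF0 7FF0
+MOVQ        mm1, mm0
+MOVQ        mm2, mm0
+PADDW       mm0, [EDI]          ; wraparound:          mm0 = 0030 7FFF 0010 8010
+PADDSW      mm1, [EDI]          ; signed saturation:   mm1 = 0030 8000 0010 7FFF
+PADDUSW     mm2, [EDI]          ; unsigned saturation: mm2 = 0030 FFFF FFFF 8010
+</code>
+Only the two elements that overflow differ between the versions: 7FF0h + 0020h exceeds the signed range and is clamped to 7FFFh by **paddsw**, while FFF0h + 0020h and 8000h + FFFFh exceed the unsigned range and are clamped to FFFFh by **paddusw**.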
+A multiplication requires twice as much space for the result as the size of the arguments. The solution for this issue is to split the operation into two multiplication instructions, storing the higher and lower halves of the results. For vectors of words, the instructions are **pmulhw** (packed multiply with storing higher halves of the results) and **pmullw** (packed multiply with storing lower halves of the results), respectively. The **pmulhw** instruction treats the words as signed values. Later, the halves can be joined together, forming full results with unpack instructions. To unpack the higher halves of arguments, the **punpckhwd** (unpack from higher halves words to doublewords) instruction can be executed. For lower halves, the **punpcklwd** instruction can be used. The whole algorithm is shown in figure {{ref>mmxmultiplyandunpck}}.
+<figure mmxmultiplyandunpck>
+{{ :en:multiasm:cs:mmxmultiplyandunpck.png?550 |Illustration of packed word multiplication and unpacking results to doublewords}}
+<caption>The illustration of packed word multiplication and unpacking results to doublewords</caption>
+</figure>
+The code that calculates the presented multiplication can look as follows:
+<code asm>
+Numbers DW  01ACh, 2112h, 03F3h, 00A4h,
+            0006h, 0137h, 0AB7h, 00D8h
+LEA         ESI, Numbers
+MOVQ        mm0, [ESI]          ; mm0 = 00A4 03F3 2112 01AC
+MOVQ        mm1, [ESI+8]        ; mm1 = 00D8 0AB7 0137 0006
+MOVQ        mm2, mm0
+PMULLW      mm0, mm1            ; mm0 = 8A60 50B5 2CDE 0A08
+PMULHW      mm1, mm2            ; mm1 = 0000 002A 0028 0000
+MOVQ        mm2, mm0
+PUNPCKLWD   mm0, mm1            ; mm0 = 0028 2CDE 0000 0A08
+PUNPCKHWD   mm2, mm1            ; mm2 = 0000 8A60 002A 50B5
+</code>
+==== Advanced calculations ====
+The MMX set of instructions also contains the **pmaddwd** (multiply and add packed words to doublewords) instruction. It computes the products of the corresponding signed word operands. The four intermediate doubleword products are summed in pairs to produce two doubleword results. Its behaviour is shown in figure {{ref>mmxpmaddw}}. This instruction can also simplify ordinary multiplication: if one word in each pair of one operand is zero, each doubleword result holds the full 32-bit product of the remaining pair of words.
+<figure mmxpmaddw>
+{{ :en:multiasm:cs:mmxpmaddw.png?550 |Illustration of packed word multiplication and sum to doublewords}}
+<caption>The illustration of packed word multiplication and sum to doublewords</caption>
+</figure>
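+A typical use of **pmaddwd** is a dot product of two short vectors. The sketch below is only an illustration under the assumption of 32-bit MASM-style code; the labels VecX and VecY and their values are made up for the example.
+<code asm>
+VecX    DW  1, 2, 3, 4
+VecY    DW  10, 20, 30, 40
+LEA         ESI, VecX
+LEA         EDI, VecY
+MOVQ        mm0, [ESI]
+PMADDWD     mm0, [EDI]          ; mm0 = 000000FA 00000032 (3*30+4*40 = 250, 1*10+2*20 = 50)
+MOVQ        mm1, mm0
+PSRLQ       mm1, 32             ; mm1 = 00000000 000000FA
+PADDD       mm0, mm1            ; lower doubleword of mm0 = 0000012C
+MOVD        EAX, mm0            ; EAX = 300, the dot product
+</code>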
+==== Comparison ====
+The set of comparison instructions allows for comparing values in two vectors. The result is stored as a mask of bits, with all ones at the element of the vector where the comparison result is true, and all zeros in the opposite case. There are six compare instructions as shown in table {{ref>mmxtabcompare}}.
+<table mmxtabcompare>
+<caption>MMX comparison instructions</caption>
+^ Mnemonic ^ comparison type ^ argument size ^
+| **pcmpeqb** | equal | 8 bytes |
+| **pcmpeqw** | equal | 4 words |
+| **pcmpeqd** | equal | 2 doublewords |
+| **pcmpgtb** | greater than | 8 bytes |
+| **pcmpgtw** | greater than | 4 words |
+| **pcmpgtd** | greater than | 2 doublewords |
+</table>
+An example of comparison instruction for equality of two vectors of words is shown in figure {{ref>mmxcompare}}.
+<figure mmxcompare>
+{{ :en:multiasm:cs:mmxcompare.png?450 |Illustration of vector data comparison}}
+<caption>Vector data comparison</caption>
+</figure>
+==== Data conversion ====
+The unpack instructions presented in figure {{ref>mmxmultiplyandunpck}} are not the only ones. Unpack instructions of high-order data elements are **punpckhbw**, **punpckhwd**, **punpckhdq**, and for low-order data elements are **punpcklbw**, **punpcklwd**, **punpckldq**. The figure {{ref>mmxpunpckhbw}} presents unpacking of high-order bytes to words, and figure {{ref>mmxpunpcklbw}} low-order bytes to words.
+<figure mmxpunpckhbw>
+{{ :en:multiasm:cs:mmxpunpkhbw.png?550 |Illustration of unpacking high-order bytes to words}}
+<caption>The illustration of unpacking high-order bytes to words</caption>
+</figure>
+<figure mmxpunpcklbw>
+{{ :en:multiasm:cs:mmxpunpklbw.png?550 |Illustration of unpacking low-order bytes to words}}
+<caption>The illustration of unpacking low-order bytes to words</caption>
+</figure>
+The pack instructions are used to shrink the size of arguments and pack them into smaller data. Only three pack instructions are implemented in MMX extension: **packsswb** - pack words into bytes with signed saturation, **packssdw** - pack doublewords into words with signed saturation, and **packuswb** - pack words into bytes with unsigned saturation. The example of pack instruction is shown in figure {{ref>mmxpack}}.
+<figure mmxpack>
+{{ :en:multiasm:cs:mmxpacksswd.png?550 |Illustration of packing doublewords to words}}
+<caption>The illustration of packing doublewords to words</caption>
+</figure>
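+The effect of saturation while packing can be sketched with **packuswb**. The label Words and its values are illustrative assumptions; the second operand is a zeroed register, so the upper four bytes of the result are zero.
+<code asm>
+Words   DW  0012h, 01FFh, 0FFFFh, 00FFh
+LEA         ESI, Words
+MOVQ        mm0, [ESI]          ; mm0 = 00FF FFFF 01FF 0012
+PXOR        mm1, mm1            ; mm1 = 0
+PACKUSWB    mm0, mm1            ; mm0 = 00 00 00 00 FF 00 FF 12
+</code>
+The word 01FFh is too large for a byte and saturates to FFh, while FFFFh is treated as the signed value -1 and saturates to 00h.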
+==== Shift ====
+Packed shift instructions perform shift operations on elements of the specified size. All elements of the vector are shifted separately. In a logical shift, empty bits are filled with zeros; in an arithmetic shift right, the most significant bit is copied to preserve the sign of the values.
+There are eight shift instructions, as presented in table {{ref>mmxshift}}.
+<table mmxshift>
+<caption>MMX shift instructions</caption>
+^ Mnemonic ^ operation ^ argument size ^ type of shift ^
+| **psllw** | shift left | 4 words | logical |
+| **pslld** | shift left | 2 doublewords | logical |
+| **psllq** | shift left | 1 quadword | logical |
+| **psrlw** | shift right | 4 words | logical |
+| **psrld** | shift right | 2 doublewords | logical |
+| **psrlq** | shift right | 1 quadword | logical |
+| **psraw** | shift right | 4 words | arithmetic |
+| **psrad** | shift right | 2 doublewords | arithmetic |
+</table>
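+The difference between logical and arithmetic shifts is easiest to see on elements with the most significant bit set. In this sketch the label Words and its values are illustrative assumptions.
+<code asm>
+Words   DW  8000h, 0FFF0h, 0010h, 4000h
+LEA         ESI, Words
+MOVQ        mm0, [ESI]          ; mm0 = 4000 0010 FFF0 8000
+MOVQ        mm1, mm0
+MOVQ        mm2, mm0
+PSRLW       mm0, 4              ; mm0 = 0400 0001 0FFF 0800 (zeros shifted in)
+PSRAW       mm1, 4              ; mm1 = 0400 0001 FFFF F800 (sign bit replicated)
+PSLLW       mm2, 1              ; mm2 = 8000 0020 FFE0 0000
+</code>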
+==== Logical ====
+MMX logical instructions operate on the 64-bit data as a whole. They perform bitwise operations as shown in table {{ref>mmxlogical}}.
+<table mmxlogical>
+<caption>MMX logical instructions</caption>
+^ Mnemonic ^ operation ^
+| **pand** | AND |
+| **pandn** | AND NOT |
+| **por** | OR |
+| **pxor** | XOR |
+</table>
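+Logical instructions are often combined with the masks produced by comparison instructions. The sketch below selects the larger of two signed words in each position; it is one possible approach, and the labels VecA and VecB with their values are illustrative assumptions. Note that the AND NOT instruction inverts its first (destination) operand.
+<code asm>
+VecA    DW  0005h, 0FFFEh, 0010h, 0003h
+VecB    DW  0002h, 0001h, 0010h, 0007h
+LEA         ESI, VecA
+LEA         EDI, VecB
+MOVQ        mm0, [ESI]          ; mm0 = 0003 0010 FFFE 0005
+MOVQ        mm1, [EDI]          ; mm1 = 0007 0010 0001 0002
+MOVQ        mm2, mm0
+PCMPGTW     mm2, mm1            ; mm2 = 0000 0000 0000 FFFF (mask: VecA > VecB)
+PAND        mm0, mm2            ; mm0 = 0000 0000 0000 0005 (elements taken from VecA)
+PANDN       mm2, mm1            ; mm2 = 0007 0010 0001 0000 (NOT mask AND VecB)
+POR         mm0, mm2            ; mm0 = 0007 0010 0001 0005 (signed maximum)
+</code>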
+==== Co-existence of FPU and MMX ====
+MMX instructions use the same physical registers as the FPU. As a result, FPU and MMX instructions cannot be freely mixed in the same fragment of code. Switching from MMX back to FPU code requires executing the **emms** instruction, which marks all FPU registers as empty. Fortunately, newer extensions (SSE, AVX) introduce a separate set of registers for improved flexibility.
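+A fragment that moves from MMX to FPU code could look like the sketch below. It assumes that ESI and EDI point to 64-bit vectors and EBX to a single-precision value; these register assignments are illustrative only.
+<code asm>
+MOVQ        mm0, [ESI]          ; MMX part: add two vectors of words
+PADDW       mm0, [EDI]
+MOVQ        [ESI], mm0
+EMMS                            ; leave the MMX state before any FPU instruction
+FLD         DWORD PTR [EBX]     ; FPU part can follow safely
+</code>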
+===== SSE =====
+The SSE is a large group of instructions that extend SIMD processing to floating-point calculations and increase the size and number of the registers. The abbreviation SSE comes from the name Streaming SIMD Extensions. As the number of instructions introduced in all SSE versions exceeds a few hundred, in this section, we will present the general overview of each SSE version and detailed information on some chosen interesting instructions.
+The first group of SSE instructions defines a new vector data type containing four single-precision floating-point numbers. Four 32-bit values require 128-bit registers. These new registers are named XMM0 - XMM7 (sixteen registers, XMM0 - XMM15, are available in 64-bit mode) and are separate from any previously implemented registers, so SSE floating-point operations do not conflict with MMX and FPU.
+==== Data transfer ====
+In modern processors, it is very important to transfer data from and to memory effectively. The memory management unit can perform data transfer much faster if the data is aligned to a specific address. For 128-bit SSE operands, aligned means that the address is evenly divisible by 16. In the SSE extension, two versions of packed data transfer instructions were implemented. The **movups** copies packed single-precision data from any address, while the **movaps** requires an aligned address. The **movss** moves a scalar single-precision value, which doesn't have to be aligned. It is also possible to copy data between the upper half of the XMM register and memory with the **movhps** instruction, between the lower half of the XMM register and memory with the **movlps**, and from the higher to lower half or from the lower to higher half of the XMM registers with the **movhlps** and **movlhps**, respectively. The **movmskps** instruction copies the most significant bits of the single-precision floating-point values to a general-purpose register. It allows us to make a bit mask based on the sign bits of the elements of the vector.
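+The data transfer instructions can be sketched as follows. The labels VecA and Buffer are illustrative assumptions, and the ALIGN directive is assumed to place VecA at an address divisible by 16; registers are written from the highest element to the lowest.
+<code asm>
+ALIGN 16
+VecA    REAL4   1.0, 2.0, 3.0, 4.0
+Buffer  DB      20 DUP (0)
+LEA         ESI, VecA
+LEA         EDI, Buffer
+MOVAPS      xmm0, [ESI]         ; aligned load: xmm0 = 4.0 3.0 2.0 1.0
+MOVUPS      [EDI+1], xmm0       ; store to an unaligned address
+MOVSS       xmm1, [ESI+4]       ; xmm1 = 0.0 0.0 0.0 2.0 (upper elements cleared)
+MOVHLPS     xmm2, xmm0          ; lower half of xmm2 = upper half of xmm0 (4.0 3.0)
+MOVMSKPS    EAX, xmm0           ; EAX = 0, no element is negative
+</code>
+Using **movaps** with the address [EDI+1] instead would raise an exception, because that address is not divisible by 16.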
+==== Calculations ====
+The SSE implements vector and scalar calculations on single-precision floating-point numbers. No prefix was added to the names of instructions operating on floating-point numbers, but the mnemonic suffix describes the type: PS (packed single) for operations on vectors and SS (scalar single) for operations on scalars. If the instructions operate on halves of XMM registers (i.e. refer to either bits 0..63 or bits 64..127), the instruction mnemonics contain the letter L or H.
+The idea of vector and scalar operations is shown in figure {{ref>sse1vector}} and figure {{ref>sse1scalar}}, respectively.
+<figure sse1vector>
+{{ :en:multiasm:cs:sse1vector.png?550 |Illustration of the idea of SSE vector data processing}}
+<caption>The idea of vector data processing in SSE</caption>
+</figure>
+<figure sse1scalar>
+{{ :en:multiasm:cs:sse1scalar.png?550 |Illustration of the idea of SSE scalar data processing}}
+<caption>The idea of scalar data processing in SSE</caption>
+</figure>
+In the SSE extension, mathematical calculations on single-precision floating-point numbers are implemented in both vector (packed) and scalar versions. These instructions are summarised in table {{ref>ssemath}}.
+<table ssemath>
+<caption>SSE math calculations instructions</caption>
+^ Mnemonic ^ operation ^ argument type ^
+| **addps** | addition | vector |
+| **addss** | addition | scalar |
+| **subps** | subtraction | vector |
+| **subss** | subtraction | scalar |
+| **mulps** | multiplication | vector |
+| **mulss** | multiplication | scalar |
+| **divps** | division | vector |
+| **divss** | division | scalar |
+| **rcpps** | reciprocal | vector |
+| **rcpss** | reciprocal | scalar |
+| **sqrtps** | square root | vector |
+| **sqrtss** | square root | scalar |
+| **rsqrtps** | reciprocal of square root | vector |
+| **rsqrtss** | reciprocal of square root | scalar |
+| **maxps** | maximum (bigger) value | vector |
+| **maxss** | maximum (bigger) value | scalar |
+| **minps** | minimum (smaller) value | vector |
+| **minss** | minimum (smaller) value | scalar |
+</table>
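+The difference between the packed and scalar versions can be seen by applying **addps** and **addss** to the same data. The labels VecA and VecB are illustrative assumptions, and both vectors are assumed to be aligned to 16 bytes.
+<code asm>
+ALIGN 16
+VecA    REAL4   1.0, 2.0, 3.0, 4.0
+VecB    REAL4   10.0, 20.0, 30.0, 40.0
+LEA         ESI, VecA
+LEA         EDI, VecB
+MOVAPS      xmm0, [ESI]         ; xmm0 = 4.0 3.0 2.0 1.0
+MOVAPS      xmm1, xmm0
+ADDPS       xmm0, [EDI]         ; vector: xmm0 = 44.0 33.0 22.0 11.0
+ADDSS       xmm1, [EDI]         ; scalar: xmm1 = 4.0 3.0 2.0 11.0
+</code>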
+==== Comparison ====
+Besides math calculations, there are instructions for comparing vector **cmpps** and scalar **cmpss** values. As a result, we obtain the all-ones or all-zeros fields as in MMX. The condition of comparison is encoded as the third, 8-bit immediate argument. Assemblers usually implement a set of pseudoinstructions which automatically choose the constant value. The scalar version of these pseudoinstructions is presented in table {{ref>ssecompare}}.
+<table ssecompare>
+<caption>SSE scalar comparison pseudoinstructions</caption>
+^ Pseudoinstruction ^ operation ^ instruction ^
+| **cmpeqss** xmm1, xmm2 | equal | cmpss xmm1, xmm2, 0 |
+| **cmpltss** xmm1, xmm2 | less than | cmpss xmm1, xmm2, 1 |
+| **cmpless** xmm1, xmm2 | less or equal | cmpss xmm1, xmm2, 2 |
+| **cmpunordss** xmm1, xmm2 | unordered | cmpss xmm1, xmm2, 3 |
+| **cmpneqss** xmm1, xmm2 | not equal | cmpss xmm1, xmm2, 4 |
+| **cmpnltss** xmm1, xmm2 | not less than | cmpss xmm1, xmm2, 5 |
+| **cmpnless** xmm1, xmm2 | not less or equal | cmpss xmm1, xmm2, 6 |
+| **cmpordss** xmm1, xmm2 | ordered | cmpss xmm1, xmm2, 7 |
+</table>
+With the use of the **comiss** instruction, it is possible to compare scalars and set the flags in the EFLAGS register directly according to the result of the comparison.
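+Both ways of using comparison results can be sketched as below: a packed comparison converted into a bit mask, and a scalar comparison followed by a conditional jump. The labels VecA, Limit and IsLess are illustrative assumptions, as is the use of the packed pseudoinstruction **cmpltps**.
+<code asm>
+ALIGN 16
+VecA    REAL4   1.0, 5.0, 3.0, 8.0
+Limit   REAL4   4.0, 4.0, 4.0, 4.0
+LEA         ESI, VecA
+LEA         EDI, Limit
+MOVAPS      xmm0, [ESI]
+MOVAPS      xmm1, xmm0
+CMPLTPS     xmm1, [EDI]         ; all ones where the element is less than 4.0
+MOVMSKPS    EAX, xmm1           ; EAX = 0101b, elements 0 and 2 are less than 4.0
+COMISS      xmm0, [EDI]         ; compares 1.0 with 4.0 and sets CF = 1 (less)
+JB          IsLess              ; the jump is taken
+; code for the opposite case
+IsLess:
+</code>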
+==== Logical instructions ====
+There are four logical instructions which operate on all 128 bits of the XMM register. These are **andps**, **andnps**, **orps** and **xorps**. They perform bitwise AND, AND NOT, OR and XOR, respectively.
+==== Data shuffle ====
+Two unpack instructions, **unpcklps** and **unpckhps**, operate similarly to unpack instructions known already from MMX. Because the source and destination data are packed single-precision floating-point values, unlike in MMX, these instructions are not used to form longer data types, but change the positions of two elements from two vectors. It is presented in figure {{ref>sseunpack}}.
+<figure sseunpack>
+{{ :en:multiasm:cs:sseunpack.png?550 |Illustration of SSE unpacking single-precision floating-point values}}
+<caption>The illustration of SSE unpacking single-precision floating-point values</caption>
+</figure>
+A more universal instruction is **shufps**. It selects two of the four single-precision values from the destination (first) argument and writes them to the lower half of the result, while the upper half is filled with two values selected from the source (second) argument. The result replaces the content of the destination register. Which values are taken is determined by the third, 8-bit immediate argument. Each two-bit field of the immediate holds the index of the selected packed single value: 11 selects X3 or Y3, 10 selects X2 or Y2, 01 selects X1 or Y1, and 00 selects X0 or Y0. It is presented in figure {{ref>sseshuffle}}.
+<figure sseshuffle>
+{{ :en:multiasm:cs:sseshuffle.png?550 |Illustration of SSE shuffle single-precision floating-point values}}
+<caption>The illustration of SSE shuffle single-precision floating-point values</caption>
+</figure>
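+A few commonly used immediates of **shufps** are sketched below. The two lowest fields of the immediate select elements from the first operand, and the two highest fields select elements from the second operand. The labels VecX and VecY are illustrative assumptions; registers are written from the highest element to the lowest.
+<code asm>
+ALIGN 16
+VecX    REAL4   1.0, 2.0, 3.0, 4.0
+VecY    REAL4   5.0, 6.0, 7.0, 8.0
+LEA         ESI, VecX
+MOVAPS      xmm0, [ESI]         ; xmm0 = 4.0 3.0 2.0 1.0
+MOVAPS      xmm1, [ESI+16]      ; xmm1 = 8.0 7.0 6.0 5.0
+MOVAPS      xmm2, xmm0
+MOVAPS      xmm3, xmm0
+SHUFPS      xmm0, xmm0, 00h     ; broadcast element 0: xmm0 = 1.0 1.0 1.0 1.0
+SHUFPS      xmm2, xmm2, 1Bh     ; reverse the order:   xmm2 = 1.0 2.0 3.0 4.0
+SHUFPS      xmm3, xmm1, 44h     ; two vectors:         xmm3 = 6.0 5.0 2.0 1.0
+</code>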
+==== Other instructions ====
+Together with new data registers, an additional control register appeared in the processor. It is named MXCSR, and the meaning of its flags is similar to that of the FPU control register. New instructions are implemented, one to save MXCSR to memory, **stmxcsr**, and one to restore this register from memory, **ldmxcsr**. The instruction **fxsave** stores the state of both the x87 unit and the SSE extension, and **fxrstor** restores the state of both the x87 unit and the SSE extension.
+Some additional MMX instructions were introduced together with the SSE extension.
+In the SSE, the first set of data conversion instructions was implemented. The summary of all these instructions will be presented in the following chapter. Also, cache supporting instructions were added. These instructions will be described in the chapter on optimisation.
+===== SSE2 =====
+The SSE2 set of instructions implements integer vector operations using the XMM registers. In general, the same instruction mnemonics defined for MMX can be used with XMM registers, offering twice as long vectors. Additionally, the floating-point calculations are complemented with vector operations on the double-precision data type. In XMM registers, vectors of two double-precision values can be processed. SSE2 uses the same software environment (eight 128-bit XMM registers) as SSE.
+In the SSE2 extension, the denormals-are-zeros mode was introduced. In this mode, the processor automatically converts all denormalised floating-point arguments to zero. In such a case, a flag indicating the denormalised argument is not set, and an exception is not raised. The denormals-are-zeros mode is not compliant with IEEE Standard 754, but it allows implementation of faster algorithms for advanced audio and video processing.
+The arithmetic instructions are similar to SSE, but they possess the suffix of **pd** - packed double or **sd** - scalar double instead of **ps** and **ss**, respectively.
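+A short sketch of both SSE2 novelties, integer vectors in XMM registers and double-precision calculations, is shown below. The labels WordsA, WordsB and Doubles are illustrative assumptions, and the data is assumed to be aligned to 16 bytes.
+<code asm>
+ALIGN 16
+WordsA  DW      1, 2, 3, 4, 5, 6, 7, 8
+WordsB  DW      10, 10, 10, 10, 10, 10, 10, 10
+Doubles REAL8   2.0, 8.0
+LEA         ESI, WordsA
+MOVDQA      xmm0, [ESI]         ; eight words in one register
+PADDW       xmm0, [ESI+16]      ; eight additions at once: 11, 12, ..., 18
+MOVAPD      xmm1, [ESI+32]      ; xmm1 = 8.0 2.0
+MULPD       xmm1, xmm1          ; packed double: xmm1 = 64.0 4.0
+SQRTSD      xmm1, xmm1          ; scalar double: xmm1 = 64.0 2.0
+</code>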
+==== Conversion ====
+In figure {{ref>sse2conversions}} we present the type conversion instructions. They enable conversion between integer and floating-point data of various sizes and in different registers. The green arrows and instruction nodes represent the conversion from single-precision floating-point to integers, pink represents the conversion from double-precision floating-point to integers, blue represents the conversion from integers to floating points, and orange represents the conversion between single and double precision floating-point.
+<figure sse2conversions>
+{{ :en:multiasm:cs:sse2conversions.png?550 |Illustration of a variety of data type conversion instructions}}
+<caption>The illustration of a variety of data type conversion instructions</caption>
+</figure>
+===== SSE3 =====
+The SSE3 is a set of 13 instructions. The main innovation in SSE3 is the implementation of horizontal instructions. These instructions perform calculations on the elements of a vector within the same register. There are four such instructions. The **haddpd** performs horizontal addition of double-precision values, the **hsubpd** is a horizontal subtraction of double-precision values, the **haddps** is a horizontal addition of single-precision values, and the **hsubps** is a horizontal subtraction of single-precision values.
+All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the **hsubpd** instruction is shown in figure {{ref>sse3hsubpd}}.
+<figure sse3hsubpd>
+{{ :en:multiasm:cs:sse3hsubpd.png?550 |Illustration of a horizontal subtraction instruction}}
+<caption>The illustration of a horizontal subtraction instruction</caption>
+</figure>
+When there are more than two elements in the source vectors, as in the **hsubps** instruction, it is also important to know the order of the elements in the resulting vector. Please look at figure {{ref>sse3hsubps}}.
+<figure sse3hsubps>
+{{ :en:multiasm:cs:sse3hsubps.png?550 |Illustration of a horizontal single precision subtraction instruction}}
+<caption>The illustration of a horizontal single precision subtraction instruction</caption>
+</figure>
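+A common use of horizontal addition is summing all elements of one vector, which can be sketched with two **haddps** instructions. The label Vec is an illustrative assumption; registers are written from the highest element to the lowest.
+<code asm>
+ALIGN 16
+Vec     REAL4   1.0, 2.0, 3.0, 4.0
+LEA         ESI, Vec
+MOVAPS      xmm0, [ESI]         ; xmm0 = 4.0 3.0 2.0 1.0
+MOVAPS      xmm1, xmm0
+HADDPS      xmm0, xmm0          ; xmm0 = 7.0 3.0 7.0 3.0
+HADDPS      xmm0, xmm0          ; xmm0 = 10.0 10.0 10.0 10.0 (sum of all elements)
+HSUBPS      xmm1, xmm1          ; xmm1 = -1.0 -1.0 -1.0 -1.0 (1.0-2.0 and 3.0-4.0)
+</code>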
+===== SSSE3 =====
+This abbreviation stands for Supplemental Streaming SIMD Extensions 3. It is a set of 16 instructions introduced in the Core 2 architecture. It implements integer horizontal operations on XMM registers. The principles are the same as in horizontal instructions in SSE3, but instructions can process vectors of doublewords or words. They are summarised in table {{ref>SSSE3horizontaltable}}.
+<table SSSE3horizontaltable>
+<caption>SSSE3 horizontal integer instructions</caption>
+^ Instruction  ^ operation              ^ data                  ^
+| **phaddd**   | addition               | doublewords (wraparound)  |
+| **phaddw**   | addition               | words (wraparound)        |
+| **phaddsw**  | saturated addition     | signed words              |
+| **phsubd**   | subtraction            | doublewords (wraparound)  |
+| **phsubw**   | subtraction            | words (wraparound)        |
+| **phsubsw**  | saturated subtraction  | signed words              |
+</table>
+Two data shuffle instructions are worth mentioning. The **pshufb** instruction makes copies of bytes from the first 128-bit operand based on the control information taken from the second 128-bit operand. Each byte in the control operand determines the resulting byte in the respective position.
+  * bit 7 is 1 - byte is cleared
+  * bit 7 is 0 - byte contains a copy of the source byte
+  * bits 0-3 - the index of the source byte to be copied
+The illustration is shown in figure {{ref>sse3pshufb}}.
+<figure sse3pshufb>
+{{ :en:multiasm:cs:sse3pshufb.png?600 |Illustration of a byte shuffle instruction}}
+<caption>The illustration of a byte shuffle instruction</caption>
+</figure>
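+As an illustration of the control byte rules listed above, the sketch below reverses the order of fifteen bytes and clears the last one. The labels SrcBytes and CtrlBytes are illustrative assumptions, and the data is assumed to be aligned to 16 bytes.
+<code asm>
+ALIGN 16
+SrcBytes    DB  "0123456789ABCDEF"
+CtrlBytes   DB  15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 80h
+LEA         ESI, SrcBytes
+LEA         EDI, CtrlBytes
+MOVDQA      xmm0, [ESI]
+PSHUFB      xmm0, [EDI]         ; bytes 0..14 = "FEDCBA987654321", byte 15 = 0
+</code>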
+The **palignr** instruction combines bytes from two source operands as shown in figure {{ref>sse3palignr}}. The position of the byte split is specified by the third, immediate operand. In the figure, the immediate is equal to 2.
+<figure sse3palignr>
+{{ :en:multiasm:cs:sse3palignr.png?600 |Illustration of an aligned byte combine instruction}}
+<caption>The illustration of an aligned byte combine instruction</caption>
+</figure>
+===== SSE4 =====
+The SSE4 is composed of SSE4.1 and SSE4.2. These groups include instructions supplementing previous extensions. For example, there are eight instructions which expand support for packed integer minimum and maximum determination, or twelve instructions which improve packed integer format conversions with sign extension and zero extension.
+The **dpps** and **dppd** instructions calculate the dot product of four single-precision and two double-precision operands, respectively. The third, immediate operand selects which elements take part in the multiplication and to which elements of the destination the sum is written. The example showing the **dppd** is presented in figure {{ref>sse4dotproduct}}.
+<figure sse4dotproduct>
+{{ :en:multiasm:cs:sse4dotproduct.png?600 |Illustration of a dot product calculation instruction}}
+<caption>The illustration of a dot product calculation instruction</caption>
+</figure>
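+The role of the immediate operand can be sketched with **dpps**. Its upper four bits choose the elements that are multiplied, and its lower four bits choose the elements of the destination that receive the sum, while the remaining ones are cleared. The labels VecX and VecY are illustrative assumptions.
+<code asm>
+ALIGN 16
+VecX    REAL4   1.0, 2.0, 3.0, 4.0
+VecY    REAL4   10.0, 20.0, 30.0, 40.0
+LEA         ESI, VecX
+MOVAPS      xmm0, [ESI]
+MOVAPS      xmm1, [ESI+16]
+MOVAPS      xmm2, xmm0
+DPPS        xmm0, xmm1, 0F1h    ; all four products, xmm0 = 0.0 0.0 0.0 300.0
+DPPS        xmm2, xmm1, 71h     ; three lowest products, xmm2 = 0.0 0.0 0.0 140.0
+</code>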
+There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. The type of the data is specified with the suffix of the mnemonic: b - bytes, w - words, d - doublewords, q - quadwords, ps - single precision and pd - double precision elements. Although these instructions behave the same for the integer and floating-point data elements, formally, those operating with integers begin with the letter "P". A few examples are shown in the following figures.
+The blending instructions copy elements of vectors, mixing two sources into the destination. The **blendps**, **blendpd** and **pblendw** conditionally copy elements from vector X or Y. The mask is specified as the third, immediate value. The behaviour of **blendpd** is shown in figure {{ref>sse4blendpd}}.
+<figure sse4blendpd>
+{{ :en:multiasm:cs:sse4blendpd.png?400 |Illustration of an example of packed blending instruction}}
+<caption>The illustration of an example of packed blending instruction</caption>
+</figure>
+The instructions **blendvps**, **blendvpd** and **pblendvb** operate in a similar way, but the condition is specified as the sign bit of the corresponding elements of the third, implied argument stored in XMM0. The behaviour of **blendvpd** is shown in figure {{ref>sse4blendvpd}}.
+<figure sse4blendvpd>
+{{ :en:multiasm:cs:sse4blendvpd.png?400 |Illustration of an example of packed blending instruction}}
+<caption>The illustration of an example of packed blending instruction</caption>
+</figure>
+The set of extract instructions includes **pextrb**, **pextrw**, **pextrd**, **pextrq** and **extractps**. They take one element of the vector from the XMM register and store it in a general-purpose register or in memory. The index of the element is specified with an immediate constant. The behaviour of **extractps** is shown in figure {{ref>sse4extractps}}.
+<figure sse4extractps>
+{{ :en:multiasm:cs:sse4extractps.png?400 |Illustration of an example of extract instruction}}
+<caption>The illustration of an example of extract instruction</caption>
+</figure>
+The insert instructions are **pinsrb**, **pinsrd** and **pinsrq**. They operate in the opposite way to the extract instructions. They take an element from memory or a general-purpose register and insert it into the XMM register at the position specified with a constant immediate. The behaviour of **pinsrd** is shown in figure {{ref>sse4pinsrd}}.
+<figure sse4pinsrd>
+{{ :en:multiasm:cs:sse4pinsrd.png?400 |Illustration of an example of an insert instruction}}
+<caption>The illustration of an example of an insert instruction</caption>
+</figure>
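+Insert and extract instructions used together can be sketched as follows. The constant and the element index are illustrative; the register is written from the highest doubleword to the lowest.
+<code asm>
+PXOR        xmm0, xmm0          ; xmm0 = 0
+MOV         EAX, 11111111h
+PINSRD      xmm0, EAX, 2        ; xmm0 = 00000000 11111111 00000000 00000000
+PEXTRD      EBX, xmm0, 2        ; EBX = 11111111h
+PEXTRD      ECX, xmm0, 0        ; ECX = 0
+</code>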
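+The **insertps** instruction is one of the most complex. It inserts a scalar single-precision floating-point value, with the position of the vector's element in the source and destination controlled by an 8-bit immediate. Bits 7:6 of the immediate select the source element, bits 5:4 select the destination element, and bits 3:0 form a mask of destination elements to be cleared. The example showing the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate contains the bit value of 10011000b, so element 2 of the source is written to element 1 of the destination, and element 3 of the destination is cleared.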
+<figure sse4insertps>
+{{ :en:multiasm:cs:sse4insertps.png?500 |Illustration of an example of an advanced shuffle instruction}}
+<caption>The illustration of an example of an advanced shuffle instruction</caption>
+</figure>
+In SSE4.2, a set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with the bigger XMM registers than with the string instructions and registers of the main processor. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the EAX (RAX) register, and the length of the second operand is specified with the EDX (RDX) register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last matching element is returned in the ECX register. Mask means that the result is returned in XMM0, either as a bit mask (one bit for each element compared) or as a mask of the size of the elements (similarly to MMX compare).
+<table sse4stringtable>
+<caption>SSE4.2 string compare instructions</caption>
+^ Instruction ^ length ^ type of the result ^
+| **pcmpestri** | explicit | index |
+| **pcmpestrm** | explicit | mask |
+| **pcmpistri** | implicit | index |
+| **pcmpistrm** | implicit | mask |
+</table>
+The third, immediate operand encodes the comparison method and result encoding.
+<table SSE4stringdata>
+<caption>SSE4.2 string compare input data</caption>
+^ bits 1:0 ^ data type ^
+| 00 | unsigned BYTE |
+| 01 | unsigned WORD |
+| 10 | signed BYTE |
+| 11 | signed WORD |
+</table>
+<table SSE4stringcomparisonmethod>
+<caption>SSE4.2 string compare method encoding</caption>
+^ bits 3:2 ^ operation ^ comment ^
+| 00 | Equal Any | find any of the specified characters in the input string |
+| 01 | Ranges | check if characters are within the specified ranges |
+| 10 | Equal Each | check if the input strings are equal |
+| 11 | Equal Ordered | check if the needle string is in the haystack string |
+</table>
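+As a sketch of the Equal Any method, the fragment below looks for the first vowel in a sixteen-character text with **pcmpistri**. The immediate 00h selects unsigned bytes and Equal Any, with the remaining bits left at zero, which is assumed here to return the lowest matching index. The labels Vowels and Phrase are illustrative assumptions.
+<code asm>
+Vowels  DB  "aeiou", 11 DUP (0)
+Phrase  DB  "rhythm and blues"
+LEA         ESI, Vowels
+LEA         EDI, Phrase
+MOVDQU      xmm1, [ESI]         ; set of characters to look for (null-terminated)
+MOVDQU      xmm2, [EDI]         ; sixteen characters of the searched text
+PCMPISTRI   xmm1, xmm2, 00h     ; unsigned bytes, Equal Any
+                                ; ECX = 7, the index of the letter "a"
+                                ; ECX = 16 would mean that no vowel was found
+</code>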
+The SSE4.2 string compare instructions are advanced, powerful means for processing byte or word strings. The detailed explanation of SSE4.2 string instructions behaviour together with illustrations can be found on ((https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpestri)).
+===== AVX =====
+AVX is the abbreviation of Advanced Vector Extensions. The AVX implements larger 256-bit YMM registers as extensions of XMM. In 64-bit mode, the number of YMM registers is increased to 16. Many SSE instructions are expanded to handle operations with the new, bigger data types; their AVX versions keep the original mnemonics preceded by the letter **v**. The most important improvement in the instruction set of x64 processors is the implementation of RISC-like instructions in which the destination operand can differ from the two source operands. This three-operand SIMD instruction format is encoded with the VEX coding scheme. The AVX2 extension implements more SIMD instructions for operation with 256-bit registers. The AVX-512 extends the register size to 512 bits. An interesting, comprehensive description of a variety of x64 AVX instructions is available on the website ((https://www.officedaytime.com/simd512e/)).
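+The three-operand form can be sketched as below. It assumes that ESI and EDI point to vectors of eight single-precision values and EBX to a buffer for the result; these register assignments are illustrative only.
+<code asm>
+VMOVUPS     ymm1, [ESI]         ; eight single-precision values
+VMOVUPS     ymm2, [EDI]
+VADDPS      ymm0, ymm1, ymm2    ; ymm0 = ymm1 + ymm2, both sources stay unchanged
+VMOVUPS     [EBX], ymm0
+VZEROUPPER                      ; clear upper halves of YMM registers before SSE code
+</code>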