CS221

Introduction to MMX

 

MMX™ was introduced by Intel in 1997 amid much fanfare and commercials featuring bunny people.  MMX promised to increase the speed of our games and multimedia programs.  Unfortunately for Intel, some companies were slow to adopt MMX, and the shift from 2D to 3D games meant that MMX was less applicable than expected.  As a result, MMX did not become the major selling point that Intel had hoped for.  Nevertheless, it is today an important feature in many applications.  What is MMX?  MMX technology consists of a number of machine instructions, added to the basic Intel instruction set, that operate on data in parallel.  This technology is typically used to speed up common routines used in multimedia or mathematical calculations.

 

For MMX to work, the software in question must contain some inherent parallelism.  A wide range of software applications, including graphics, MPEG video, music synthesis, speech compression and recognition, image processing, games, and video conferencing show many common, fundamental characteristics that lend themselves well to parallelism:

  • small integer data types (for example: 8-bit pixels, 16-bit audio samples) 
  • small, highly repetitive loops 
  • frequent multiplies and accumulates 
  • compute-intensive algorithms 
  • highly parallel operations 

 

Before diving into how MMX works, let’s examine Flynn’s taxonomy of programming models.  Flynn identified four categories of programming:

 

1. Single Instruction stream, Single Data stream (SISD)

2. Multiple Instruction stream, Single Data stream (MISD)

3. Single Instruction stream, Multiple Data stream (SIMD)

4. Multiple Instruction stream, Multiple Data stream (MIMD)

 

The SISD model is the one you have been using in most of your programming courses to date.  There is one stream of instructions and one stream of data that these instructions operate on.  The processor performs one instruction at a time on one data item at a time.

 

SISD

 

The MISD model is rarely used (but might have potential for problems such as classification).  In the MISD model, there are many instructions that are applied to a single data stream.  This means that we apply many different instructions to the same piece of data. 

 

MISD

 

The SIMD model is the one that MMX operates under.  There is a single stream of instructions, but each instruction can operate in parallel on multiple pieces of data.  The processors operate synchronously: at each step, all processors execute the same instruction on a different data element.   SIMD computers are much more versatile than MISD computers. Numerous problems covering a wide variety of applications can be solved by parallel algorithms on SIMD computers. Another interesting feature is that algorithms for these computers are relatively easy to design, analyze and implement. On the downside, SIMD computers can only tackle problems that can be subdivided into a set of identical subproblems, all of which are then solved simultaneously by the same set of instructions. There are many computations that do not fit this pattern: such problems are typically subdivided into subproblems that are not necessarily identical, and are solved using MIMD computers.

 

SIMD

 

The last model, MIMD, uses multiple processors with multiple data streams.  Each processor has its own independent data stream.  Each processor operates under the control of an instruction stream issued by its control unit: therefore the processors are potentially all executing different programs on different data while solving different subproblems of a single problem. This means that the processors usually operate asynchronously.  The MIMD model of parallel computation is the most general and powerful: computers in this class are used to solve in parallel those problems that lack the regular structure required by the SIMD model. On the downside, asynchronous algorithms are difficult to design, analyze and implement.

 

MIMD

 

Now that we have covered the basic programming models, let’s return to MMX.  MMX operates under the SIMD model, so it works best on programs that apply the same operation to multiple pieces of data.

 

The highlights of MMX are:

 

  • 57 new instructions 
  • Eight 64-bit wide MMX registers 
  • Four new data types 

 

The MMX registers are each 64 bits wide and are named MM0 through MM7.  These registers are not separate hardware: they are aliased onto the floating point registers, so MMX and floating point instructions cannot be freely interleaved.  After a block of MMX code, the EMMS instruction must be executed before any floating point instruction is used.  As an advantage though, the floating point instructions that save and restore the floating point registers also save and restore the MMX registers.

 

The four MMX technology data types are: 

 

  • Packed byte - 8 bytes packed into one 64-bit quantity 
  • Packed word - 4 16-bit words packed into one 64-bit quantity 
  • Packed doubleword – 2 32-bit double words packed into one 64-bit quantity 
  • Quadword - one 64-bit quantity.  The Qword data type is used as a typecast.

As an example, graphics pixel data are generally represented in 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register; when an MMX instruction executes, it takes all eight of the pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register. The degree of parallelism that can be achieved with the MMX technology depends on the size of data, ranging from 8 when using 8-bit data to 1, i.e. no parallelism, when using 64-bit data.

 

 

We are now ready to describe some of the MMX instructions.  In general, each MMX instruction is of the format:

 

            OPCODE  dest-operand,   src-operand

 

The classes of instructions cover:

 

  • Data transfer instructions for MMX register-to-register transfers, or 64-bit and 32-bit load/store to memory 
  • Basic arithmetic operations such as add, subtract, multiply, arithmetic shift and multiply-add 
  • Comparison operations 
  • Conversion instructions to convert between the new data types: pack data together, and unpack from small to larger data types 
  • Logical and shift operations such as AND, AND NOT, OR, XOR, and SHIFT
  • State management instruction to handle MMX to floating point transitions

 

We won’t cover all of these, but will look at some of them.

To start, we need some way to transfer data into an MMX register.  We do this with the MOVQ or MOVD instructions:


MOVD:   Copies 32 bits from the source operand to the destination operand.   Each operand can be an MMX register, a 32-bit register, or a memory location.   When the destination is an MMX register, the source is copied into the low-order 32 bits and the high-order 32 bits are filled with zeros.  When the source is an MMX register, only the low-order 32 bits are copied.

 

Examples:

 

            movd  mm0, eax                                   ; Copy eax to low order bits of mm0

            movd eax, mm0                                    ; Copy low order bits of mm0 to eax

            movd mm0, dword ptr myvar                ; Copy 4 bytes from myvar to mm0

            movd mm0, dword ptr [ebx]                 ; Use indirect addressing to copy to mm0

 

MOVQ:  Copies 64 bits from the source operand to the destination operand.  The destination and source operands can be either MMX registers or 64-bit memory operands, but MOVQ cannot transfer data from memory to memory.

 

Examples:

           

            movq mm0, qword ptr myvar                ; Copy 8 bytes from myvar to mm0

            movq mm0, mm1                                  ; Copy mm1 to mm0

            movq qword ptr myvar, mm0                ; Copy 8 bytes to memory at myvar

            movq mm0, qword ptr [ebx]                 ; Indirect addressing

 

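Note that the x86 is little-endian: the byte at the lowest memory address is loaded into the least significant byte of the register.  If the register contents are written out as a single 64-bit number, the bytes therefore appear in the reverse of their order in memory.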
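The behavior of the two move instructions can be modeled in ordinary C. This is only a sketch for reasoning about where the bytes end up: the function names movd_to_mmx, movd_from_mmx and movq_load are made up for illustration and are not part of any Intel library.

```c
#include <stdint.h>

/* Hypothetical model of "movd mm, r32": the 32-bit source lands in the
   low half of the 64-bit register and the high half is cleared. */
uint64_t movd_to_mmx(uint32_t src)
{
    return (uint64_t)src;
}

/* Hypothetical model of "movd r32, mm": only the low-order 32 bits
   of the MMX register are copied out. */
uint32_t movd_from_mmx(uint64_t mm)
{
    return (uint32_t)(mm & 0xFFFFFFFFu);
}

/* Hypothetical model of "movq mm, m64" on x86: the byte at the lowest
   address becomes the least significant byte of the register. */
uint64_t movq_load(const uint8_t mem[8])
{
    uint64_t mm = 0;
    for (int i = 0; i < 8; i++)
        mm |= (uint64_t)mem[i] << (8 * i);
    return mm;
}
```

Loading the bytes 1,2,3,4,5,6,7,8 with movq_load gives the 64-bit value 0807060504030201h, which is why the bytes look reversed when the register is viewed as one number.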

 

Next, let’s examine some of the arithmetic instructions.  These are the instructions that will actually give us speedup through the use of SIMD processing.  The MMX technology supports both saturating and wraparound modes. In wraparound mode, results that overflow or underflow are truncated and only the lower (least significant) bits of the result are returned. This is the way normal arithmetic works in the computer.  In saturation mode, results of an operation that overflow or underflow are clipped (saturated) to a data-range limit for the data type. A result that exceeds the range of a data type saturates to the maximum value of the range, while a result that is less than the range of a data type saturates to the minimum value of the range. This method of handling overflow and underflow is useful in many applications, such as color calculations.  For example, for an unsigned byte the saturation range is 0 to 255.  Any result below zero is clipped to 0, and any result above 255 is clipped to 255.
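To make the difference between the two modes concrete, here is a minimal C sketch of the arithmetic performed on a single element. The function names are invented for this handout; the MMX instructions apply the same rule to every element of a packed operand at once.

```c
#include <stdint.h>

/* Wraparound add of two unsigned bytes: keep only the low 8 bits. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating add of two unsigned bytes: clip to the range 0..255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Saturating add of two signed bytes: clip to the range -128..127. */
int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + b;
    if (sum > 127)
        return 127;
    if (sum < -128)
        return -128;
    return (int8_t)sum;
}

/* Saturating subtract of unsigned bytes: results below zero clip to 0. */
uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}
```

With these definitions, adding 200 and 100 as unsigned bytes yields 44 in wraparound mode (300 minus 256) and 255 in saturation mode. For a pixel brightness, the saturated answer is usually the one we want.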

 

PADD : Packed Add

            PADDB dest, src                     ; Add packed bytes

            PADDW dest, src                    ; Add packed words

            PADDD dest, src                     ; Add packed doubleword

These instructions add the data elements of the source operand to the data elements of the destination, and the result is written to the destination.  If the result exceeds the data range limit, it wraps around.

 

PADDS : Packed Add with Saturation

            PADDSB dest, src                   ; Add packed bytes w/saturation

            PADDSW dest, src                  ; Add packed words w/saturation

These instructions are as above, but add with saturation instead of wrap around.

 

PADDUS : Packed Add Unsigned with Saturation

            PADDUSB dest, src                ; Add packed bytes unsigned w/saturation

            PADDUSW dest, src               ; Add packed words unsigned w/saturation

The PADDUS (Packed Add Unsigned with Saturation) instructions add the packed unsigned data elements of the source operand to the packed unsigned data elements of the destination operand and saturate the results.   Note that there is no separate Packed Add Unsigned without saturation: wraparound addition produces the same bit pattern for signed and unsigned values, so PADDB, PADDW and PADDD serve both.

 

PSUB : Packed Subtract

            PSUBB dest, src                      ; Subtract packed bytes

            PSUBW dest, src                     ; Subtract packed words

            PSUBD dest, src                      ; Subtract packed doubleword

The PSUB (Packed Subtract) instructions subtract the data elements of the source operand from the data elements of the destination operand. If the result is larger or smaller than the data-range limit for the data type, it wraps around.

 

PSUBS : Packed Subtract with Saturation

            PSUBSB dest, src                    ; Subtract packed bytes w/saturation

            PSUBSW dest, src                   ; Subtract packed words w/saturation

Similar to add with saturation, but subtracts instead.

 

PSUBUS : Packed Subtract Unsigned with Saturation

            PSUBUSB dest, src                 ; Subtract packed bytes unsigned w/saturation

            PSUBUSW dest, src                ; Subtract packed words unsigned w/saturation

 

 

As a simple example of  packed add, consider the following:

.data

vector1 byte 1,2,3,4,5,6,7,8

vector2 byte 1,1,1,1,1,1,1,1

.code

            movq mm0, qword ptr vector1

            movq mm1, qword ptr vector2

            paddb mm0,mm1                                 ; mm0 now contains 2,3,4,5,6,7,8,9
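The key property of a packed add is that each element is added independently, so a carry never spills from one byte into its neighbor. The following C function is a scalar model of PADDB written for this handout (it is not an Intel intrinsic) that makes this explicit.

```c
#include <stdint.h>

/* Scalar model of "paddb dest, src": eight independent byte additions,
   each wrapping around on overflow. No carry crosses a byte boundary. */
void paddb(uint8_t dest[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dest[i] = (uint8_t)(dest[i] + src[i]);
}
```

Applied to vector1 and vector2 above, this leaves 2,3,4,5,6,7,8,9 in dest. If the first byte of dest were 255 and the first byte of src were 1, the first result byte would wrap to 0 and the second byte would be unaffected.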

 

As an example of saturated arithmetic, let us consider the absolute difference of two arrays of bytes: there are no IF statements in MMX, but it is necessary to implement the following algorithm:

 

for (i = 0; i < 8; i++) {

    if (a[i] > b[i])

        c[i] = a[i] - b[i];

    else

        c[i] = b[i] - a[i];

}

 

This algorithm can be coded using saturated subtractions on unsigned values: subtracting a from b and b from a.  Because the arithmetic saturates, one of the two results will be zero rather than negative, and the other will be the desired absolute difference.  Since we cannot know in advance which is which, the final result is obtained by ORing them together:

 

c = (a - b) OR (b - a)

 

Assuming that the MMX registers named MM0 and MM1 hold the source vectors, the following code will compute the absolute difference and store it into MM0:

 

movq mm2, mm0                       ; make a copy of MM0 in MM2

psubusb mm0, mm1                  ; compute difference one way

psubusb mm1, mm2                  ; compute difference the other way

por mm0, mm1                         ; OR them together (packed OR instruction)
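The same sequence can be traced in C. The sketch below models PSUBUSB and POR on arrays of eight bytes and follows the assembly step by step, using a third array as the scratch copy of the first operand. The function names mirror the instruction mnemonics but are plain C written for this handout.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of "psubusb dest, src": per-byte unsigned subtract,
   clipping at zero instead of going negative. */
void psubusb(uint8_t dest[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dest[i] = (uint8_t)(dest[i] > src[i] ? dest[i] - src[i] : 0);
}

/* Scalar model of "por dest, src": bitwise OR of all 64 bits. */
void por(uint8_t dest[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dest[i] |= src[i];
}

/* c = (a - b) OR (b - a), with both subtractions saturating at zero. */
void abs_diff8(const uint8_t a[8], const uint8_t b[8], uint8_t c[8])
{
    uint8_t mm0[8], mm1[8], mm2[8];
    memcpy(mm0, a, 8);
    memcpy(mm1, b, 8);
    memcpy(mm2, mm0, 8);   /* scratch copy of a */
    psubusb(mm0, mm1);     /* mm0 = a - b, clipped at 0 */
    psubusb(mm1, mm2);     /* mm1 = b - a, clipped at 0 */
    por(mm0, mm1);         /* one of the two is 0, so OR picks the other */
    memcpy(c, mm0, 8);
}
```

For example, with a[i] = 10 and b[i] = 20 the first subtraction gives 0, the second gives 10, and the OR gives 10.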

 

There are three multiply instructions:

 

PMULLW – Packed Multiply Low.  This multiplies the four signed words of the destination operand by the corresponding four signed words of the source operand and writes the low-order 16 bits of each 32-bit product to the destination.

PMULHW – Packed Multiply High.  This multiplies the four signed words of the destination operand by the corresponding four signed words of the source operand and writes the high-order 16 bits of each 32-bit product to the destination.

 

PMADDWD - Packed Multiply and Add

This instruction multiplies the four signed words of the destination operand by the corresponding four signed words of the source operand, producing four 32-bit products. The products of the two high-order words are summed and stored in the upper doubleword of the destination operand, and the products of the two low-order words are summed and stored in the lower doubleword of the destination operand.  This process is illustrated in the figure below:

 

This type of operation is useful, for example, in computing the dot product of two vectors.  This type of basic operation is commonly used in filters and various multimedia applications.  If our vector is longer than four elements, we will need to perform this operation in groups of four until we have processed the entire vector.
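As a rough illustration of how PMADDWD builds a dot product, the C sketch below models the instruction on four 16-bit elements and then applies it to a longer vector in groups of four. The names pmaddwd and dot_product are chosen for this handout, the vector length is assumed to be a multiple of four, and the one overflow corner case of the real instruction (both words of a pair equal to -32768 in both operands) is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "pmaddwd dest, src": four signed 16 x 16-bit
   multiplies, then adjacent products are added to give two 32-bit sums.
   out[0] is the lower doubleword, out[1] the upper doubleword. */
void pmaddwd(const int16_t dest[4], const int16_t src[4], int32_t out[2])
{
    out[0] = (int32_t)dest[0] * src[0] + (int32_t)dest[1] * src[1];
    out[1] = (int32_t)dest[2] * src[2] + (int32_t)dest[3] * src[3];
}

/* Dot product of two vectors, processed four elements at a time.
   Assumes n is a multiple of four and that the total fits in 32 bits. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        int32_t pair[2];
        pmaddwd(a + i, b + i, pair);
        total += pair[0] + pair[1];   /* combine the two partial sums */
    }
    return total;
}
```

In real MMX code the two partial sums would normally be accumulated in a register with PADDD inside the loop and only combined once at the end.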

 

There are two types of comparison instructions:

 

PCMPEQ : Packed Compare for Equal

            PCMPEQB dest, src                            ; Compare packed bytes

            PCMPEQW dest, src                            ; Compare packed words

            PCMPEQD dest, src                            ; Compare packed dwords

These instructions compare the data elements in the destination operand to the corresponding data elements in the source operand. If two data elements are equal, the corresponding data element in the destination register is set to all ones; if they are not, it is set to all zeros.

 

PCMPGT : Packed Compare for Greater Than

            PCMPGTB dest, src                            ; Compare packed bytes

            PCMPGTW dest, src                           ; Compare packed words

            PCMPGTD dest, src                            ; Compare packed dwords

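Similar to above, but uses a signed greater-than comparison instead of equality: an element of the destination is set to all ones if it is greater than the corresponding element of the source, and to all zeros otherwise.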
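Because the comparison instructions produce a mask of all ones or all zeros rather than setting flags, the result is typically combined with AND, AND NOT and OR to choose between two values without a branch. The C sketch below models PCMPGTB and uses the mask to compute a per-byte maximum. This selection idiom is a common technique rather than something prescribed by the instruction set, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Scalar model of "pcmpgtb dest, src": each byte of the mask becomes
   0xFF when the signed value in dest is greater than the one in src,
   and 0x00 otherwise. */
void pcmpgtb(const int8_t dest[8], const int8_t src[8], uint8_t mask[8])
{
    for (int i = 0; i < 8; i++)
        mask[i] = (dest[i] > src[i]) ? 0xFF : 0x00;
}

/* Branch-free per-byte maximum built from the mask:
   result = (a AND mask) OR (b AND NOT mask). */
void max_s8(const int8_t a[8], const int8_t b[8], int8_t out[8])
{
    uint8_t mask[8];
    pcmpgtb(a, b, mask);
    for (int i = 0; i < 8; i++) {
        uint8_t ua = (uint8_t)a[i];
        uint8_t ub = (uint8_t)b[i];
        out[i] = (int8_t)((ua & mask[i]) | (ub & (uint8_t)~mask[i]));
    }
}
```

This is the same trick as in the absolute difference example: since MMX has no IF statement, the decision is folded into the data.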

 

For a description of the other instructions, see the references at the end of this document.

 

 

How much performance is gained from using MMX?   In theory, we could see anywhere from a 2 to 8 fold increase in performance.  However, we will typically need additional instructions to parallelize the application which will slow things down.  More importantly, some applications simply do not have a large number of items that can run in parallel and therefore will see little or no increase in performance.

 

Another downside is that most compilers do not generate these instructions. You have to learn the instruction set and then code the routines in assembly yourself.  Fortunately, Intel has released a variety of toolkits that programmers can use from high-level languages (e.g. C++) to take advantage of the MMX instruction set in typical scenarios.

 

The figure below shows performance increases on benchmarks that were coded using MMX vs. those coded without MMX. 

 

 

Beyond MMX

 

There now exist a number of extensions to MMX:

 

SSE -  Streaming SIMD Extensions

SSE enhances the Intel x86 architecture in four ways:

  • 8 new 128-bit SIMD floating-point registers that can be directly addressed.  SSE supports operating on four 32 bit floating point values in parallel.
  • 50 new instructions that work on packed floating-point data;
  • 8 new instructions designed to control cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used;
  • 12 new instructions that extend the MMX instruction set.

This set enables the programmer to develop algorithms that mix packed single-precision floating-point operations and packed integer operations, using SSE and MMX instructions respectively. 

 

SSE2 – Streaming SIMD Extensions 2

The major new feature in SSE2 is the ability to process 64-bit (double-precision) floats.  However, the register size is still 128 bits, so we can only process two 64-bit floats in parallel.  SSE2 also extends the MMX integer instructions so that they can operate on the 128-bit registers.

 

3DNOW!

The AMD 3D Now! technology was AMD’s answer to MMX and SSE. It provides 21 additional instructions to support high-performance 3D graphics and audio processing. The 3D Now! instructions are vector instructions that operate on 64-bit registers, each divided into two 32-bit single-precision floats.   As with MMX and SSE, programs must be written specifically to use the 3D Now! instruction set, causing potential incompatibilities among software.

 

AltiVec

AltiVec is Motorola’s counterpart to MMX and SSE, and is used in the Macintosh PowerPC.  Through a 128-bit vector, it supports:

  • 16-way parallelism for 8-bit signed and unsigned integers
  • 8-way parallelism for 16-bit signed and unsigned integers
  • 4-way parallelism for 32-bit signed and unsigned integers and IEEE floats

 

MMX Code Samples

 

The first example shows how MMX can be used to sum up the values in an array of 80 numbers.  To use it, we must add the .MMX directive to MASM.

 

Include Irvine32.inc

.686

.MMX

.data

array byte 80 dup(1)

sumeight byte 8 dup(?)

 

.code

main proc

      call SumNumsNonMMX

      call SumNumsWithMMX    

      exit 

main endp

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Sums all 80 numbers the old way

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SumNumsNonMMX proc

      ; Sum the traditional way

      mov ebx, offset array

      mov eax, 0 

      mov ecx, 0

SumLoop:

      movzx edx, byte ptr [ebx]

      add eax, edx

      inc ebx

      inc ecx

      cmp ecx, 80

      jl SumLoop 

      call writeint                 ; Total of 80

      call crlf

      ret

SumNumsNonMMX endp

 

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Sums all 80 numbers using MMX

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SumNumsWithMMX proc

      mov ebx, offset array

      movq mm0, qword ptr [ebx]     ; Load the first 8 numbers

      add ebx, 8                    ; Advance to the next group of 8

      mov ecx, 8        ; Already read in first 8 numbers

SumLoopMMX:

      movq mm1, qword ptr [ebx]     ; Move in next 8 numbers

      paddb mm0, mm1                ; Add all 8 numbers in parallel

      add ebx, 8

      add ecx, 8

      cmp ecx, 80

      jl SumLoopMMX

      ; Need to sum each byte in mm0

      movq qword ptr sumeight, mm0

      mov ecx, 8

      mov eax, 0

      mov ebx, offset sumeight

SumFinalMMX:

      movzx edx, byte ptr [ebx]

      add eax, edx

      inc ebx

      loop SumFinalMMX

      call writeint

      call crlf

      emms                          ; Leave MMX state so floating point code can run

      ret

SumNumsWithMMX endp

     

end main

 

You will notice that a lot more code is necessary to perform the MMX add.  This is because we need extra code to step through the data in groups of eight and to also compute the sum of the final eight values.  However, the total number of instructions executed is smaller: instead of looping 80 times, the MMX sum loop repeats only 9 times, followed by a single loop of 8 iterations to compute the final sum.  Note that PADDB keeps each of the eight running sums in a single byte, so this approach is only correct while every running sum stays below 256; here each one reaches only 10.
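The structure of the MMX version (eight running sums followed by one final pass that adds them together) can be modeled in C as follows. This is a sketch for reasoning about the algorithm, not a translation of the assembly; the function names are invented, and the array length is assumed to be a multiple of eight.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the MMX approach: keep eight byte-wide running sums (the
   eight bytes of mm0), then add them together at the end. Each running
   sum wraps at 256, as paddb does. Assumes n is a multiple of 8. */
unsigned sum_bytes_lanes(const uint8_t *data, size_t n)
{
    uint8_t lanes[8] = {0};
    for (size_t i = 0; i < n; i += 8)          /* one paddb per group */
        for (int j = 0; j < 8; j++)
            lanes[j] = (uint8_t)(lanes[j] + data[i + j]);

    unsigned total = 0;
    for (int j = 0; j < 8; j++)                /* final pass over 8 sums */
        total += lanes[j];
    return total;
}

/* Plain one-at-a-time sum, for comparison. */
unsigned sum_bytes_scalar(const uint8_t *data, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}
```

For 80 values of 1 both functions return 80. For 80 values of 30, each running sum should reach 300 but wraps to 44, so the first function returns 352 while the true total is 2400. Wider accumulators (for example, unpacking bytes to words before adding) would be needed for such data.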

 

The next example is assembled as a 16-bit real-mode program.   It switches to 320x200 graphics mode and sets up a palette of 64 shades of green.  It then fades the screen in and out from black to bright green by incrementing and decrementing the value of each pixel on the screen.   In non-MMX mode, this is done one pixel at a time.  In MMX mode, this is done 8 pixels at a time.

 

Include Irvine16.inc

 

.686

.MMX

.data

eightOnes byte 1,1,1,1,1,1,1,1

eightNegs byte -1,-1,-1,-1,-1,-1,-1,-1

 

.code

main proc

      mov ax, @data

      mov ds, ax

 

      ; Set up vectors of -1 and +1 in the MMX registers   

      movq mm1, qword ptr eightOnes

      movq mm2, qword ptr eightNegs

     

      ; Set video mode to graphics

      mov ah, 0

      mov al, 13h

      int 10h

           

      ; Set ES to video graphics

      mov ax, 0A000h

      mov es, ax

     

      call SetupPalette

      call ZeroScreen

      ;call FadeScreenNonMMX        ; Switch comments to change modes

      call FadeScreenWithMMX        ; Switch comments to change modes

 

      ; restore text mode

      mov ah, 0

      mov al, 3

      int 10h

     

      exit

main endp

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Increment and then decrement each pixel in the screen

; so we get a fade effect.  Do this without MMX,

; one pixel at a time.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

FadeScreenNonMMX proc

      ; Fill screen repeatedly with colors

      mov al, 1

      mov cx, 0

IncrementLoop:   

      call CheckKey    

      cmp dx, 1

      je EndDraw

      call UpdateScreen       ; non MMX

      inc cx

      cmp cx, 63

      jl IncrementLoop 

      mov al, -1 

DecrementLoop:

      call CheckKey    

      cmp dx, 1

      je EndDraw

      call UpdateScreen       ; Non MMX

      dec cx

      cmp cx, 1

      jne DecrementLoop

      mov al, 1

      jmp IncrementLoop      

EndDraw:

      ret

FadeScreenNonMMX endp

 

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Increment and then decrement each pixel in the screen

; so we get a fade effect.  Do this with MMX,

; eight pixels at a time.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

FadeScreenWithMMX proc

      ; Fill screen repeatedly with colors

      mov al, 1

      mov cx, 0

IncrementLoop:   

      call CheckKey    

      cmp dx, 1

      je EndDraw

      call UpdateScreenMMX          ; with MMX

      inc cx

      cmp cx, 63

      jl IncrementLoop 

      mov al, -1 

DecrementLoop:

      call CheckKey    

      cmp dx, 1

      je EndDraw

      call UpdateScreenMMX          ; with MMX

      dec cx

      cmp cx, 1

      jne DecrementLoop

      mov al, 1

      jmp IncrementLoop      

EndDraw:

      ret

FadeScreenWithMMX endp

 

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Adds AL to the color of each pixel on the screen

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

UpdateScreen proc

      push bx

     

      ;call WaitVrt

      mov ebx,0

UpdateLoop:

      add es:[bx], al

      inc ebx

      cmp ebx, 64000          ; 64000= 320*200

      jne UpdateLoop

     

      pop bx

      ret

UpdateScreen endp

 

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Using MMX, either add or subtract 8 bytes at a time

; and send them to the video screen

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

UpdateScreenMMX proc

      push bx

      mov ebx,0  

      ;call WaitVrt

      cmp al, -1

      je UpdateLoopNeg

           

UpdateLoopPos:

      movq mm0,es:[bx]

      paddb mm0, mm1

      movq es:[bx],mm0

      add ebx,8

      cmp ebx,64000

      jl UpdateLoopPos

      jmp ExitUpdate

     

UpdateLoopNeg:

      movq mm0,es:[bx]

      paddb mm0, mm2

      movq es:[bx],mm0 

      add ebx,8

      cmp ebx,64000

      jl UpdateLoopNeg 

ExitUpdate:

      pop bx

      ret  

UpdateScreenMMX endp

 

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Set palette to 64 shades of green

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SetupPalette proc

      mov cx, 0

PalLoop:

      mov dx, 3c8h            ; Video palette port

      mov al, cl        ; The color index we want to set

      out dx, al        ; Says to set color index to AL

     

      ; Load shade of green using RGB

      mov dx, 3c9h            ; RGB color for current color index

      mov al, 0         ; 0 red

      out dx, al

      mov al, cl        ; CL green, varies in intensity with loop

      out dx, al

      mov al, 0         ; 0 blue

      out dx, al

      inc cx

      cmp cx, 64

      jne PalLoop

      ret

SetupPalette endp

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Sets every pixel on the screen to zero

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

ZeroScreen proc

      ; Zero out the screen

      mov bx,0

      mov al, 0

UpdateLoop:

      mov es:[bx], al

      inc bx

      cmp bx, 64000           ; 64000= 320*200

      jne UpdateLoop   

      ret

ZeroScreen endp  

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Wait for the Vertical Retrace to begin

; Use if there is flashing on the screen

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

WaitVrt proc     

      push ax

      push dx

      mov dx, 3dah

      Vrt:

      in al, dx

        test al, 1000b

      jnz Vrt                       ; Wait for any retrace already in progress to end

      NoVrt:

            in al, dx

        test al, 1000b

        jz NoVrt              ; Wait for the next retrace to begin

        pop dx

        pop ax

        ret

WaitVrt endp

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; CheckKey

;

; This procedure sets DX to 1 if there

; is a key waiting to be read, and sets

; DX to 0 if there is no keypress.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

CheckKey PROC

   push ax

   mov ah, 11h

   int 16h              ; BIOS interrupt for keypress

   jz NoKeyWaiting

   mov dx, 1

   pop ax

   ret

NoKeyWaiting:

   mov dx, 0

   pop ax

   ret

CheckKey ENDP

 

 

end main

 

References

 

MMX Primer: http://docs.tommesani.com/MMXPrimer.html

 

MMX Technology Overview: http://www.intel.com/technology/itj/q31997/articles/art_2.htm