# SIMD: Single Instruction - Multiple Data

• Motivation for SIMD

• Computer programs can be:

 Compute bound - the bottleneck in executing the program is the CPU

 I/O bound - the bottleneck in executing the program is the I/O operations

• Many compute-bound programs are numerical analysis applications, such as "weather prediction", "nuclear reactor simulation", etc.

• These problems often use vector and/or matrix operations

• A vector is just an array of numerical values, e.g.:

```
     +-   -+
     |  1  |
 v = |  1  |
     |  1  |
     +-   -+
```

• A matrix is a two-dimensional array of numerical values, e.g.:

```
     +-       -+
     | 1  0  0 |
 A = | 1  0  0 |
     | 1  0  0 |
     +-       -+
```

• In many operations involving vectors and matrices, the same operation is performed on many (different) rows or columns

```
 +-   -+     +-   -+     +-   -+
 |  1  |     |  5  |     |  6  |
 |  2  |  +  |  4  |  =  |  6  |
 |  3  |     |  1  |     |  4  |
 +-   -+     +-   -+     +-   -+
```

(The addition operation is applied to all rows)
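The element-wise pattern above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `vector_add` is made up for this example, and an ordinary Python loop still runs one addition at a time, whereas SIMD hardware would apply the addition to all rows at once.

```python
def vector_add(u, v):
    """Add two equal-length vectors element by element."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # The same '+' operation is applied to every row;
    # only the operands (the data) differ from row to row.
    return [a + b for a, b in zip(u, v)]


u = [1, 2, 3]
v = [5, 4, 1]
print(vector_add(u, v))
```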

• Example 2: Matrix Multiplication

```
     +-           -+
     | A11 A12 A13 |
 A = | A21 A22 A23 |
     | A31 A32 A33 |
     +-           -+
```

```
     +-           -+
     | B11 B12 B13 |
 B = | B21 B22 B23 |
     | B31 B32 B33 |
     +-           -+
```

Then:

```
           +-           -+
           | C11 C12 C13 |
 C = A*B = | C21 C22 C23 |
           | C31 C32 C33 |
           +-           -+

 where: Cij = Ai1*B1j + Ai2*B2j + Ai3*B3j
        (for i = 1, 2, 3 and j = 1, 2, 3)
```

• Each row of matrix A is multiplied with each column of matrix B

• The operations performed are the same (multiply and then add): SAME instructions

But each row of matrix A uses DIFFERENT operands (data)
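A minimal Python sketch of the formula for Cij, written for matrices of any compatible size. The name `mat_mul` is hypothetical, and Python indexes from 0 where the formula counts from 1. Note that the innermost statement is the same multiply-and-add every time; only the indices, and therefore the operands, change.

```python
def mat_mul(A, B):
    """Return C = A * B using C[i][j] = sum over k of A[i][k] * B[k][j]."""
    rows, inner, cols = len(A), len(B), len(B[0])
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must match rows of B")
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):          # row of A (and of C)
        for j in range(cols):      # column of B (and of C)
            for k in range(inner):
                # SAME instruction (multiply, then add), DIFFERENT data
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mat_mul(A, B))
```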

• Example Parallel execution

• To give you a better idea of what is involved in performing instructions in parallel, we will consider a parallel matrix-matrix multiplication in some detail

• Suppose we want to multiply the following 2 matrices:

• We will go through the steps in the parallel matrix-matrix multiplication

While going through the example, note that the same operations are performed on each column.

Initialization step (done once):

Processing Row 1:

Processing Row 2:

And so on....
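The step names above (initialize once, then process one row at a time) can be sketched as follows. This is one plausible reading of the procedure, not a transcription of it: it assumes each result row is built by repeatedly taking one scalar from A, multiplying it with a whole row of B, and adding the products to the running result, so that every column receives the same multiply-add. The name `mat_mul_by_rows` is made up for this example, and the list comprehension stands in for what vector hardware would do in a single instruction.

```python
def mat_mul_by_rows(A, B):
    """Matrix product organized as scalar-times-row accumulations."""
    rows, inner, cols = len(A), len(B), len(B[0])

    # Initialization step (done once): all result elements start at 0.
    C = [[0] * cols for _ in range(rows)]

    for i in range(rows):              # "Processing Row i"
        for k in range(inner):
            scalar = A[i][k]           # one scalar multiplier
            # One "vector operation": the same multiply-add is
            # applied to every column of the result row.
            C[i] = [c + scalar * b for c, b in zip(C[i], B[k])]
    return C


print(mat_mul_by_rows([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```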

• SIMD computers are the first type of parallel computers specially designed to perform

• Architecture of SIMD computers

• SIMD computers are also known as vector computers - because they provide a special set of machine instructions that operate on vectors.

• SIMD computers also have special vector registers that the vector instructions operate on.

• Here is a schematic of the Cray-1 vector processor (CPU):

• Notice the Cray-1 has multiple address registers

The Cray has multiple system busses:

• The Cray can send out multiple memory requests with different addresses

So it can fetch multiple (up to 64) operands from memory at the same time, but ONLY IF the operands are stored in DIFFERENT memory banks
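To make the memory-bank condition concrete, here is a small sketch. It assumes a simple interleaved layout in which the word at address `a` is stored in bank `a mod num_banks`; that layout, the bank counts used below, and the function names are illustrative assumptions, not details taken from the Cray-1 schematic.

```python
def banks_touched(base, stride, count, num_banks):
    """Bank number hit by each of `count` operands starting at `base`."""
    # Assumed layout: address a lives in bank (a mod num_banks).
    return [(base + i * stride) % num_banks for i in range(count)]


def conflict_free(base, stride, count, num_banks):
    """True if every operand falls in a different bank."""
    banks = banks_touched(base, stride, count, num_banks)
    return len(set(banks)) == len(banks)


# Consecutive addresses spread over the banks: all fetches can overlap.
print(banks_touched(0, 1, 4, 8), conflict_free(0, 1, 4, 8))

# A stride equal to the bank count hits the same bank every time.
print(banks_touched(0, 8, 4, 8), conflict_free(0, 8, 4, 8))
```

Under this assumed layout, operands stored at consecutive addresses land in different banks, while operands spaced a multiple of the bank count apart all compete for one bank and must be fetched one after another.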

• Example execution inside the Cray-1:

To perform this step in the matrix multiplication:

It must

• Initialize the result vector registers to 0 - in parallel: (only once per matrix multiplication)

• Fetch scalar multiplier into scalar register:

• Fetch vector operand from memory into (another set of) vector registers - in parallel:

• Then perform these 3 multiply operations (also in parallel):

• 2 x 3
• 2 x 6
• 2 x 2

• Then perform these 3 addition operations (also in parallel):

• 0 + (2 x 3)
• 0 + (2 x 6)
• 0 + (2 x 2)

• The Vector Functional Units contain multiple ALUs that can perform multiple operations (on different values) simultaneously, under the control of one and the same vector instruction
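The register-level steps listed above can be traced with the same numbers (scalar 2, vector operand 3, 6, 2). The sketch below only mimics the data flow: Python lists play the role of vector registers, the function name is hypothetical, and each list comprehension corresponds to what a single vector instruction would do across all elements at once.

```python
def scalar_vector_multiply_add(acc, scalar, vec):
    """Return acc + scalar * vec, element by element."""
    # Vector multiply: the scalar times every element of the operand.
    products = [scalar * x for x in vec]
    # Vector add: each product is added to its own accumulator element.
    return [a + p for a, p in zip(acc, products)]


result = [0, 0, 0]     # result vector register, initialized once
scalar = 2             # scalar multiplier in a scalar register
operand = [3, 6, 2]    # vector operand fetched from memory

result = scalar_vector_multiply_add(result, scalar, operand)
print(result)
```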

• Footnote

• Today, the SIMD architecture is found in the Graphics Processing Unit (GPU)

• A popular programming interface (API) for GPUs is CUDA (Compute Unified Device Architecture)