Computer Architecture Notes Part-1

TOPIC - Execution Cycle Example

 

LOAD A, R1 ; Load the value from memory location A into register R1 

LOAD B, R2 ; Load the value from memory location B into register R2 

ADD R1, R2 ; Add the values in registers R1 and R2, placing the sum in R2

STORE R2, C ; Store the result (in R2) back into memory location C

Assume that memory location A initially holds the value 5 and memory location B holds the value 7.

 

LOAD A, R1:

Fetch: The CPU fetches the instruction "LOAD A, R1" from memory.

Decode: The CPU decodes the LOAD instruction and identifies that it needs to load a value from memory location A into register R1.

Execute: The CPU accesses memory location A and retrieves the value stored there (which is 5). It then stores this value in register R1.

Write Back: Since it's a LOAD instruction, the result is stored directly in the specified register (R1). There's no need for an explicit write-back to memory.

LOAD B, R2:

Fetch: The CPU fetches the instruction "LOAD B, R2" from memory.

Decode: The CPU decodes the LOAD instruction and identifies that it needs to load a value from memory location B into register R2.

Execute: The CPU accesses memory location B and retrieves the value stored there (which is 7). It then stores this value in register R2.

Write Back: Again, since it's a LOAD instruction, the result is stored directly in the specified register (R2). There's no need for an explicit write-back to memory.

ADD R1, R2:

Fetch: The CPU fetches the instruction "ADD R1, R2" from memory.

Decode: The CPU decodes the ADD instruction and identifies that it needs to add the values in registers R1 and R2.

Execute: The CPU fetches the values from registers R1 (which holds the value from memory location A, i.e., 5) and R2 (which holds the value from memory location B, i.e., 7). It adds these values together (5 + 7 = 12) and then stores the result (12) in register R2.

Write Back: Since it's an ALU (Arithmetic Logic Unit) operation, the result is stored directly in the specified register (R2). There's no need for an explicit write-back to memory.

STORE R2, C:

Fetch: The CPU fetches the instruction "STORE R2, C" from memory.

Decode: The CPU decodes the STORE instruction and identifies that it needs to store the value in register R2 into memory location C.

Execute: The CPU reads the value from register R2 (which holds the result of the addition, i.e., 12) and prepares the memory access to location C.

Write Back: The value from R2 (12) is written to memory location C.

After completing the above instructions, memory location C will now contain the value 12, which is the result of adding the values in memory locations A and B. The write-back step is crucial in this example, as it allows us to save the result of the computation back to memory for future use or further processing.
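
To make the walkthrough concrete, here is a minimal C sketch that mirrors the four instructions using ordinary variables; the names memA, memB, memC, regR1, and regR2 are illustrative stand-ins for the memory locations and registers, not part of any real instruction set.

#include <stdio.h>

int main(void) {
    int memA = 5, memB = 7, memC = 0;   /* memory locations A, B, C */
    int regR1, regR2;                   /* registers R1 and R2 */

    regR1 = memA;           /* LOAD A, R1  : R1 <- Mem[A] */
    regR2 = memB;           /* LOAD B, R2  : R2 <- Mem[B] */
    regR2 = regR1 + regR2;  /* ADD R1, R2  : R2 <- R1 + R2 */
    memC  = regR2;          /* STORE R2, C : Mem[C] <- R2 */

    printf("C = %d\n", memC);           /* prints 12 */
    return 0;
}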

 

TOPIC - Amdahl's Law

 

Amdahl's Law, named after computer architect Gene Amdahl, is a fundamental principle in parallel computing and performance analysis. It provides a theoretical foundation for understanding the potential speedup of a computation when applying parallel processing techniques.

The law is often expressed in the form of a formula:

Speedup(S) = 1 / [(1 - P) + (P / N)],

where:

  • Speedup(S) represents the improvement in the execution time of a task when using multiple processing elements (such as CPU cores or threads) compared to using a single processing element.
  • P is the fraction of the task that can be parallelized (i.e., the portion of the computation that can be divided and executed simultaneously by multiple processing elements).
  • N is the number of processing elements (e.g., CPU cores) used for the parallel execution.

Amdahl's Law implies that even if you make a portion of a program fully parallelizable, the maximum speedup that can be achieved is limited by the non-parallelizable part of the program. In other words, there is a theoretical upper bound to how much a computation can benefit from parallelization.

For example, if 60% of a program can be parallelized (P = 0.6), then the maximum theoretical speedup can be calculated as follows:

Speedup(S) = 1 / [(1 - 0.6) + (0.6 / N)]

Let's say we have 4 processing elements (N = 4):

Speedup(S) = 1 / [(1 - 0.6) + (0.6 / 4)]
Speedup(S) = 1 / [0.4 + 0.15]
Speedup(S) = 1 / 0.55
Speedup(S) ≈ 1.82

In this example, the maximum speedup achievable with 4 processing elements is approximately 1.82 times faster than the sequential execution of the program.
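
As a quick sanity check of this calculation, here is a small C helper that evaluates the formula for a given P and N; the function name amdahl_speedup is only illustrative. It also shows that even with a very large N the speedup approaches 1 / (1 - P) = 2.5, the theoretical upper bound implied by the law.

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - P) + P / N) */
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("P = 0.6, N = 4      -> speedup = %.2f\n", amdahl_speedup(0.6, 4.0));   /* ~1.82 */
    printf("P = 0.6, N -> large -> speedup = %.2f\n", amdahl_speedup(0.6, 1e9));   /* ~2.50 */
    return 0;
}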

Amdahl's Law is essential for understanding the trade-offs involved in parallelizing algorithms and choosing the appropriate number of processing elements to achieve the best performance improvement. It highlights the importance of optimizing both the parallelizable and non-parallelizable parts of a program to achieve significant speedup in parallel computing environments.

 

Let's take a closer look at the formulation of Amdahl's Law:

Speedup(S) = 1 / [(1 - P) + (P / N)],

where:

  • Speedup(S) represents the improvement in execution time achieved by using N processing elements for parallel execution compared to using a single processing element for sequential execution.
  • P is the fraction of the computation that can be parallelized.
  • N is the number of processing elements (e.g., CPU cores) used for the parallel execution.

The goal of Amdahl's Law is to quantify the potential speedup when parallelizing a computation. The speedup is defined as the ratio of the execution time for the sequential version (one processing element) to the execution time for the parallel version (N processing elements). The larger the speedup, the more the computation benefits from parallelization.

Now, if we consider the extreme case where the entire computation can be parallelized (P = 1), the speedup becomes:

Speedup(S) = 1 / [(1 - 1) + (1 / N)]
Speedup(S) = 1 / [0 + (1 / N)]
Speedup(S) = N.

In this scenario, the speedup is exactly equal to the number of processing elements (N). This is because the entire computation is parallelizable, and adding more processing elements directly reduces the execution time by a factor of N.

When P = 0, Amdahl's Law reduces to:

Speedup(S) = 1 / [(1 - 0) + (0 / N)]
Speedup(S) = 1 / 1
Speedup(S) = 1.

In this case, the speedup is 1, which means there is no improvement in execution time as there is no parallelization. This serves as a reference point, and any speedup greater than 1 indicates that parallelization is beneficial.

The 1 in the numerator represents the normalized execution time of the sequential version, and the denominator, (1 - P) + (P / N), is the normalized execution time of the parallel version: the serial fraction is unchanged, while the parallelizable fraction is divided among N processing elements. Their ratio is the speedup.

 

Amdahl's Law also applies when only a fraction of a program benefits from a hardware enhancement, such as GPU acceleration. In that case, the number of processing elements N is replaced by the speedup factor of the enhanced portion.

To calculate the overall speedup gained while running a program on a system with a GPU compared to running it on a system without the GPU, you should use Amdahl's Law as follows:

Speedup (S) = 1 / [(1 - Fraction benefiting from GPU) + (Fraction benefiting from GPU / Speedup factor)]

Where:

  • Fraction benefiting from GPU: The fraction of the program that benefits from the GPU acceleration.
  • Speedup factor: The speedup factor provided by the GPU for the accelerated portion of the program.

Example:

A new Graphics Processing Unit (GPU) is added to a system, which speeds up the execution of graphics-related instructions by 6 times. If a program has 50% graphics-related instructions, what is the overall speedup gained while running the program on the system with the GPU compared to running it on the system without the GPU?

Solution:

Overall Speedup = 1 / ((1 - 0.50) + (0.50 / 6))
Overall Speedup = 1 / (0.50 + 0.0833)
Overall Speedup ≈ 1.71

So the program runs about 1.71 times faster on the system with the GPU than on the system without it.

TOPIC – Loop Unrolling and Loop Strip Mining

Loop unrolling involves expanding a loop's iterations by a fixed factor, executing multiple loop iterations within a single iteration. The primary goal of loop unrolling is to reduce loop overhead and enhance instruction-level parallelism.

for (int i = 0; i < N; i++) {
    sum += array[i];
}

The above loop can be unrolled by a factor of 2 as follows (with a cleanup step in case N is odd):

for (int i = 0; i + 1 < N; i += 2) {
    sum += array[i];
    sum += array[i + 1];
}
if (N % 2 != 0) {
    sum += array[N - 1];  /* handle the leftover element when N is odd */
}

 

Strip mining, also known as loop tiling, involves dividing a loop into smaller sub-loops with a fixed number of iterations each. The primary goal of strip mining is to improve cache utilization and memory access patterns.

Dividing a loop that processes an array into tiles of size 4:

 

for (int tile = 0; tile < N; tile += 4) {
    for (int i = tile; i < tile + 4 && i < N; i++) {
        sum += array[i];
    }
}

The code above is an example of loop strip mining, also known as loop tiling. It divides a loop into smaller sub-loops, or "tiles," and processes a fixed number of iterations within each tile. This technique is used to improve cache performance and memory access patterns by promoting spatial locality.

 

Let's break down how this code works step by step:

 

for (int tile = 0; tile < N; tile += 4):

This is the outer loop that controls the tiles. The variable tile takes the values 0, 4, 8, and so on while tile < N, so each tile covers up to 4 iterations of the inner loop.

 

for (int i = tile; i < tile + 4 && i < N; i++):

This is the inner loop that processes the elements within each tile. The variable i starts at tile and runs up to tile + 3, so it processes at most 4 iterations. The additional condition i < N ensures that the last tile does not read beyond the array's bounds.

 

sum += array[i];:

Inside the inner loop, this line adds the value of the i-th element of the array to the sum. It accumulates the sum of the elements processed within each tile.

 

The purpose of this code is to break down a larger loop into smaller tiles, each of which processes a fixed number of iterations. This can improve cache performance because it promotes better use of cache lines. When a tile accesses consecutive elements of the array, those elements are more likely to be present in the cache, reducing memory latency and improving overall performance.

 

Here's a hypothetical example to illustrate how the code works:

 

If N is 12, the outer loop will iterate 3 times (0, 4, and 8).

Within each iteration of the outer loop, the inner loop will iterate 4 times, processing elements from the array in groups of 4.

The sum will be updated based on the elements processed in each tile.

In practice, the actual improvement in performance depends on factors such as the array size, cache size, memory access patterns, and the architecture of the target machine.

 

In the strip mining version, the inner loop processes elements in chunks of 4 within each tile. This has several cache-related benefits:

 

Spatial Locality: The inner loop processes elements that are close to each other in memory (within the same tile), promoting spatial locality. The elements accessed within a tile are more likely to be stored in the cache together.

 

Cache Line Utilization: Modern processors load data into the cache in fixed-size blocks called cache lines. When the loop accesses elements within a tile, multiple elements are likely to be loaded into the cache within the same cache line. This reduces the number of cache lines needed to satisfy the memory accesses.

 

Reduced Cache Thrashing: With the smaller chunks of data accessed within each tile, the chances of cache thrashing (frequent eviction and reloading of cache lines) are reduced. This is because the elements within a tile are being used more frequently before moving on to the next tile.

 

Fewer Cache Misses: By processing multiple elements within each tile before moving to the next tile, the loop minimizes the number of cache misses compared to the original loop, where each iteration accesses a single element.

 

Overall, the strip mining technique optimizes memory access patterns and cache utilization, leading to improved cache performance. The loop is designed to work with the cache structure more effectively, reducing memory latency and enhancing the efficiency of data retrieval from memory.

 

Finding the optimal tile size for loop strip mining requires careful consideration and experimentation. The tile size should be chosen to balance various factors, including cache size, memory access patterns, and the nature of the computation being performed in the loop. There's no one-size-fits-all answer, as the optimal tile size can vary based on the specific problem, the target architecture, and the available hardware resources.

 

Here's a general approach to finding an appropriate tile size:

 

Analyze Cache Hierarchy: Understand the cache hierarchy of the target architecture, including the sizes of L1, L2, and possibly L3 caches. Different levels of cache have different sizes and access latencies.

 

Consider Data Dependencies: If the loop has dependencies between iterations (e.g., data dependencies that prevent parallel execution), choose a tile size that minimizes these dependencies within a tile.

 

Experiment: Start with a reasonable initial guess for the tile size and measure the performance of your loop using that tile size. Monitor factors like cache miss rates, memory bandwidth utilization, and execution time.

 

Vary Tile Size: Experiment with different tile sizes, both larger and smaller, to observe how they affect performance. Measure the impact on cache behavior and overall execution time.

 

Cache Miss Analysis: Monitor cache misses (e.g., L1 cache misses) using performance analysis tools provided by the hardware or software. Aim to minimize cache misses, as fewer cache misses generally lead to better performance.

 

Memory Bandwidth: Consider the memory bandwidth of your system. Smaller tile sizes may lead to better memory bandwidth utilization, but excessively small tile sizes might lead to increased loop control overhead.

 

Profile Real Data: Profile your loop with representative input data that simulates the actual workload. This can help you choose a tile size that performs well for the specific problem you're solving.

 

Consider Parallelism: If your architecture supports multi-threading or vectorization, choose a tile size that aligns well with these capabilities to further enhance performance.

 

Iterate: Iteratively refine your choice of tile size based on the insights gained from experimentation and analysis. Aim for a tile size that maximizes cache utilization and minimizes memory access latency.

 

Remember that the optimal tile size might not be a fixed value; it could vary depending on the context and specific system configurations. Performance optimization often involves a trade-off between different factors, so it's essential to strike a balance that leads to the best overall performance for your specific application.
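
As a rough illustration of the "Experiment" and "Vary Tile Size" steps above, the following C sketch times the array summation for a few candidate tile sizes. The timing with clock() is coarse and the candidate sizes are arbitrary, so treat this as a starting point for measurement rather than a definitive benchmark.

#include <stdio.h>
#include <time.h>

#define N 1000000

static int array[N];

/* Sum the array using strip mining with the given tile size. */
long long tiled_sum(int tile_size) {
    long long sum = 0;
    for (int tile = 0; tile < N; tile += tile_size) {
        for (int i = tile; i < tile + tile_size && i < N; i++) {
            sum += array[i];
        }
    }
    return sum;
}

int main(void) {
    int candidates[] = {4, 16, 64, 256, 1024};   /* arbitrary tile sizes to compare */
    for (int i = 0; i < N; i++) array[i] = 1;

    for (int c = 0; c < 5; c++) {
        clock_t start = clock();
        long long sum = tiled_sum(candidates[c]);
        double ms = 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("tile = %4d  sum = %lld  time = %.2f ms\n", candidates[c], sum, ms);
    }
    return 0;
}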

 

Apart from cache size, several other factors are important to consider when applying strip mining as a loop optimization technique. These factors can significantly impact the effectiveness of strip mining and the overall performance of your code:

 

Memory Hierarchy: Understanding the entire memory hierarchy, including L1, L2, and possibly L3 caches, main memory, and potentially even disk storage, is crucial. Different levels of memory have varying access latencies, bandwidths, and capacities that can influence strip mining decisions.

 

Data Dependencies: Analyze the data dependencies within the loop. Strip mining can introduce opportunities for parallelism, but if there are data dependencies between iterations, care must be taken to ensure correctness and avoid introducing race conditions.

 

Tile Size: The size of the tiles you choose can impact cache utilization and memory access patterns. A smaller tile size might offer better cache efficiency but might also increase loop control overhead. Experimentation is essential to determine the optimal tile size.

 

Vectorization and Parallelism: Strip mining can work in tandem with vectorization (SIMD operations) and parallelism (multi-threading). Consider how the tile size aligns with vector widths or threads in your architecture.

 

Loop Overhead: Strip mining can introduce additional loop control overhead due to the nested loop structure. The overhead might be acceptable for large iterations but could impact performance for small iteration counts.

 

Data Prefetching: Some architectures support hardware prefetching, where data is brought into the cache before it's actually needed. The effectiveness of prefetching can be influenced by the loop structure, memory access patterns, and tile size.

 

Cache Line Size: Cache lines are units of data transferred between memory and cache. Aligning the tile size with the cache line size can improve cache utilization by ensuring that each cache line is filled efficiently.
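
As a small illustration of this point, the snippet below derives a tile size from an assumed 64-byte cache line and 4-byte int elements; both values vary by platform, so in practice they should be queried or measured rather than hard-coded.

#include <stdio.h>

#define CACHE_LINE_BYTES 64   /* assumed cache line size; check the target platform in practice */

enum { ELEMS_PER_LINE = CACHE_LINE_BYTES / sizeof(int) };   /* 64 / 4 = 16 ints per cache line */
enum { TILE_SIZE = 4 * ELEMS_PER_LINE };                    /* 64 elements = 4 full cache lines per tile */

int main(void) {
    printf("tile size = %d elements\n", TILE_SIZE);
    return 0;
}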

 

Branching: Consider the impact of branching within the loop on the effectiveness of strip mining. Frequent branching can disrupt the execution of multiple iterations within a tile.

 

Compiler and Architecture: Different compilers and target architectures might have varying levels of support for strip mining and related optimizations. Compiler flags and options can influence how well strip mining is applied.

 

Code Maintainability: While performance is a critical factor, code maintainability and readability are also important. Complex strip mining configurations can make the code harder to understand and maintain.

 

Realistic Workloads: Test your code with real-world data and scenarios. Performance might vary depending on the actual data distribution and usage patterns.

 

Trade-offs: Strip mining involves trade-offs between loop control overhead, cache utilization, and parallelism. Carefully weigh these trade-offs to achieve the best balance for your specific application.

 

In practice, it's often a combination of multiple factors that determine the success of strip mining. Experimentation, performance profiling, and understanding the underlying hardware architecture are key to making informed decisions about strip mining parameters and achieving the desired performance improvements.

 

TOPIC – Working of Tomasulo's Algorithm

 

Instructions:

ADD R1, R2, R3 ; R1 = R2 + R3

SUB R4, R1, R5 ; R4 = R1 - R5

STORE R4, Mem[R6] ; Mem[R6] = R4

Reservation Stations and Execution Units:

ADD Reservation Station

SUB Reservation Station

Store Reservation Station

Adder Functional Unit

Subtractor Functional Unit

Memory Unit

 

Cycle 1:

 

ADD R1, R2, R3 is issued to the ADD Reservation Station.

SUB R4, R1, R5 is issued to the SUB Reservation Station.

 

Cycle 2:

 

ADD R1, R2, R3 is dispatched to the Adder Functional Unit and starts execution as operands R2 and R3 are available.

SUB R4, R1, R5 remains in the SUB Reservation Station, waiting for its operand R1, which will be produced by the ADD.

Cycle 3:

 

ADD R1, R2, R3 finishes execution and broadcasts its result on the Common Data Bus (CDB). Register R1 and the waiting SUB Reservation Station both capture the value.

SUB R4, R1, R5 now has its operand R1 and is ready to execute.

Cycle 4:

 

SUB R4, R1, R5 starts execution as its operand R1 is now available.

STORE R4, Mem[R6] is issued to the Store Reservation Station.

Cycle 5:

 

SUB R4, R1, R5 finishes execution and broadcasts its result on the CDB. Register R4 and the waiting Store Reservation Station capture the value.

Cycle 6:

 

STORE R4, Mem[R6] is dispatched to the Memory Unit, and data from R4 is stored in Mem[R6].

In this example, out-of-order, dependency-driven execution occurs as follows:

 

The ADD instruction begins execution in Cycle 2 because its operands (R2 and R3) are already available. The SUB instruction, although issued in the same cycle, must wait in its reservation station because it depends on the ADD's result.

 

The SUB instruction begins execution only in Cycle 4, once the ADD's result has been broadcast on the CDB. Similarly, the STORE instruction can be issued in Cycle 4 but waits in the Store Reservation Station until the SUB produces R4; the value is finally written to memory in Cycle 6.

 

Out-of-order execution allows instructions that are not dependent on each other to be executed concurrently, improving overall processor utilization and performance. Tomasulo's Algorithm and associated hardware structures manage the dependency tracking, issue, dispatch, execution, and result broadcasting to enable this efficient out-of-order execution.
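
For reference, here is a minimal C sketch of the bookkeeping a reservation station entry typically holds in Tomasulo's scheme; the struct layout and field names (Vj, Vk, Qj, Qk) follow the common textbook convention and are only illustrative, not a complete implementation.

#include <stdbool.h>

/* One reservation station entry (simplified). */
typedef struct {
    bool busy;     /* entry currently holds an instruction                 */
    int  op;       /* operation to perform (e.g., ADD, SUB, STORE)         */
    int  Vj, Vk;   /* operand values, once they are available              */
    int  Qj, Qk;   /* tags of the stations that will produce the operands  */
                   /* (0 means the value is already present in Vj/Vk)      */
    int  dest;     /* tag broadcast on the Common Data Bus with the result */
} ReservationStation;

/* An instruction may be dispatched to its functional unit only when both
   operands are ready, i.e., it is no longer waiting on another station. */
static bool ready_to_execute(const ReservationStation *rs) {
    return rs->busy && rs->Qj == 0 && rs->Qk == 0;
}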

 

Article by Sumit Malhotra

Published 01 Jan 2024