

# MCFQ: Leveraging Memory-level Parallelism and Application's Cache Friendliness for Efficient Management of Quasi-partitioned Caches



D. Kaseridis, M. F. Iqbal, J. Stuecheli and L. John

Department of Electrical & Computer Engineering
The University of Texas at Austin

#### Introduction

- Previous work to address the cache contention problem:
- Cache Partitioning Scheme (UCP MICRO '061
- Inefficient use of capacity, not scaling to large CMP
- Cache Replacement policies
- Hard to provide control over cache capacity → interference
- Cache-block dead time prediction & Cache Pseudopartitioning
- · Destructive interference still presents
- · No absolute control of capacity per core
- · Oblivious to applications memory behavior
- Memory behavior is important in sharing cache capacity

Pseudo-partitions best compromise <u>IF DONE</u>

## **Motivation**

#### Memory Behavior of Application of Last-level Cache



Cache Thrashing (lbm, milc)

#### Requirements of efficient Cache Management scheme

- Cache space allocated to applications proportionally to the real benefit of using the capacity
- 2. Thrashing applications have to be isolated
- 3. Fitting applications have to be guaranteed a minimum capacity
- Cache capacity allocation priority scheme: Friendly > Fitting > Thrashing (few LRU ways)

Need a mechanism to allocate priorities within the same category of applications → Interference Sensitivity Factor

# **Profilers**

- Average MLP
- Use of MSHR (Miss Status Hold Register)
- Two extra counters: aggregated number of misses in structure, overall number of L1 misses added
- Memory Behavior
- · Using the Cache miss profilers from MSA circuit
- Thrashing: MPKI<sub>max.capacity</sub> > Threshold<sub>Thrashing</sub>
- Fitting: MKPI < Threshold<sub>Fitting</sub>
- · Remaining Friendly

#### Interference Sensitivity Factor

- When more than one applications with same memory behavior
- Based on sensitivity an application to cache contention & how friendly is in sharing capacity
- Estimate: Interference Sensitivity Factor =  $\sum_{i=0}^{\#_{NGNy}-1} i * hits(i)$  j=0 MRU



#### Partition Scaling

- Need to know the actual average occupancy of cache capacity in LLC
- Less than 80% of ideal → adjust Insertion Point (IPs) at next epoch
- Two counters per core:
  - Occupancy Counter: Number of actual ways occupied in cache for a core
  - Cache accesses Counter per core
- Important addition to achieve performance targets

# **Scheme**

#### MCFQ Policies

- Insertion Policy (Insertion Points IPs)
  - Use UCP to find ideal partition sizes
  - Applications Cache friendliness
    - Cache Friendly higher priority
    - Cache Fitting intermediate
  - Cache Thrashing restricted to LRU position
- Interference sensitivity factor and Partitions Scaling (next slide)
- Promotion: Every hit moves line to core's Insertion Point
- Replacement Policy: LRU of the whole cache set



# Overall Scheme



Example of operation on 4-core CMP 16-way LLC

#### **Evaluation**

#### Configuration

| Memory<br>Subsyste<br>m | L1 D+I                                 | L2                                                     | Main Memory                                       | Memory<br>Controller                                                  | Prefetcher                             |
|-------------------------|----------------------------------------|--------------------------------------------------------|---------------------------------------------------|-----------------------------------------------------------------------|----------------------------------------|
|                         | 2-way, 64KB,<br>3 cycles, 64B<br>block | 16-way, 4M, 12<br>cycles, 64B<br>block, Pseudo-<br>LRU | 8GB, 16GB/s,<br>DDR3<br>1066-6-6, 16<br>Req./Core | 2 Controllers,<br>2 Ranks/<br>Controller, 32<br>Read/Write<br>entries | H/W stride<br>n,<br>8 streams/<br>core |
| Core<br>processor       | Frequency                              | Pipeline                                               | Reorder<br>Buffer /<br>Scheduler                  | Branch<br>Predictor                                                   |                                        |
|                         | 4GHz                                   | 30 stages /<br>4 wide fetch-<br>decode                 | 128 / 64 Entries                                  | Direct YAGS /<br>indirect 256<br>entries                              |                                        |

#### Experiments

|                                |                                      | 8 Cores                       |                                                                                 |  |
|--------------------------------|--------------------------------------|-------------------------------|---------------------------------------------------------------------------------|--|
| Bench. Group                   | Benchmarks                           | Bench. Group                  | Benchmarks                                                                      |  |
| Mix 1 – All Friendly           | soplex, bzip2, h264ref,<br>perlbench | Mix 1 – All Friendly          | soplex, omnetpp, peribench,<br>calculix, gromacs, dealtil, calculix,<br>gromacs |  |
| Mix 2 – All Fitting            | xalancbmk, wrf, tonto, gamess        |                               |                                                                                 |  |
| Mix 3 – All Thrashing          | leslie3d, sjeng, bwaves,<br>zeusmp   | Mix 2 – All Fitting           | xalancbmk, gobmk, wrf, gobmk,<br>hmmer, astar, gamess, hmmer                    |  |
| Mix 4 – 3 Fr. : 1 Fit          | omnetpp, bzip2, calculix, astar      |                               |                                                                                 |  |
| Mix 5 – 2 Fr. : 2 Fit          | bzip2, mcf, gobmk, gamess            |                               | omnetpp, bzip2, gobmk,                                                          |  |
| Mix 6 – 1 Fr. : 3 Fit          | omnetpp, xalancbmk, gamess,<br>wrf   | Mix3 – 4 Fr.: 2 Fit.: 2 Thr.  | gromacs, povray, h264ref, lbm,<br>libquantum                                    |  |
| Mix 7 = 3 Fr./Fit. : 1 Thr.    | mcf, peribench, hmmer, bwaves        |                               | mcf, gobmk, gromacs, hmmer,<br>gamess, tonto, libquantum, milc                  |  |
| Mix 8 = 3 Fr/Fit. : 2 Thr.     | xalancbmk, dealll, milc,<br>zeusmp   | Mix 4- 2 Fr.: 4 Fit. : 2 Thr. |                                                                                 |  |
| Mix 9 = 2 Fr. : 1 Fit : 1 Thr. | mcf, bzip2, astar, leslie3d          | Mix 5 – 2 Fr.: 2 Fit. : 4     | omnetpp, soplex, gobmk,                                                         |  |
| Mix 10 – 1 Fr. : 2 Fit: 1 Thr. | mcf, gobmk, gamess,<br>libquantum    | Thr.                          | gamess, libquantum, milc,<br>zeusmp, milc                                       |  |

#### Results



Throughput 4-cores - 19% LRU, 14% TADIP, 13% PIPP, 10% UCP



Throughput 8-cores - 20% LRU, 13% TADIP, 17% PIPP, 8% UCP



Fairness 4-cores - 17% LRU, 12% TADIP, 14% PIPP, 9% UCP



Fairness 8-cores - 15% LRU, 13% TADIP, 12% PIPP, 8% UCP

# Conclusions

- Using applications' Memory behavior is necessary in sharing cache pseudo-partitions
- Accurate monitors of memory behavior, interference sensitivity and average occupancy is necessary for precise control of partitions
- Careful management can provide most of benefits of isolated partitions while leading to scaling solutions

### **Acknowledgements**

This work is sponsored by the National Science Foundation under award 0702694 and CRI collaborative awards: 0751112, 0750847, 0750851, 0750852, 0750860, 0750868, 0750884, 0751091.