Execution strategies for memory-bound applications on NUMA Systems

Thesis submitted by Josefin Lenis for the degree of philosophae Doctor by the Universitat Autònoma de Barcelona, under the supervision of Dr. Miquel Angel Senar, developed at the Computer Architectures and Operating Systems department, PhD program in Computer Science (research line: High Performance Computing)

Barcelona, October 2018
Execution strategies for memory-bound applications on NUMA Systems

Thesis submitted by Josefina Lenis for the degree of philosophae Doctor by the Universitat Autònoma de Barcelona, under the supervision of Dr. Miquel Àngel Senar, developed at the Computer Architectures and Operating Systems department, PhD program in Computer Science (research line: High Performance Computing).

Supervisors

Dr. Miquel Àngel Senar

Phd Student

Josefina Lenis
Acknowledgements

First of all, I want to thank my advisor Miquel. Thanks to his support and guidance I discover a strength that I did not know that I have. His actions and advice marked me and became an example of fine academic work.

During my stay in the USA, I fell in love with a small city called Eugene in the state of Oregon. Where I still have very goods friends. The main responsible converting that time in an amazing experience was the Dr. Sameer Shende. To whom I will always be deeply thankful for his kindness and warm-hearted spirit. He invited me to work with his team, an incredibly smart and capable group of people, where I faced several challenges, and I’ve learned a lot. Thank you for the technical and moral support Sameer!

I want to thank each member of the Computer and Operating Systems Department. In particular, the people that stood by me in numerous occasions: Eduardo Cesar, Anna Sikora -thank you for always encouraging me to continue, your words and kind gestures help me more than I can express-, Remo Suppi, Lola Rexachs, Emilio Luque and Gemma Roque. My colleagues during some of this time Albert Gutierrez, Aprigio Bezerra, Francisco Cruz, Javi Navarro, Francisco Borges, Javier Panadero, Ferran Badosa, Hugo Meyer, Cecilia Jaramillo, Joe Carrión, Laura Espinola, and Jorge Villamayor.

A special mention for a close friend of mine in this Ph.D. adventure: Álex Chacón: I have no words to describe my gratitude to you. Your help was priceless. Thanks for your pieces of advice, your feedback, your extensive knowledge, for being there and for everything you have done for me. I truly admire you. I do not want to forget of my awesome friends Marcela Castro, Claudia Rosas, and Pili Gomez. Thanks for listening and thanks for reminding me what truly matters.

To my beloved husband Diego, my partner, my companion, my pillar and my rock. Thank you for supporting the long nights, for listening to the rehearsals of my presentations, for help me handle the stress, for taking care of me when I needed the most, I love you.
To my family, my dad and my mom, my examples of sacrifice, hard work, and enormous heart. To my unconditional siblings, nieces, and nephews, the Flanders family: Juan P., María, Emma, David, Caro, Charly, Juan F., Agus, Gonzalo, Facu, Guille y Román. I love you, and I miss you all.

Last but not least, I want to thanks all my friends that in one way or another helped this work to come through: Nuni, Fede, Andrés, Bet, Petra, Pedro, Prass, and Gero.
Abstract

Over the last several years, many sequence alignment tools have appeared and become popular thanks to the fast evolution of next-generation sequencing (NGS) technologies. Researchers that use such tools are interested in getting maximum performance when they execute them in modern infrastructures. Today’s NUMA (Non-Uniform Memory Access) architectures present significant challenges in getting such applications to achieve good scalability as more processors/cores are used. The memory system in NUMA systems shows a high complexity and may be the primary cause for the loss of an application’s performance. The existence of several memory banks in NUMA systems implies a logical increase in latency associated with the accesses of a given processor to a remote bank. This phenomenon is usually attenuated by the application of strategies that tend to increase the locality of memory accesses. However, NUMA systems may also suffer from contention problems that can occur when concurrent accesses are concentrated on a reduced number of banks. Sequence alignment tools use large data structures to contain reference genomes to which all reads are aligned. Therefore, these tools are very sensitive to performance problems related to the memory system. The main goal of this study is to explore the trade-offs between data locality and data dispersion in NUMA systems. We introduced a series of methodical steps to characterize NUMA architectures and to help understand the potential of the resources. With this information we designed and experimented with several popular sequence alignment tools on two widely available NUMA systems to assess the performance of different memory allocation policies and data partitioning and replication strategies. We find that there is not one method that is best in all cases. However, we conclude that memory interleaving is the memory allocation policy that provides the best performance for applications using large centralized data structured on a large number of processors and memory banks. In the case of data partitioning and replication, the best results are usually obtained when the number of partitions used is higher, and in some cases, combined with an interleaving policy.
Resumen

Durant els últims anys, moltes eines d’alineament de seqüències han aparegut i s’han popularitzat gràcies a la ràpida evolució de les tecnologies de Next Generation Sequencing (NGS). Evidentment, els investigadors que utilitzen aquestes eines estan interessats en obtenir el màxim rendiment quan les executen en infraestructures modernes. Actualment, les arquitectures NUMA (accés a la memòria no uniforme) presenten grans reptes en aconseguir que aquestes aplicacions tinguin una bona escalabilitat a mesura que s’utilitzen més processadors/nuclis. El sistema de memòria dels sistemes NUMA mostra una gran complexitat i pot ser la causa principal de la pèrdua del rendiment d’una aplicació. L’existència de diversos bancs de memòria en sistemes NUMA implica un augment lògic de la latència associada als accessos d’un processador donat a un banc remot. Aquest fenomen sol estar atenuat per l’aplicació d’estratègies que tendeixen a augmentar la localitat d’accés a la memòria. Tanmateix, els sistemes NUMA també poden patir problemes de contenció que es poden produir quan els accessos concurrents es concentren en un reduït nombre de bancs. Les eines d’alineació de seqüències utilitzen grans estructures de dades per contenir genomes de referència als quals totes les lectures estan alineades. Per tant, aquestes eines són molt sensibles als problems de rendiment relacionats amb el sistema de memòria. L’objectiu principal d’aquest estudi és explorar les compensacions entre la localitat de dades i la dispersió de dades en els sistemes NUMA. Hem introduït una sèrie de passos metòdics per caracteritzar arquitectures NUMA i per ajudar a comprendre el potencial dels recursos. Amb aquesta informació, hem dissenyat i experimentat diverses eines d’alineació de sequència populars en dos sistemes NUMA àmpliament disponibles per avaluar el rendiment de les diferents polítics d’assignació de memòria i les estratègies de partició i replicació de dades. Trobem que no hi ha un mètode que sigui millor en tots els casos. Tanmateix, es conclou que la intercalació de memòria és la política d’assignació de memòria que proporciona el millor rendiment quan s’utilitza una gran quantitat de processadors i bancs de memòria. En el cas de
la partició i la replicació de dades, els millors resultats solen obtenir-se quan la quantitat de particions que s’utilitza és més gran, de vegades combinada amb una política interleave.
Resumen

En los últimos años, muchas herramientas de alineadores de secuencias han aparecido y se han hecho populares por la rápida evolución de las tecnologías de secuenciación de próxima generación (NGS). Obviamente, los investigadores que usan tales herramientas están interesados en obtener el máximo rendimiento cuando los ejecutan en infraestructuras modernas. Las arquitecturas NUMA (acceso no uniforme a memoria) de hoy en día presentan grandes desafíos para lograr que dichas aplicaciones logren una buena escalabilidad a medida que se utilizan más procesadores/núcleos. El sistema de memoria en los sistemas NUMA muestra una alta complejidad y puede ser la causa principal de la pérdida del rendimiento de una aplicación. La existencia de varios bancos de memoria en sistemas NUMA implica un aumento lógico en la latencia asociada con los accesos de un procesador dado a un banco remoto. Este fenómeno generalmente se atenúa mediante la aplicación de estrategias que tienden a aumentar la localidad de los accesos a la memoria. Sin embargo, los sistemas NUMA también pueden sufrir problemas de contención que pueden ocurrir cuando los accesos concurrentes se concentran en un número reducido de bancos. Las herramientas de alineadores de secuencia usan estructuras de datos grandes para contener genomas de referencia a los que se alinean todas las lecturas. Por lo tanto, estas herramientas son muy sensibles a los problemas de rendimiento relacionados con el sistema de memoria. El objetivo principal de este estudio es explorar las ventajas y desventajas entre la ubicación de datos y la dispersión de datos en los sistemas NUMA. Hemos introducido una serie de pasos metódicos para caracterizar las arquitecturas NUMA y ayudar a comprender el potencial de los recursos. Con esta información, diseñamos y experimentamos con varias herramientas de alineación de secuencias populares, en dos sistemas NUMA ampliamente disponibles para evaluar el rendimiento de diferentes políticas de asignación de memoria y estrategias de replicación y partición de datos. Encontramos que no hay un método que sea el mejor en todos los casos. Sin embargo, concluimos que aplicar interleave a la memoria es la política de alocación
de memoria que proporciona el mejor rendimiento cuando se utiliza una gran cantidad de procesadores y bancos de memoria. En el caso de la partición y replicación de datos, los mejores resultados se obtienen generalmente cuando el número de particiones utilizadas es mayor, a veces combinado con una política de interleaving.
# Contents

1 Introduction 1  
1.1 Introduction ................................................. 2  
1.2 Contribution ..................................................... 3  
1.3 Related work ..................................................... 4  
1.4 Thesis Structure .................................................. 6  

2 HPC Systems 8  
2.1 Introduction ....................................................... 9  
2.2 Brief history of parallel architectures ............................... 9  
2.3 Parallel Architectures .......................................... 11  
  2.3.1 Shared-memory system .................................... 11  
  2.3.2 Distributed-memory systems ................................. 13  
  2.3.3 Hybrid systems ............................................. 13  
2.4 Study Case: NUMA Systems ..................................... 14  
  2.4.1 Our Infrastructure ...................................... 15  
2.5 Measuring performance .......................................... 16  
  2.5.1 Definitions .............................................. 16  
  2.5.2 Performance tools ...................................... 18  

3 Analysis Methodology 22  
3.1 Introduction ..................................................... 23  
3.2 Calculating Theoretical Bandwidth ............................... 24  
3.3 Measuring Latency .............................................. 25  
3.4 Measuring Bandwidth ............................................ 26  
3.5 Measuring Contention ......................................... 30  
3.6 Creating a guideline ......................................... 32  

4 Sequencing aligners 38
## 4.1 Introduction

## 4.2 Motivation

## 4.3 NGS Workflows

### 4.3.1 Sequencing

### 4.3.2 Alignment

### 4.3.3 Variant calling

### 4.3.4 Annotation

## 4.4 Sequence Aligners

### 4.4.1 Burrows Wheeler Aligner

### 4.4.2 Bowtie 2

### 4.4.3 Genome Multitool 3

### 4.4.4 Scalable Nucleotide Alignment Program

## 5 Experimentation

### 5.1 Performance of sequence aligners on NUMA systems

### 5.2 Allocation Strategies and Data Partitioning

#### 5.2.1 Analysis of memory allocation

#### 5.2.2 Data partitioning and replications strategies

### 5.3 Experimental Results

#### 5.3.1 Analysis of memory allocation policies

#### 5.3.2 Data partitioning and replication strategies

#### 5.3.3 Summary results

### 5.4 Annexed results

## 6 Conclusions and Future lines

### 6.1 Conclusions

### 6.2 Future Lines

## Bibliography
List of Figures

2.1 More’s Law vs Number of cores Source [1] ........................................ 10
2.2 A UMA architecture provides each CPU core (C1-C4) the same memory access latency and bandwidth. Source [2] ........................................ 12
2.3 A NUMA architecture connects different NUMA nodes (Nodes 1-4) - typically multicore CPUs (C1-C4) - via interconnect links (Link), to enable a single logically shared global memory. . Source [2] ................................. 12
2.4 In a distributed-memory model, the memory is physically and logically distributed among the individual processing units (P1-P3), e.g., CPUs. Accessing data from remote memory locations requires to initiate a data transfer protocol, such as point-to-point communication provided by the MPI. Figure source: [2] 13
2.5 In hierarchical systems the distributed- and shared-memory model are combined. In the depicted case three UMA-based shared-memory nodes are connected via a network connection, however, NUMA systems can be used in a similar manner. Figure source: [2] ........................................ 14
2.6 Schematic diagram of the AMD Opteron 6376 architecture (Abu-Dhabi) 15
2.7 Schematic diagram of the Intel Xeon E5 4620 architecture (Sandy Bridge) . 15
3.1 Latency profile of AMD Opteron 6376 using lmbench .......................... 26
3.2 All possible bandwidth values for one-processor execution on AMD system 28
3.3 All possible bandwidth values for one-processor execution on Intel system . 28
3.4 BW measures for all systems using 16 threads ............................... 30
3.5 BW measures for all systems using 32 threads ............................... 31
3.6 Concurrency stress test on AMD system ................................. 33
3.7 Concurrency stress test on Intel system ................................. 34
3.8 Methodology to obtain a guideline of NUMA execution strategies ...... 35
4.1 Standard variant calling workflow ........................................ 40
4.2 Single-nucleotide polymorphism ........................................ 44
# List of Tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1</td>
<td>Processor and memory information of the architectures used</td>
<td>24</td>
</tr>
<tr>
<td>3.2</td>
<td>STREAM’s vector kernels</td>
<td>27</td>
</tr>
<tr>
<td>5.1</td>
<td>Detailed information about the aligners</td>
<td>56</td>
</tr>
<tr>
<td>5.2</td>
<td>Instances created for each aligner</td>
<td>63</td>
</tr>
<tr>
<td>A3</td>
<td>Execution times for BOWTIE2 using different memory allocation policies</td>
<td>70</td>
</tr>
<tr>
<td>A4</td>
<td>Execution times for BOWTIE2 using data partitioning</td>
<td>71</td>
</tr>
<tr>
<td>B5</td>
<td>Execution times for BWA-MEM using different memory allocation policies</td>
<td>72</td>
</tr>
<tr>
<td>B6</td>
<td>Execution times for BWA-MEM using data partitioning</td>
<td>72</td>
</tr>
<tr>
<td>C7</td>
<td>Execution times for GEM using different memory allocation policies</td>
<td>73</td>
</tr>
<tr>
<td>C8</td>
<td>Execution times for GEM using data partitioning</td>
<td>73</td>
</tr>
<tr>
<td>D9</td>
<td>Execution times for SNAP using different memory allocation policies</td>
<td>74</td>
</tr>
<tr>
<td>D10</td>
<td>Execution times for SNAP using data partitioning</td>
<td>74</td>
</tr>
</tbody>
</table>
"Self-control means wanting to be effective at some random point in the infinite radiations of my spiritual existence."

- Franz Kafka
1.1 Introduction

NUMA (Non-Uniform Memory Access) systems have become increasingly common with the passage of time. Nowadays practically any HPC facility, research center or university has these systems in its infrastructure. Despite its complex hardware design, programs and applications do not have to be adapted to run on them (unlike other hybrid or semi-hybrid systems such as accelerators or GPUs). This feature is possibly one of its great advantages but at the same time its Achilles heel. Due that when applications are not aware of the hardware where they run on, they end up making sub-optimal use of resources. Writing parallel programs that exhibit good scalability is far from easy. Studying the physical characteristics of a NUMA architecture is key to understanding the impact that these can have on the performance of the applications that are run on. However, if programmers wrote extraordinarily optimal and architecture-dependent code, they would narrow the machines where such code could be executed on, so it is not a feasible solution either. Our initial hypothesis is that somewhere between these two points there is the proper balance: where the advantages surpass the drawbacks, by applying execution strategies to applications to optimize their execution time without losing flexibility.

In this research, we propose a benchmark suite to characterize NUMA architectures, through a set of well-known benchmarks. This information helps to see the limitations of our system for applications that intensively use the memory system, either in idealistic scenarios with purely sequential access patterns or unfavorable scenarios where accesses are entirely random. The acquired knowledge is translated into a guideline for final users of the NUMA system, to develop NUMA execution strategies for memory-bound applications. One central aspect of our proposal is that the user can take benefit of the architecture while improving the performance of the application used without modifying a single line of code.

To validate the NUMA execution strategies based on our methodology, we designed experimentation using real memory bound applications: genomic mappers. Most mappers take advantage of parallelization techniques to reduce the computational demands involved in the alignment of millions of reads onto a reference sequence. Numerous software tools have been developed in recent years, and many studies have evaluated their performances through several comparison criteria. In general, comparison studies have focused on mapper sensitivity and mapper accuracy, while computational and time requirements have received comparatively less attention.

Our proposed NUMA execution strategies mitigate two major drawbacks of these systems: the latency of remote data accesses and the concurrence of multiple threads contending for
a single shared resource. We executed the mappers using different techniques of memory allocation like interleave, but we also introduce a proposal based on multiple instances to reduce the intense usage of interconnection bus. As a result, we obtained improvements in the mappers performance of up to 5x. We also provide an easy to follow the guideline for the final user to understand the architecture been used and make the most of it.

1.2 Contribution

As a result of this study we published three papers:


In our first work, we analyzed the performance of BWA-ALN, (Burrows-Wheeler Aligner) [6], where we detected scalability problems exhibited by BWA-ALN, and we proposed simple system-level techniques to alleviate them. We obtained results up to a 4-fold speed higher than the original BWA-ALN multithread implementation.

In the second paper, we extended the study to other popular aligners from the literature. We analyzed performance problems of four aligners that constitute representative examples of the two most commonly used algorithmic strategies: hash tables and Burrow Wheeler Transform (BWT). The aligners under study were: BWA-MEM [7] (a newer version of BWA-ALN especially suited to dealing with longer reads), BOWTIE2 [8] (an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences), GEM (GEnome Multi-tool) [9] and SNAP (Scalable Nucleotide Alignment Program ) [10]. These aligners are widely used by the scientific community and real production centers, and frequently updated by developers. Although all the aligners under study take advantage of multithreading execution, they exhibit significant scalability limitations on NUMA systems. Data sharing between independent threads and irregular memory access patterns constitutes performance-limiting factors that affect the studied aligners. We have applied various memory allocation
policies as well as several data distribution strategies to these aligners, and we have obtained promising results in all cases, reducing memory-bound drawbacks and increasing scalability.

In the last paper, we also extend our previous findings by expanding our comparison study to two different NUMA systems, one based on Intel Xeon and the other one based on AMD Opteron, and by introducing a hybrid execution strategy that combines both data partitioning and memory allocation policies.

1.3 Related work

Since the emergence of the NUMA architectures in the 80s, there have been studies of the asymmetric accesses in these systems. Since then, NUMA systems have changed a lot compared to the old ones. And while many of the fundamental concepts that have been studied for decades remain, many others need to be revised in order not to fall into fallacies when it comes to predicting the performance of applications in these systems.

Both researchers and programmers alike, have helped to broaden the understanding of these systems and improve their use. Mainly through performance models, libraries, data mapping tools, and threads.

Between the most outstanding studies, three great categories can be described:

Threads and data mapping tools

Challenges in memory access on NUMA systems have been addressed by some approaches that tried to optimize locality at the OS level. The AutoNUMA patches for Linux [11] implement locality-driven optimization along two main heuristics. First, the threads migrate toward nodes holding the majority of the pages accessed by these threads. Second, the pages are periodically unmapped from a process address space and, upon the next page fault, migrated to the requesting node.

Carrefour [12] [13] is another recent tool that consists of a memory-placement algorithm for NUMA systems that focuses on traffic management. As in our approach, Carrefour focuses on memory congestion as the primary source of performance loss in current NUMA systems. It places memory to minimize congestion on interconnecting links for memory controllers. By using global information and memory-usage statistics, Carrefour applies three main techniques: memory collocation (to move memory to a different node so that accesses are likely local), replication (copying memory to several nodes so that threads from each node...
can access it locally) and interleaving (moving memory so that it is distributed evenly among all nodes).

AsymSched [14] is a dynamic thread and memory placement algorithm for Linux that takes into account the bandwidth asymmetry in NUMA systems. AsymSched is based on three techniques: thread migration, full memory migration, and dynamic memory migration.

In the three mentioned cases (AutoNUMA, Carrefour and AsymSched), memory management mechanisms are implemented in the Linux kernel and require a patch to be applied to the virtual memory layer. Our work, however, focuses on the evaluation of techniques that can be used at the application level and therefore don’t require root permissions to be applied to any NUMA system.

Application characterization tools

An interesting tool is Tabarnac [15], that not only identifies under-performance memory access patterns but also can provide with a solution of how to improve the source code of the program. Tabarnac is only used with benchmarks where a clear access pattern can be identified. However, when a complex combination of access pattern happens simultaneously, can’t offer a robust solution.

In [16] Majo et al. provides a series of guide-lines software-developer-oriented to achieve efficient use of NUMA systems. In this paper is remarked how the access pattern of a program has a significant impact on the overall performance.

Hardware characterization tools

In [17] the authors present a comprehensive study of 2 NUMA processors: Intel (Sandy Bridge-EP) and AMD (Bulldozer). They proposed a set of benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with processors. One of the most important contributions of this work is related to the specification of the cache-coherent protocols used by the different manufacturers.

NUMA Performance Models

A similar study to our work, Braithwaite et al. [18], who proposes an "Empirical Memory-Access Cost Model in Multicore NUMA ARchitecture”. Using STREAM benchmark [19][20] and LMBENCH [21], the author develops a benchmarking methodology, who through simple task-scheduling experiments to improves the performance of applications. Our study extends
this notion and also considers contention problems, not only memory-access cost due to latency or memory bandwidth. We also consider the importance of executing programs in instances as a possible outcome prediction of our model.

The performance model Roofline [22] is designed to assist in software development and optimization by providing detailed and accurate information about machine characteristics. The Roofline model is a visual performance model used to relate computational performance with the memory capacities of architectures. The Roofline model helps to find possible bottlenecks and to identify the performance boundaries for a given processor, however, does not provide any solution on how to solve such issues. Lorenzo et al. presented an extended version of the Roofline [23] to NUMA systems. In this work, the model was extended to show the dynamic evolution of the execution of a given code by successfully shown the impact of thread migration. Although this work is a useful tool to understand and characterize the behavior of the execution of parallel codes, only is tested with computational-bound applications.

**Mappers enhancer tools**

Genome alignment problems have been considered by Misale et al. [24]. The authors implement a framework to work under BOWTIE2 and BWA to improve the local affinity of the original algorithm. Herzeel et al. [25] replaces the pthread-based parallel loop in BWA with a Cilk for loop. Rewriting the parallel section using Cilk removes the load imbalance, resulting in a factor 2x performance improvement over the original BWA.

In both cases - Misale et al. and Herzeel et al. - the source code of the applications -aligners- are modified, which might be a costly action and dependent on the application version. Abuin et al. [26] presented a big data approach to solving BWA scalability problems. They introduce a tool named BigBWA that enables them to run BWA on several machines although it does not provide a clear strategy for dividing the data or setting the number of instances. In contrast, our approach can be applied to different aligners with minimal effort and, although not tested yet, it can be easily applied to distributed memory systems. Our work is complementary to all the works mentioned above.

**1.4 Thesis Structure**

The thesis is structured as follows: In Chapter 2 we explain the history of parallel systems briefly, then we describe the basic concepts of NUMA architectures and provide concrete
details of the two systems used in our experiments. In Chapter 3 we present our methodology to characterize the NUMA systems and obtain NUMA Performance Aspects. Chapter 4 introduces the problem of sequence alignment and behavioral characterization of aligners used in this study. In chapter 5 we describe in detail the NUMA execution strategies proposed to validate our methodology, we also explained all scenarios used to evaluate the performance improvement of aligners under study and analyzed the results obtained in our experiments. Finally, in Chapter 6 there is the conclusions and the future lines of work.
2

HPC Systems

"In the fight between you and the world, second the world."

- Franz Kafka
2.1 Introduction

This chapter briefly reviews some fundamental concepts related to high-performance computing. In particular, we present an introduction to the essential elements that characterize the architectures of current computers. The description starts with the analysis of the dominant architecture in current microprocessors and finishes by briefly reviewing the computer architectures that are commonly used to solve computationally complex problems. It sets the base to discuss NUMA architectures advantages and drawbacks in depth. Finally, a brief review of all the HPC performance tools used and evaluated during this study.

2.2 Brief history of parallel architectures

In 1965 Gordon Moore predicted that the number of transistors in an integrated circuit followed a progression in which that value doubled every 18 months. This prediction, known as Moore’s Law, has been fulfilled since then and continues in force. Translated regarding performance, this prediction has meant a constant increase in the computing capacity of successive generations of microprocessors thanks to the rise in the frequency of its clock and the integration of architectural solutions such as the parallelism at the instruction level that decreased the latency of the memory system. In this way, sequential programs could be executed more quickly, and this gain could be achieved transparently for the programmer. The progression in the improvement of the performance of the computers was altered at the beginning of the last decade by the physical limitations that derived from the problems of heat dissipation and consumption exhibited by the microprocessors. As of 2004, these limitations stopped the increase in the frequency of the processors and generated a technological solution that focused on the introduction of processors that instead of having a single processing unit (control unit plus logical arithmetic unit) by chip, they had several of them grouped in an entity called core.

In the [Fig. 2.1] the evolution of the integration levels of the processors, their frequencies and the number of cores is shown. Evolution in time of the number of transistors per chip (blue), the frequency (red) and the number of cores per processor (purple).

The architecture of a multicore processor is based on the existence of two or more execution cores within a single processor. The operating system perceives each of its execution core as a discrete logic processor with all its associated execution resources. Multicore processors perform more work per cycle, can operate at a lower frequency, and present...
improvements in the performance of calculation activities and bandwidth intensive.

The emergence of multicore processors has meant, however, a paradigm shift from the point of view of improving the execution of applications. This improvement is no longer achieved automatically by the effects of increases in the clock frequency but must be obtained by conveniently exploiting the inherent parallelism offered by the new processors in the form of multiple cores integrated into a single device.

However, unlike single-core processors, much of the responsibility for better performance now falls on the shoulders of programmers. The era where programmers relied on hardware designers to make their programs faster has ended [27]. The performance improvements in the new generations of computers will depend on the changes that are introduced in the applications and the system tools so that the computing capacity provided by the multicore systems is exploited.

Currently, Moore’s law is fulfilled at the level of cores in addition to the level of transistors. The increase in the average of cores is shown, since the appearance of the multicore architectures in 2004, in Supercomputers (infrastructures composed of thousands of multicore processors). The increase in cores has doubled every two years since the emergence of multicore processors in 2004 [28].

The limits that separate high-performance computing from ordinary computing are somewhat arbitrary since HPC refers merely to calculations that are made on computers more powerful than the standards [1]. With 500 u$s today you can buy a laptop that has more computing capacity, more main memory and a hard drive with more space than a computer of
2.3 Parallel Architectures

The parallel architectures are defined under the MIMD (Multiple Instruction Multiple Data) paradigm, where different instruction flows are applied to several data streams. In this section, we will review the three broad categories that parallel architectures usually are classified into:

- Shared-memory systems
- Distributed-memory systems
- Hierarchical (hybrid) systems

2.3.1 Shared-memory system

In shared memory systems, all CPUs (one or more) share the one physical memory address space. Within this category there are two important set of shared-memory systems that have different performance and different complexity:

- Uniform Memory Access (UMA): where latency and bandwidth are the same for all processors and all memory locations.
- Non-Uniform Memory Access (NUMA) in this case the memory is physically distributed through several processors but logically shared, so the memory addresses are global. Memory access times vary depending on the processor and the memory location that is intended to access.

For this paradigm the most popular programming libraries include OpenMP[29], Cilk[30] and Pthreads[31].

UMA

The most simple layout of the parallel system is the UMA. Processors access memory through the shared bus as seen in Figure 2.2. As stated before, all CPUs have the same latency and bandwidth when accessing memory. That is why are known as Symmetric multiprocessor (SMP) where systems use this centralized memory approach, where each processor is connected to a shared bus. This shared bus handles all accesses to main memory.
and I/O. Communication between CPUs is implicit and transparent. All these characteristics make it a very easy to program. Converting a serial code into a parallel code on UMA architecture requires a minimal effort. However, this symmetric pattern does not scale properly; there is a limited number of CPUs that can be added to a UMA architecture without turning the shared bus into a bottleneck. In order to maintain scalability of memory bandwidth with CPU number, non-blocking crossbar switches can be built that establishes point-to-point connections between sockets and memory module but at the cost of giving up the UMA principle [32].

**Figure 2.2:** A UMA architecture provides each CPU core (C1-C4) the same memory access latency and bandwidth. Source [2]

**Figure 2.3:** A NUMA architecture connects different NUMA nodes (Nodes 1-4) - typically multicore CPUs (C1-C4) - via interconnect links (Link), to enable a single logically shared global memory. Source [2]

**NUMA**

In NUMA systems, the main memory is physically distributed across the processors but, logically, this set of main memories appears as only one large memory, so the accesses to different parts are done using global memory addresses [1]. A processor and its respective memory are called a NUMA node. A program running on a particular processor can also access data stored in memory banks associated with other processors coherently but at the cost of increased latency compared to accessing its local memory bank. In general, parallel applications that may run using multiple processors are not usually designed taking the NUMA architecture into account, mainly because is still a shared-memory system and processors do not explicitly communicate with each other so communication protocols are hidden within...
In a distributed-memory model, the memory is physically and logically distributed among the individual processing units (P1-P3), e.g., CPUs. Accessing data from remote memory locations requires to initiate a data transfer protocol, such as point-to-point communication provided by the MPI. However in NUMA architectures that can have a high cost in performance if the data is not allocated optimally as too many remote access may occur.

### 2.3.2 Distributed-memory systems

In parallel distributed-memory computers, processing units do not share memory, but each has its own memory address space as shown in Figure 2.4. This model originates from a time where a CPU contained a single processing core. Such a setup is typically not found anymore in today’s cluster systems due to the advent of multicore CPUs. However, the model still serves well as an introduction to distributed computing, as the concepts are still applicable. When a parallel program is executed, the compute nodes exchange information between them by passing messages that are transferred over interconnection networks (Gigabit Ethernet, Infiniband, etc.). The dominant programming standard in this type of systems is MPI, although there are other alternatives such as co-Array Fortran, GASnet, Charm++ among others. In the distributed-memory system, there are no longer race conditions to access main memory. However, the communication between processes becomes explicit and need to be handled entirely by the programmer. Thus significantly increasing the programming complexity.

### 2.3.3 Hybrid systems

Nowadays, no system is purely shared-memory or strict distributed-memory, most configurations are the combination of both. Commonly composed by a layout of NUMA architectures...
connected over a fast network interface, as shown in Figure 2.5. The programming approach is also hybrid: using MPI for the inter-node communication and OpenMP for intra-node executions. The concept behind hybrid systems is broader than just multicore computers connected via a network; it is also used to name any system layout that mixes available programming paradigms on different hardware layers. An example could be a cluster built from nodes that contains, besides the “usual” multicore processors, additional accelerator hardware, ranging from application-specific add-on cards to GPUs (graphics processing units), FPGAs (field-programmable gate arrays), ASICs (application specific integrated circuits), co-processors, etc.

### 2.4 Study Case: NUMA Systems

NUMA arrived as a solution to bottlenecks produced by the intensive access to the single memory bus present on Symmetric Multi-Processor (SMP) systems. To profit from the scalability provided by NUMA systems, applications need to be aware of the hardware they are running on and the memory allocation policies applied by the operating system. Memory pattern accesses that imply memory transfers from non-local banks will negatively impact the overall execution time due to the distance penalty of remote memory banks. A second issue that can also increase execution time is contention; applications employing large numbers of threads and storing all data in a single bank, generate a race between threads, congesting the connection links and downgrading access times -local and remote- to memory.

A processor and its respective memory bank are called NUMA domain or NUMA node,
and we will use these terms indistinctly along this thesis. Another clarification that needs to be done respect naming it is related to which part of the NUMA node we are referring. For example, if we say ”threads are bound to NUMA node 0” we mean the processor, but if the phrase changes to ”data are placed on NUMA node 0” we mean the memory bank.

2.4.1 Our Infrastructure

As an example of NUMA systems, we can see the following two Figures (Figure 2.6 and Figure 2.7) that represent the two architectures we employed in this study, one manufactured by AMD and the other by Intel.

The first system, shown in Figure 2.6, is a four-socket AMD Opteron Processor 6376, with each socket containing two dies packaged onto a common substrate referred to as a Multi-Chip Module (MCM). Each die (processor) consists of 8 physical cores that share a 6 MB Last Level Cache (LLC) and a memory bank. Only one thread can be assigned to one core and, therefore, up to 64 threads can be executed simultaneously. The system has 128GB of memory, divided into eight modules of 16GB DDR3 1600 MHz each. The second architecture (Figure 2.7) is an Intel Xeon CPU E5 4620 -also four-socket. Each socket contains an eight core-processor and a 16 MB LLC. The total number of cores is 32. 64 threads can be executed simultaneously using HyperThread technology. This system also has a memory of 128 GB, but it is divided into four modules of 32GB DDR3. 1600MHz each.

Nodes are connected by links - HyperTransport (AMD) and QuickPath Interconnect (Intel). Memory bandwidth for both cases depend on the clock, the width of the interconnection links
(number of bits that can be sent in parallel in a single transfer), the data rate (single or double), and the actual bandwidth between the memory controller and the DDR3 memory modules [18].

**Allocation Policy**

Linux operating system uses Node Local allocation as the default allocation policy when the system is up and running. Node Local allocation means that when a program is started on a CPU, the data requested by that program will be allocated to a memory bank corresponding to its local CPU.

Specifying memory policies for a process does not cause any memory allocation [37]. Allocation policy takes effect only when a process first requests a page. It is known as the first-touch policy, which refers to the fact that a page is allocated based on the effective allocation policy when a process first uses a page in some fashion. Despite the default Linux policy, a programmer can set an allocation policy for its program utilizing a component of NUMA API [38] called libnuma. This user space shared library can be linked to applications and provides explicit control of allocation policies to user programs. The NUMA execution environment for a process can also be set up by using the `numactl` tool [38]. `Numactl` can be used to control process mapping to cpuset and restrict memory allocation to specific nodes without altering the program’s source code.

A popular memory allocation policy is interleaving where memory pages are allocated between selected nodes in a round-robin fashion. The idea behind it is that all used node all accessed equally as opposed to all the traffic being concentrated on one node.

### 2.5 Measuring performance

In this section, in the first place we will enunciate the definitions of the metrics used during this work, and later we will describe all the performance monitoring tools used and/or test for this work.

#### 2.5.1 Definitions

When we think about computer performance, we think of the effectiveness that a computer system completes a task. Depending on the context, 'effectiveness' may involve one or more of the following concepts:
• Short time for a given computational work
• High throughput (rate of processing work)
• Energetic efficiency of computing resource(s)
• High availability of the computing system or application
• Fast (or highly compact) data compression and decompression
• Short data transmission time

In HPC time is the most critical criteria to define a system performance. The computer that performs the same amount of work in the shortest time is the fastest.\textsuperscript{27} The execution time of a program is measured in seconds. In order to improve performance for a given application, it is necessary to monitor, analyze and tune the critical elements involved in the execution.

**Execution Time**

It is the measure that shows how long it takes to execute a process. It is defined as the interval between the start of the execution and the end of it. For parallel executions, the end of the execution is given when the last thread or parallel process that composes the application ends.

**SpeedUp**

It is a metric that characterizes the gain of running a program in parallel. Let $T(n,1)$ be the run-time of the fastest known sequential algorithm and let $T(n,p)$ be the run-time of the parallel algorithm executed on $p$ processors, where $n$ is the size of the input.

$$Speedup(n) = \frac{T(n,1)}{T(n,p)} \quad (2.1)$$

**Parallel efficiency**

It is obtained from the division between the Speedup and the number $n$ of computing elements with which the application is executed (see equation \textsuperscript{2.2}). The value of the efficiency can vary between 0 and 1. Being one (1) an ideal value that represents that the application is entirely parallel.

$$Parallel\text{ }efficiency = \frac{Speedup(n)}{n} \quad (2.2)$$
### 2.5.2 Performance tools

In this section, we will describe the tools used to measure the behavior of benchmarks and the applications tested as validation that later we will be discussed in Chapters 3. Two important features present in performance tools are: profiling and tracing. As stated Sameer Shende in [39].

“To understand the behavior of the parallel program, we must first make its behavior observable. To do this, we regard the execution of a program as a sequence of actions, each representing some significant activity such as the entry into a routine, the execution of a line of code, a message communication or a barrier synchronization. Performance analysis based on profiling and tracing involves three phases:

- instrumentation or modification of the program to generate performance data,
- measurement of interesting aspects of execution which generates the performance data and
- analysis of the performance data.

In the following subsections, we will provide a brief description of the hardware counters used and the tools we rely on to do our research.

**Hardware Counters**

Hardware counters are extra logic added to the CPU that track low-level operations or events that happen within the processor [40]. Traditional hardware performance counters measure only values on a single core. A chip package has many resources which are package-wide and thus need a separate performance reporting mechanism. The shared socket-wide values are called Offcore events are per-core values on their way to the uncore [41].

For all the experiments performed in this work, we measured the basic hardware given by perf tool [42]. These are:

- **context-switches**: when a task needs to be changed for while ensuring that the tasks do not conflict. A high number of contest switches could impact negatively the performance of an application.

- **CPU-migrations**: when a task is moved from one computing environment to another.
• **page-faults**: when a running program accesses a memory page that is not currently mapped by the memory management unit.

• **cycles**: usually, the time required for the execution of one simple processor operation such as an addition; this time is normally the reciprocal of the clock rate.

• **stalled-cycles-frontend**: measures all the cycles that are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions.

• **stalled-cycles-backend**: measures all the cycles that are a waste because Front-End does not feed the Back End with micro-operations. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually expresses this behavior.

• **instructions**: the number of instructions executed.

• **branches**: measures how many time the branch predictor tried to guess the correct branch.

• **branch-misses**: measures how many time the branch predictor failed to guess the correct branch.

• **LLC-loads-misses**: when the processor needs to fetch data that does not exist in the last level cache, so it brings it from memory.

• **cache-misses**: this event represents the number of memory access that could not be served by any of the caches.

• **seconds time elapsed**: entirely execution time.

**PAPI**

PAPI provides two interfaces to the underlying counter hardware; a simple, high level interface for the acquisition of simple measurements and a fully programmable, low-level interface directed towards users with more sophisticated needs. The low-level PAPI interface deals with hardware events in groups called EventSets. EventSets reflect how the counters are most frequently used, such as taking simultaneous measurements of different hardware events and relating them to one another.
TAU

TAU \cite{tau} is a performance evaluation tool that supports parallel profiling and tracing. TAU has become the standard within the HPC world, not only because of the quality of its results but also, for the low overhead, plus it is regularly updated. When profiling a program, shows you how much (total) time was spent in each routine and tracing shows you when the events take place in each process along a timeline. TAU can Profiling, and tracing can measure time as well as hardware performance counters from your CPU. TAU can extract

- It supports C++, C, UPC, Fortran, Python, and Java
- TAU runs on all HPC platforms, and it is free (BSD style license)
- TAU has instrumentation, measurement and analysis tools
- To use TAU, you need to set a couple of environment variables and substitute the name of the compiler with a TAU shell script

As a result of this research, TAU added a NUMA feature to measure remote/local main memory access ratio for a given parallel application.

Likwid

Likwid (Like I Knew What I’m Doing) \cite{likwid} is a tool suite to measure the different aspects that affect the performance of an application. It works on Intel and AMD processors on Linux systems. It offers several features, like measure performance counter, obtain architecture topology, core-binding, etc. Likwid is oriented to performance programmers and provides a less complicated interface. Using the tool suite on command line applications is simple. Although Likwid is mainly focused on x86 processors some of the tools are also portable, and not limited to, any specific architecture.
"Test yourself on mankind. It is something that makes the doubter doubt, the believer believe. "

- Franz Kafka
3.1 Introduction

The properties of the main memory system of an HPC architecture can have a drastic impact on the performance and efficiency of a program’s execution. In order to provide guidance to a user or programmer, we have created an empirical methodology. To characterize a NUMA architecture, we rely on four main features, that we call *NUMA performance aspects*:

1. Topology and the native features of the architecture
2. Latency
3. Bandwidth
4. Contention degradation

The first indicator, *Topology and the native features of the architecture* refers to information related to the hardware itself and it is provided by the manufacturer, such as the number of processors, cores and memory banks. From this information, we calculate the theoretical maximum memory bandwidth and determine the restrictions due to the interconnection layout. The following three indicators refer to the time elapsed since data is requested until it is retrieved by a CPU, *latency*; the amount of data delivered per time over a physical link, *bandwidth*; and the factor on how a multithread race condition affects the performance of a parallel application, *contention degradation*.

Taking into account the concepts described above, we propose a methodology consisting of a series of steps to obtain the critical information that allows us to characterize the architecture where the parallel programs will be executed. The result of carrying out these steps is to capture what we will now define as *NUMA performance aspects*.

This information is valuable, mainly at the user level, since specific parameters can be tuned when executing the programs -such as the data allocation policy or the creation of independent instances- to obtain better performance from the hierarchy of memory. But it could also be potentially used by a programmer because through the libnuma API tuning parameters can be applied at the code level. Without getting to the point of programming an algorithm that is architecture dependent, but considering *NUMA performance aspects* to better benefit from the hardware underlying.

Through greater detail, we can know precisely the difference between what the manufacturer claims and the reality. Technical specifications are not always met. Being aware of the real potential of our system, help us know where we are standing before executing a parallel...
application. In this chapter, we describe how to obtain each of the indicators and understand how they characterize the architectures.

### 3.2 Calculating Theoretical Bandwidth

Table 3.1 was built with the respective manufacturer’s specification [45] [46]

<table>
<thead>
<tr>
<th></th>
<th>Intel E5-4620</th>
<th>AMD OPTERON 6376</th>
</tr>
</thead>
<tbody>
<tr>
<td># of Cores</td>
<td>8</td>
<td>16</td>
</tr>
<tr>
<td># of Threads</td>
<td>16</td>
<td>16</td>
</tr>
<tr>
<td>Processor Base Frequency</td>
<td>2.20 GHz</td>
<td>2.30 GHz</td>
</tr>
<tr>
<td>Cache</td>
<td>16 MB SmartCache</td>
<td>2 x 8 MB (16 MB)</td>
</tr>
<tr>
<td>Bus Speed</td>
<td>7.2 GT/s QPI</td>
<td>5.2 GT/s HT</td>
</tr>
<tr>
<td># of Links (QPI or HT)</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>Memory Types</td>
<td>DDR3 800/1066/1333</td>
<td>DDR3 1600</td>
</tr>
<tr>
<td>Max # of Memory Channels</td>
<td>4</td>
<td>4</td>
</tr>
</tbody>
</table>

As a first step, we need to calculate the maximum theoretical bandwidth to identify the upper limit of our system. It is worth noting that these values are rarely reached. The calculation of this value assumes that the best scenario possible is reachable, that is to say, the program is compiled optimally, the prefetchers can anticipate most data request, the access pattern can be predicted, and so on.

\[
Max\ BW = \frac{Memory\ Frequency \times \#\ of\ Channels \times Word\ Size}{Byte}
\]

(3.1)

Using the information in Table 3.1 and expression 3.1, we obtain the following maximum theoretical bandwidth for the architectures described above:

- Intel: 1333 x 4 x 64/8 = 42.6 GB/s
- AMD: 1600 x 4 x 64/8 = 51.2 GB/s

These results reflect the value of the local memory access. Depending on the number of sockets of the system, we could theoretically increase this value. In our case, both architectures have four devices. So potentially a maximum memory bandwidth (Max BW) of 170.4 GB/s
(Intel) and 204.8 GB/s (AMD) if four sockets are accessing data allocated in their respective memory banks independently. Thus resulting in an unrealistic scenario, because it is highly unlikely to avoid out-of-socket-accesses that are restricted to the interconnection bus speed. It is possible to calculate interconnection bus bandwidth from Table 3.1 and considering that one hop accesses are performed using 16-bit links. The resulting values of this calculation are:

- Intel: $7.2 \times 2 \times 16/8 = 28.8$ GB/s
- AMD: $5.2 \times 2 \times 16/8 = 20.6$ GB/s

Another relevant fact to consider when calculating the theoretical values of memory bandwidths is that they are based on the assumption that all channels are occupied at the same time. For example, in the case of AMD, there are four channels. To strive to achieve the theoretical bandwidth there should be a physical memory module in each DIMM slot of the channel. A user can retrieve the information of how many channels are being used, using Linux command `dmidecode –type memory`.

Calculate the bandwidth is a good practice to know the optimal capacity of the system to use. Although it should be a more widespread practice among HPC system users, it is understandable that it entails some complexity.

### 3.3 Measuring Latency

We used the Lmbench [21] to measure latency, specifically the lat_mem_rd module. This test consists of measure memory read latency changing the memory sizes and strides. The entire memory hierarchy is measured, including onboard cache latency, LLC cache latency and main memory latency. The module lat_mem_rd uses a double nested loop. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point backward one stride. And then it goes through each element of the array in turn by doing

$$p = (char**) * p$$

The size of the array starts in 512 bytes up to final size provided as parameter. We used as final size 2 GB, large enough to test the main memory. For the small sizes, the cache will have an effect, and the loads will be much faster as the array increases its size and stride the latency will reflect that.
Latency

In Figure 3.1 we can see that the processor has a 16 KB L1 Data cache memory, 2 MB L2 and 8 MB L3. The nonexistence sharp edges in the transition from one level to the other may be attributed to cache utilization from other parts of the system, thus we assume the cache is not exclusively available for the program data. Specifically the L2, L3 cache is a unified cache and also used for the instructions. [47]

3.4 Measuring Bandwidth

To measure the memory bandwidth we rely on the widely-used STREAM benchmark [19] [20]. It measures sustainable memory bandwidth (in MB/s) and associates it with the corresponding computation rate for four simple vector kernels:

COPY, is the simplest and fastest test of the suite, as it does not involve any arithmetic operation. It merely brings two values from memory a(i) and b(i) and performs an update operation.
SCALE is a kernel modification of COPY by adding a simple arithmetic multiplication operation. It consists of obtaining two values of the memory, a(i) and b(i), and multiplies b(i) by a scalar before writing it in a(i). Although it is a simple scalar operation, more complex processes can be created from it. The performance of this test may be used as an indicator of the performance of more complex comparison operations.

In SUM, a third operand is added. As it fetches three values, it can potentially fill a processor pipeline for larger arrays. Thus one can evaluate the memory bandwidth by filling the processor pipeline or measure the final performance when the pipeline is full. This benchmark aims to approximate the actual behavior of a real application.

Finally, TRIAD allows chained, overlapped or fused, multiple-add operations. It builds on the SUM benchmark by adding an arithmetic operation to one of the fetched array values. For example, fused multiple-add operations (FMA) are essential operations in many basic computations, such as dot products, matrix multiplication, and polynomial evaluations. This benchmark can be directly associated with application performance. The FMA operation has its own instruction set now and is usually done in hardware. Consequently, feeding such hardware operations with data can be extremely important – hence, the usefulness of the TRIAD memory bandwidth benchmark [48].

NUMA system has a performance sensitivity to the placement of threads and data. Changing these parameters can give very different outcomes, however exploring all possibilities can be unfeasible.

The AMD system has 8 NUMA domains; this means that if an 8-threads-program is executed (using one processor to bind all threads and one memory bank to host all data) we have 64 possibilities of parameters of execution.

We calculate the possibilities by considering all the variations allowing repetition and taking into account the order \( V'(8) = 8^2 = 64 \). Unfortunately, as the number of threads involved increases, also it does the possible combinations. In a 16 threads scenario we would need 2 processors, in this case we do not allow repetitions for the processors, however the

<table>
<thead>
<tr>
<th>name</th>
<th>kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td>COPY</td>
<td>a(i) = b(i)</td>
</tr>
<tr>
<td>SCALE</td>
<td>a(i) = q*b(i)</td>
</tr>
<tr>
<td>SUM</td>
<td>a(i) = b(i) + c(i)</td>
</tr>
<tr>
<td>TRIAD</td>
<td>a(i) = b(i) + q*c(i)</td>
</tr>
</tbody>
</table>

Table 3.2: STREAM's vector kernels
combinations increase to 224,
\[ V_k(n) = \frac{n!}{k!(n-k)!} = \frac{8!}{2!(8-2)!} = 28 \quad \forall \text{NUMA domains } \in (0,7) \]

and employing the same expression we can see that for 32 threads the number reaches 560 combinations.

Running a syntactic program, such as STREAM, gives us the opportunity to test most combinations and to profile the memory behavior without the need of executing higher time-consuming applications.

Algorithm 1 Execution of STREAM with 16 threads using numactl API (AMD)

**Data:** NUMA.domains = 0, 1, 2, 3, 4, 5, 6, 7

**for** processor1 in NUMA.domains **do**

**for** processor0 in NUMA.domains **do**

if $processor0 < processor1$ **then**

**for** memory_bank in (1, 2, 3, 4, 5, 6, 7) **do**

numactl —cpunodebind=$processor0, $processor1 —interleave=0, $node

.end

.end

.end

.end
The profiling process consists of testing all possible combinations for the 8, 16 and 32 threads. In the figures below, pointed out by the dark green diagonal, we can appreciate how the maximum bandwidth is reached when the data is fetched from the local memory banks.

In Figure [3.2](AMD) we have 8 NUMA domains, thus resulting in an 8x8 matrix. Light green areas surrounding the diagonal represent that accesses to processors on the same chip have an acceptable memory bandwidth. The matrix reflecting AMD system is not symmetric, as explained in Chapter [2] the interconnection bus has different bit wide links. For the Intel architecture, (Figure [3.2]) we observe a 4x4 matrix, and in this case, there is a remarkable difference between local allocation and remote nodes.

In AMD, memory bandwidth drops to half when data is changed from local to the on-chip neighbour (1 HOP), from 11000 MB/s to 5700 MB/s and it is reduced by a factor of 5.25x when the farthest distance-combination is employed. In Intel, on the other hand, the memory distance-penalty is even higher. When the data is local, the bandwidth is 16700 MB/s drops 5.5x when it is 1-Hop distance, and 6.4x when is 2-HOP distance.

For the following measures of bandwidth using 16 and 32 threads, we also considered the allocation policy. Given these scenarios, we could no longer guarantee that the accesses were 100% local. As remote accesses are unavoidable at this point, we evaluated the possibility of balancing the memory accesses by allocating the data onto two memory banks. Thus, the data is interleaved between bank 0 and the rest of banks. The algorithm [1] shows the steps followed to proceed with the benchmark of all the combinations.

It is clear that many of the combinations explored are sub-optimal and taking them into account when executing a program might not be the most efficient solution. However, at this phase of the methodology, the goal is to identify and report the sensitivity of the memory system to the allocation policies. The diagonal marked by local accesses is represented again in Figure [3.4a]. Now it is expressed by two cells, side by side. When the number of NUMA nodes matches the memory bank, the memory throughput reaches the maximum. A situation that on the Intel architecture is not similar (as shown in Figure [3.4c]). Our impression relies on the observation that independently of the combination set tested, there is always remote access happening, i.e., if the remote access is 1-HOP distance, then the cells are greener, but even then the difference of bandwidth is not significant.

When interleave policy is used, it can be seen the increment of the global memory throughput of the system; this can be observed graphically as the images get greener on both cases (Figures [3.4d] and [3.4b]). The Figures mentioned above show two straight yellow lines when using nodes 2-3 and 4-5, this can be related to the fact that the links between the nodes
Figure 3.4: BW measures for all systems using 16 threads

are not bidirectional. Thus, the access relationship between a processor and a memory bank is not bijective, i.e., accesses from the processor \( x \) to the bank \( y \) are not always equivalent to accessing the processor \( y \) and bank \( x \).

In the last scenario, we observed that as the number of threads increases, the bandwidth becomes more restricted by inter-node accesses. But as in the previous case, the global bandwidth of the system increases when the data is distributed using interleave instead of concentrating all the accesses to a single bank.

### 3.5 Measuring Contention

To measure contention, we used a benchmark developed by James Brock and modified by Andreas Klöckner \[49\]. We introduced some changes to enable certain parameters at
(a) BW values for 32 threads on AMD system with membind policy
(b) BW values for 32 threads on AMD system with interleave policy
(c) BW values for 32 threads on Intel system with membind policy
(d) BW values for 32 threads on Intel system with interleave policy

**Figure 3.5:** BW measures for all systems using 32 threads

execution time, which we will detail below. This benchmark, unlike STREAM, consists in generating random accesses to an array of memory, trying to trick the hardware prefetch and obtain a latency close to the real one. The random accesses are produced by multiplying the iterator by a prime number which is comfortably larger than the cache line. The following expression ensures that each byte is incremented exactly once.

\[
* ((char*)x) + ((j * 1009) \% N)) + = 1
\]  
(3.3)

This benchmark automatically creates the maximum number of threads possible for a given system. Each thread is pinned to its matching core. It allocates the array on node 0, and then each thread traverses the array using the expression 3.3 in three different modes: sequential (one thread at the time), two simultaneously, and all at once. As each thread ends
up accessing the array, the bandwidth is calculated from all possible positions towards the memory bank 0 with expression \[3.4\].

\[
\text{bandwidth}^* = \frac{\text{array size} \times \text{ntrips} \times \text{cache line size}}{\text{time}}
\]  

(3.4)

Our modifications to the benchmark consist of passing as a parameter the number of memory bank where the array is allocated, plus adapting the pinning process of the threads to the cores for both available architectures (AMD and INTEL) as the number of cores changes depending on the manufacturer and the processor used.

With these changes, we can now calculate the bandwidth from any core to any memory bank and observe the effect on the final bandwidth of the intensive use of interconnection buses as the number of threads increases.

These tests show how they affect the fact that many threads simultaneously try to access the same memory bank. For this testbed, we used the cases of 8, 16, 32, 48 and 64 threads. The behavior concurs with our explanation in the previous section. For AMD Figure 3.6 with eight threads (blue line) you can see how the maximum bandwidth is achieved when data is entirely local, this only can be accomplished with eight threads. In the case of 16 threads (orange line) we will have remote accesses, thus the bandwidth, even in the best situation, is 40\% worse per thread. This behavior can be extrapolated by increasing the number of threads, obtaining each time a lower bandwidth memory per thread. For 64 threads the response is practically constant, no matter where the data is stored, it will always consume the maximum access time (unless data are spread system-wide). For Intel, Figure 3.7 the behavior is quite similar, with the particularity that the step change between the usage of only local node and combination of local and remote is more clearly exposed for eight threads (blue line).

### 3.6 Creating a guideline

In this chapter, we described our steps to characterize two common commercial NUMA architectures. These steps can be extrapolated characterize other NUMA systems. The process is based on the use of well-known benchmarks such as STREAM and lmbench, and on the manufacturer’s technical specification. The steps are easy to implement. Thus an end user can get the following NUMA performance aspects from the system quickly:

- First, the theoretical ceiling. Know what is the maximum value of memory bandwidth that we can obtain. At this point, we also highlight which are the relevant data
in the technical specification that allow us to calculate such bandwidth.

- The next step is to know the latency through lmbench. The memory access latency is complementary information that outlines the memory hierarchy and quantifies the cost of not using the cache system correctly.

- The third step is to know the parallel bandwidth. Here is where the user can recognize what a NUMA architecture implies. Using STREAM, with the presented execution algorithm, we make a sweep of all possible combinations of data placement and threads, getting a variety of bandwidth. As the number of threads increases it does also the number of possible place positions, creating subsets that are optimal for execution and need to be taken into account when running a parallel program. Additionally, the implementation of interleaving policy can have a substantial impact in it.

- In the last stage, the effect of contention is analyzed. It is perceived a trade-off game between the number of threads and total system bandwidth, i.e., how the performance is degraded by every single thread but increases throughput at the
Figure 3.7: Concurrency stress test on Intel system

system level to a certain limit.

All these steps were manually implemented through scripts that can be accessed from our GitHub repository:

https://github.com/jlenis/NES

Unlike accelerators of GPUs, there is no need to introduce modifications into the code to run a program on NUMA architectures. Nevertheless, the user must be aware that this is a hybrid system. We do not advocate for the notion that a programmer should write code for a given architecture in mind. However, all the complexity that the programmer is saving in the program will be later transferred to the final user as execution time penalty. To solve this, we propose a guideline at the user level, acquiring the hardware insights from the **NUMA performance counter** plus the additional knowledge from the user about the target application. At this point, we have covered the first points of the methodology and we proceed to the step (4) as depicted in Figure 3.8

For the NUMA systems characterized, these are the main insights:
• Both systems are susceptible to memory allocation policy.

• Theoretically AMD system has better bandwidth than Intel’s, yet the later benchmarked better results.

• Both are sensitive to contention degradation when data is centralized.

• Intel has better local access bandwidth but suffers from a harder drawback when data is accessed remotely.

• In AMD the distance or HOPS is dictated by the number of links between nodes instead of the layout.

It is of our belief that there is a sweet spot between the two extremes points of (i) programming an application dependent on the architecture and (ii) not considering the hardware features at all. That is a series of parameters adjustments that can be done at
user lever to reduce the drawback of NUMA effects. To validate this hypothesis we want to evaluate the comportment of real parallel memory-bound applications on the systems studied. We opted for sequencing aligners due to the vast relevance of genomic workflows in the scientific community. Such programs have several algorithmic phases, and even though they are labeled as mainly memory-bound, they also have a significant computational stage and non-depreciable I/O interaction. Everything related to the sequencing aligners and its background will be explained adequately in the next chapter.
"Don’t despair, not even over the fact that you don’t despair. Just when everything seems over with, new forces come marching up, and precisely that means that you are alive."

- Franz Kafka
4.1 Introduction

Deciphering the DNA through its variations has become an essential study that provides relevant results to all branches of biology and medicine. The advancement of technology has resulted in a reduction of costs in the so-called new generation sequencers, enabling to a large part of the scientific community have large volumes of data for research.

In this chapter, we will explain conceptually basic principles of the human genome study. In a first instance, a global introduction to the subject and then detailing each of the stages involved in the process, to provide a broad context to sequence alignment. Finally, we will describe all the short-read mappers used as a test bed to validate the NUMA execution guideline.

4.2 Motivation

The information contained in the chromosomes is what determines the characteristics of human beings, from our eye color to the disorders that we may be prone to suffer. Among the many possibilities that arise from this field of study, perhaps one of the most the resonance of personalized medicine. The variations in the genome of individuals can be associated with certain diseases and its analysis enables the creation of treatments adapted to the specific characteristics of a patient’s genome.

The genomic studies that have been carried out since the middle of the 70s were possible thanks to a set of biochemical techniques that obtained sequences of nucleotide bases from of a organic tissue. These techniques, known as sequencing, evolved rapidly to this day. Since approximately ten years new tools were introduced to the scientific community of sequencing, developed mainly by Illumnia, SOLiD Systems, 454 Roche and HeliScope. These New Generation Sequencers (NGS) have drastically improved the amount and quality of the output data, and significantly reducing the costs.

As a consequence, not only the laboratories with the latest technology have this equipment, but it is increasingly common that institutes, hospitals and universities acquire sequencers that generate massive data sets. Taking into account that many of these institutions have modest clusters composed of common PCs (Commodities PC) and that the sequencers mentioned above they have the capacity to produce data of the order of one billion of base pairs per day [6], the challenge is then in which the computational applications have sufficient capacity to deal with the analysis of this data in an efficient way.
4.3 NGS Workflows

Genome investigations are mainly based on the study of variations in DNA sequences. These variations can be identified by comparing to a reference human genome [50] or comparing sets of genomes of individuals to each other. According to the type of mutation and how it is expressed in the human genome, it is possible to relate them to certain diseases or disorders. Be able to determine if these variations can be associated to hereditary disorders (Mendelian), to complex diseases or somatic mutations, among others, is the result of a process analysis of the data obtained with NGS.

The entire process or workflow is very complex as it depends on many programs and databases and involves the handling of vast amounts of heterogeneous data. Thanks to the growing interest in projects NGS, it is understandable the appearance of numerous tools (private and opensource) that allow performing specific tasks within the workflow [51].

The choice of tools to use depends mostly on the biological results that are sought, so workflows are rarely the same. However, you can perform an abstraction and say that most of the workflows share the same conceptual stages.

4.3.1 Sequencing

The sequencing stage consists in the transformation of a tissue biological in DNA sequences, called readings (reads). The length of the reads is measured in the number of nucleotide bases it has. These fragments are combined in files of fastQ format. Some of the techniques used in
this process are:

(a) Pyrosequencing.

(b) Sequencing by synthesis.

(c) Sequencing by ligation.

(d) Sequencing of a single molecule, real-time sequencing.

(e) Sequencing Post-light.

The sequencing process follows a standard protocol for its realization. However, specific parameters can be customized to obtain reads with particular characteristics. These characteristics affect all stages of the workflow:

**Reads length**

During the sequencing, it is possible to specify the number of readings. The more extensive the readings, the more reliable they are. However, they imply a higher cost of generation and analysis. They usually range from dozens up to several hundred bases.

**Single-end y Paired-end reads**

In the single-end readings, the sequencer reads from one end of one molecule to the other, generating a sequence of even bases. In the paired-end readings, the sequencer starts at the end of a fragment and ends its reading in a specified number of bases and there begins a new reading from the opposite end in the direction opposite to the first. As the molecules have a length of two times greater than an average sequencer reading (Illumina in this case), the paired-end readings do not overlap. The data sequenced in paired-end readings come in two files. One that saves the sequences in one direction and the other stores in the order reverse.

Paired-end readings are better for identifying positions relative to numerous readings within the genome, allowing an effective way to solve structural arrangements such as insertion, the elimination or investments of genes.

**Coverage**

The coverage measures the number of times a specific site in the genome. It is sequenced during the sequencing process. It can be calculated as:
coverage = \( \frac{\text{No total of generated bases}}{\text{Size sequenced genome}} \)

For example, 30x coverage, which in some cases can be considered high coverage, means that on average each base is reported 30 times. This average does not say that it is made as an equal bread basket. Some bases can be sequenced 70 times and others 1 or 2, or even an arbitrary number of times. The higher the coverage, the more reliable the sequencing.

**Genome vs. Exome**

Human genome sequencing has reduced its costs with the introduction of new generation technology, however, remains expensive on a large scale. That’s why many researchers are inclined to use the exome sequencing. The exome is the coding part of the genome; it represents approximately 1.5% of it. A representation using smaller files, allows the exomes to work with coverage higher than the genomes. The exome sequencing was already used to identify molecular defects of a single gene, disorders heterogeneous and improve the accuracy of patient diagnosis.

### 4.3.2 Alignment

The alignment or alignment stage is the process in which determines the position corresponding to each reading of nucleotide bases. There are two main approaches to solve the alignment problem, which is based on using a reference genome (assembled by mapping) or dispense with it (assembled *de novo*). The *de novo* assembly consists of in determining the order of the fragments or readings, using overlapping of common regions to join them. The continuous sequence of readings is called “contig”. The contig length that an assembler can produce is limited by the number of repeated sequences, polymorphisms, missing data and errors [52].

In the assembly by mapping, each reading is compared to a reference genome, and there is where your position is determined. It is clear that when new species are investigated, the use of *de novo* alignment is required as there is no reference genome. But in the case of study of the human being, the decision on which method to choose becomes more complex. In *de novo*, all the readings are compared to each other. As a result, a much more precise process is obtained [53] yet it is computationally more demanding. It is harmed by readings too short, as this increases the possibility that more readings are overlaid in a non-univocal way. That is the reason why for short readings from 35bp to 100bp, the use of assemblers predominates by mapping that is faster, yet as a counterpart the assemblers that use a reference genome have
polarized results since readings that look more like the reference sequence are more likely to be mapped successfully compared to readings that contains valid differences [54].

Using the assembly by mapping would imply that to discover the position correct one reading should be compared to three billion of possible positions within the human genome. The current programs they use indexing techniques to optimize their mapping algorithms. The two most commonly used approaches [55] by The aligner programs are:

**Hash table based algorithms (paced-seed indexing):**

In this method, each position in the reference is fragmented into four pieces of equal size called seeds and said seeds they are paired and saved in a table of queries. As a result, you get six combinations of seed pairs for each possible position. Each reading is also fragmented in the same way where the seed pairs are used as a key to search for locations that correspond in the reference algorithm.

**Prefix/Suffix Tree based algorithms (Burrows-Wheeler Transformed):**

The use of the “Burrows-Wheeler transform” stores representations of the reference genome whose memory utilization is very efficient. Implementations of this method consume about 2GB of RAM to represent the complete genome, compared to the technique of spaced seeds that requires approximately 50GB. Readings are aligned character by character, from right to left using as a reference the transformed genome. Each new character aligned allows you to reduce the list of possible positions that may correspond to you to reading. Thus resulting in a significantly more complex algorithm but it runs faster.

### 4.3.3 Variant calling

In a few words at this stage, it is about determining the possible variants that an individual presents concerning a specific reference, and see how they relate to the biological phenomenon that is being investigated.

The set of variations that occur in a single nucleotide base they are known as SNV (Single Nucleotide Variation), within this category are the SNP single nucleotide polymorphisms (Single Nucleotide Polymorphism) and the insertions/deletions of a single INDELs base (InsertionDeletion). The second set of variations are structural variations SV (Structural Variation), which involve too large blocks (more than 100 bases) of DNA. In this category, there are duplication, INDELS, investments, translocations, among others [56].
Before the emergence of NGS technologies, genome studies human beings positioned the SNP variations as the main source of the phenotypic and genetic variations. However, in the past years has been shown, that there are large structural variations in the human genome [57]. The evidence indicates that the variations can be associated with diversity and susceptibility to diseases, despite the fact that most structural variations they are outside the coding region. The tools for variant, differ according to the types of variations what you want to search. But there is also another classification that is born of the biological phenomenon that you want to analyze:

1) **Determination of germline variants**, which is central to find the causes of rare and characteristic diseases genetic. It is obtained by comparing the tissue of the individual in question with the reference genome.

2) **Determination of somatic variants**, that are searched for diseases like cancer. But, in this case, a comparison is made between the results of the sequencing of healthy tissues with those of carcinogenic tissues of the same individual.

### 4.3.4 Annotation

After identifying the variations, it is necessary to be able to predict its functional impact. Expressed otherwise associating the variation to its corresponding biological information. For that, the tools who make the annotations, they usually compare the variations found with public databases like dbSNP [58] that provides more than 20,000 validated human SNPs, there are other bases of data such as HGMD (Human Gen Mutation Database) [59] that contains more of 76,000 mutations of approximately 2900 genes, or the COSMIC (Catalog of Somatic
4.4 Sequence Aligners

Sequence aligners - or aligners, for the sake of simplicity - can be classified into two main groups: based on hash tables or based on Suffix/Prefix trie. In “A survey of sequence alignment algorithms for next-generation sequencing” Li et. al. [61], define very precisely each:

- **Hash Table:** “The idea of hash table indexing can be tracked back to BLAST. All hash table based algorithms essentially follow the same seed-and-extend paradigm. BLAST keeps the position of each k-mer (k = 11 by default) subsequence of the query in a hash table with the k-mer sequence being the key, and scans the database sequences for k-mer exact matches, called seeds, by looking up the hash table.”

  SNAP is an example of a hash table-based aligner, where given a read to align draws multiple substrings of length s from it and performs an exact look-up in the hash index to find locations in the database that contain the same substrings. It then computes the edit distance between the read and each of these candidate locations to see the best alignment.

- **Suffix Trie:** “reduce the inexact matching problem to the exact matching problem and implicitly involve two steps: identifying exact matches and building inexact alignments supported by exact matches. To find exact matches, these algorithms rely on a certain representation of suffix/prefix trie.”

  Examples of BWT-based aligners are BWA, BOWTIE2 and GEM. Hash tables are a straightforward algorithm and are very easy to implement, but memory consumption is high; BWT algorithms, on the other hand, are complicated to implement but have low memory requirements and are significantly faster [55].

  On the other hand, BWT is an efficient data indexing technique that maintains a relatively small memory footprint when searching through a given data block. BWT is used to transform the reference genome into an FM-index, and, as a consequence, the look-up performance of the algorithm improves for the cases where a single read matches multiple locations. “The advantage of using a trie is that alignment to numerous identical copies of a substring in the reference is only needed to be done once because these identical copies collapse on a single
path in the trie, whereas with a typical hash table index, alignment must be performed for each copy.’

The computational time required by an aligner to map a given set of sequences, and the computer memory required, are critical characteristics, even for aligners based on BWT. If an aligner is extremely fast, but the computer hardware available for performing a given analysis does not have enough memory to run it, then the aligner is not very useful. Similarly, an aligner is not helpful either if it has low memory requirements and it is very slow. Hence, ideally, an aligner should be able to balance speed and memory usage while reporting the desired mappings \cite{62}. In \cite{24}, Misale et al. define three distinguishing features among the parallelization of sequence aligners:

1. There is a reference data structure indexed (in our study, the human genome reference). Typically this is read-only data.

2. There is a set of reads that can be mapped onto the reference independently.

3. The result consists in populating a shared data structure.

From a high-level point of view, this is the behavior of all the aligners that we used in this study. Therefore, continuous access to the single shared data structure -index- by all threads can increase memory performance degradation. Additionally, read mapping exhibits poor locality characteristics: when a particular section of the reference index is brought to the local cache of a given core, subsequent reads usually require a completely different part of the reference index and, hence, cache reuse is low.

### 4.4.1 Burrows Wheeler Aligner

Burrows-Wheeler Aligner (BWA) is one of the most used short-read mappers by the genomic community. BWA uses the BWT on a reference genome to create an index of it, and two auxiliary data structures. These data structures are used to calculate the range where a read sequence matches the reference genome, so they are accessed every time a read is analyzed. Once the index of the reference genome is allocated in memory, BWA takes each read and calculates its suffix array coordinates using BWT and the reference genome index. There is no dependency between reads so that several reads can be aligned in parallel. BWA allows multithreaded execution using pthreads. Each thread executes a sequential version of the core mapping algorithm, while main data structures are allocated at the beginning of execution.
before the rest of threads are created \cite{25}. For this work, we centered on BWA original algorithm \textit{aln} \cite{6}.

### 4.4.2 Bowtie 2

Bowtie 2 is a tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed. It runs on the command line under Windows, Mac OS X and Linux.

### 4.4.3 Genome Multitool 3

Genome Multitool 3 (GEM3) is a high-performance mapping tool for aligning sequenced reads against large reference genomes (e.g., the human genome). In particular, it is designed to obtain the best results when mapping sequences up to 1K bases long. GEM3 indexes the reference genome using a custom FM-Index design and performs an adaptive gapped search based on the characteristics of the input and the user settings. GEM3 allows performing complete searches that find all possible matches according to the user criteria. It enables the user to perform searches looking for the first-match/best-match, all-first-n matches, and all matches. GEM3 supports single-end and paired-end mapping modes. Also, it helps both global-alignment and local-alignment models for different error models. It offers an out-of-the-box multithreaded method to exploit several cores processors and achieve higher speed, without affecting the relative order of the reads. GEM3 is distributed under GPLv3 and runs on command line under Mac OSX and Linux.

### 4.4.4 Scalable Nucleotide Alignment Program

Scalable Nucleotide Alignment Program (SNAP) is a sequence aligner that is 3-20x faster and just as accurate as existing tools like BWA-mem or Bowtie2. It runs on commodity x86 processors and supports an error model that lets it cheaply match reads with more differences from the reference than other tools. SNAP search algorithm is based on a hash table. SNAP fast performance is related to the longer length of the reads that it processes which allow for
quick hash-based location of reads using larger “seed” sequences. It also meant to be executed on machines with at least 10 GB of RAM to take profit of the increased server memories, which allow trading memory to save CPU time.
"You can hold yourself back from the sufferings of the world, that is something you are free to do and it accords with your nature, but perhaps this very holding back is the one suffering you could avoid."

- Franz Kafka
5.1 Performance of sequence aligners on NUMA systems

In Chapter 3, we presented a series of studies that can be used to characterize NUMA architectures. From there we analyzed the NUMA performance aspects to understand the functioning of the system. We examine how this acquired knowledge can give us answers to possible problems that could be experienced by applications that are not NUMA-aware but that are executed on such systems. In this chapter, we propose to study the behavior of sequence aligners, which are real applications and mainly memory-bound and evaluate their level of NUMA-awareness. We will first start with BWA, which is one of the most adopted sequence aligners by the scientific community.

As explained in Chapter 4, BWA uses the Burrows-Wheeler Transformation of the reference genome, and two auxiliary data structures: c_tab and o_cctab. These data structures are used to calculate the range where a read sequence matches the reference genome, so they are accessed every time a read is analyzed. There is no dependency between reads so that several reads can be aligned in parallel. BWA allows multithread execution employing pthreads. Each thread executes a sequential version of the BWA algorithm. In BWA, the tables c-tab and o_cctab are allocated at the beginning of the execution in a single node, before the alignment starts. That means that all pthreads which are not on the local socket will have slower access time to these tables. BWA counts with three algorithms: bwa-backtrack, bwa-sw, bwa-mem.

In this section, we will focus on BWA’s original algorithm bwa-backtrack. BWA has three components: index, aln, and sampe/samse. The first component is used to index the reference genome, employing Burrows-Wheeler Transformation. It only needs to be done one time, due that the same reference genome can be used for all the executions of BWA. The aln command takes the reads and calculates their suffix array coordinates using BWT and the reference genome index. In the last sub-module samse -for single-end reads- and sampe -paired-end reads- the chromosome coordinates are created. BWA implements parallelization using pthreads library but only on the ALN component. To obtain the minimum execution time possible, one could think that the best strategy is to use as many threads as cores available. However, the results obtained are not consistent with this hypothesis. Execution times for BWA-ALN multithread using up to 64 threads are displayed in Figure 5.1.

The Figure 5.1 reflects the execution times as the number of threads increases on the AMD system. The blue line represents the execution times of BWA using the default allocation policy. Thus letting the operating system to be responsible for placing the data. It can be seen that BWA presents limited scalability beyond 32 cores. When executing an application on
a NUMA architecture, the most basic reasoning leads us to think that from 32 threads and higher, too many remote accesses are occurring and this counteracts the benefit of increasing parallelism. That is why it is even more surprising than when we forced the policy to allocate data so that it remains locally accessible to the main thread (red line), the performance is even worse. However, if we base our understanding on the NUMA performance aspects of this architecture, we know that it is very prone to have contention problems both in applications with sequential and random memory access patterns. Also, the fact of employing 32 threads-or higher- implies that remote access of the maximum distance cannot be avoided. However, as discussed in the Chapter[3] we can mitigate the effect if the data distribution is made using interleave policy, where access to all banks is reasonably balanced. When we change the policy to interleave (green line), we can observe how the scalability increases now beyond the 32 threads, as predicted.

The behavior of BWA is adjusted to what was expected, taking into account the NUMA performance aspects of the system. Thus, leading us to design an experiment to characterize and evaluate the behavior of a set of sequence aligners, which we will explain in the next section.
5.2 Allocation Strategies and Data Partitioning

The experiments carried out in this study are the result of a series of systematic tests designed to evaluate the behavior of aligners in the two NUMA architectures. The experimentation is divided into two main parts. In the first part, we tested several configurations of data allocation that enforced locality between threads and memory banks, alongside configurations where shared data structures are spread evenly on different memory banks. In the second part, experiments were based on the idea of data partitioning and replication: multiple independent instances of the same application were executed simultaneously, so the main shared data were replicated on each memory bank and input data split. A conceptual scheme of both experiments is shown in Figure 5.2. Details of these two schemes are presented below.

![Diagram of Allocation Strategies and Data Partitioning]

**Figure 5.2:** Experimentation approaches

5.2.1 Analysis of memory allocation

First, we analyzed the sensitivity of a particular aligner to different memory allocations. To achieve this, we carried out three experiments: the first one is a traditional scalability study in which aligners run with default system settings. We focused on 5 particular cases: using 8, 16, 32, 48 and 64 threads, because in the AMD system, each processor has 8 cores and 1 memory bank associated; so 8, 16, 32, 48 and 64 threads implies a minimum usage of 1, 2, 4, 6 and 8 NUMA nodes, respectively. To compare the output, we performed the same cases in the Intel system. For the other two cases, we used the Linux Tool `numactl` to set a memory policy allocation. With the parameter `–localalloc`, the data is allocated on the current node.
where threads are running the program. The idea behind this is to maximize local data affinity, keeping data onto the closest memory to the running processor. Finally, in the third case the  
-\textit{interleave} parameter is used so that memory is allocated in a round robin fashion between selected nodes. All the aligners that we used need two input data files: one that contains all the reads that need to be mapped and a second one that includes the reference genome index.

The goal of these experiments is, firstly, to gain insight into the level of scalability of the aligner. Additionally, re-running the aligner using different parameters of \textit{numactl} provides us with information about the behavior of the application and its data allocation sensitivity by using two extreme cases: when the locality and concurrency increase (\textit{localalloc}) and vice-versa (\textit{interleave}).

### 5.2.2 Data partitioning and replications strategies

With the second part of our experimentation, we aim to reduce to a minimum the contention produced when multiple threads access the index. To achieve this, we used data replication and data partitioning techniques. We ran simultaneous independent instances of a given aligner, each instance with a copy of the index. For example, if four concurrent independent instances are created, each one will process a 4th part of the original input data (see Figure 5.2b) and use an entire copy of the index. It is important to remark that creating instances increases the initial memory requirements, due to the fact now multiple copies of the index are required instead of just one. Ideally, each memory bank would hold a copy of the reference index, and the threads local to that bank would not need to access any data located remotely. For this case, we could think of each NUMA node as a symmetric multi-processor unit, capable of running an independent instance of an aligner. However, only BOWTIE2 and BWA generate an index small enough to fit on a memory bank of the systems we are using. The case of GEM and SNAP is different, where the index needs to be stored in more than one memory bank. To decide how many instances we would run, we took two critical factors into account:

1. the layout of the memory system.
2. the size of the index needed by the aligners.

Regarding the memory system, it is worth noting that AMD architecture presents more restricting features regarding the size of the memory banks than Intel because AMD memory banks are usually smaller. The AMD system also has a more complex layout because it
includes more NUMA nodes. Knowing this, we designed the experimentation having AMD’s constraints in mind, but considering that it would work on the Intel system too. The AMD system allowed us a maximum of 8 instances of 8 threads each, for an aligner with a small index. For these aligners, we also created other combinations with four and two instances of 16 and 32 threads, respectively; the goal here is to use always the maximum number of threads possible. In Table 5.1, we can see the sizes of the indexes used. For SNAP, it was not possible to create more instances than 2 due to the fact that the index does not fit on one memory bank of the AMD system and barely fits on two.

We introduced a novel hybrid execution technique combining the partitioning techniques with the memory allocation policies explained in Section 5.2.1: localalloc and interleave. Figure 5.3 shows a graphical representation of the three hybrid scenarios that we tested when partitioning techniques were combined with memory allocation policies. In this example, two instances (partitions) are created. Instance 1 is running on Socket 0 (Processor 0 and Processor 1) and Instance 2 on Socket 1 (processor 2 and processor 3). Each instance processes one half of the total input and uses a copy of the index. When combined with memory allocation policies, data is allocated in three different ways, which result in the three resulting scenarios:

- **Partitioning (Original)** Figure 5.3a Memory banks are reserved ahead of the execution using the command `membind`. NUMA nodes local to Processor 0 and Processor 1 are explicitly selected for Instance 1 and for Instance 2 the NUMA nodes local to Processor 2 and Processor 3. It does not necessarily mean that both memory banks would be used; i.e., they are allocated and will be used if needed.

- **Partitioning + Localalloc** Figure 5.3b The difference between this technique and “Partitioning” is that the reservation of NUMA nodes is performed implicitly, using the command `localalloc`.

- **Partitioning + Interleave** Figure 5.3c As in “Partitioning”, when an interleave policy is used, the NUMA nodes are explicitly reserved, but the allocation that takes place is done in a round-robin fashion, guaranteeing that both memories are being used and that the index is equally distributed.

### 5.3 Experimental Results

In this section, we show the main results obtained during the experimentation. For all the experiments, we used the reference human genome GRCh37 [63], maintained by The Genome
Reference Consortium, and two data sets were used as input data:

- **Synthetic benchmark [64]:**
  Single end, base length = 100, number of reads= 11M Size = 3.1GB

- **Segment extracted from NA12878 [65]:**
  Single end, base length = 100, number of reads= 22M Size = 5.4GB

The aligners were compiled using GCC 4.9.1 and we used the latest version available of each, as shown in the second column of Table 5.1. Results were obtained as an average of ten executions. Figures 5.4, 5.5, 5.6 and 5.7 of the following subsection show both average execution times and the corresponding standard deviation for each test. Detailed numerical values of mean execution times and corresponding relative errors are available in Tables A3, B5, C7 and D9 annexed at the end of this Chapter.
5.3.1 Analysis of memory allocation policies

In Figures 5.4, 5.5, 5.6 and 5.7, the execution times for all four aligners can be seen, each one using the datasets mentioned above: GCAT Synthetic Input and NA12878 Real Input. Execution times are also evaluated for both systems described in Chapter 2 (AMD-based cluster and Intel-based cluster). For each aligner, each figure shows how it scales when different memory allocation policies are used (namely, original, localalloc and interleave). This first set of experiments shows the behavior of the aligners under three scenarios. The first one (original) corresponds to the execution of a given aligner with its default parameters without any particular allocation policy or NUMA control and lets the operating system handle the allocation. On Linux systems, this will typically involve spreading the threads throughout the system and using the first-touch data allocation policy, which means that, when a program is started on a CPU, data requested by that program will be stored on a memory bank corresponding to its local CPU [37]. The allocation policy takes effect only when a process first requests a page. The second case (localalloc) corresponds to the scenario where the functions of the numactl utility are used to reduce remote access, restricting the allocation to specific nodes. The third case (interleave) evaluates the performance of the application when its memory pages are distributed in the nodes following a round-robin scheme. When aligners are executed with no explicit memory allocation policy (shown by the green line in Figures 5.4, 5.5, 5.6 and 5.7), scalability decreases significantly beyond 32 threads in all four aligners. When these aligners run on more than 32 cores, at least one NUMA node at a two-hops distance is used. Therefore, all the speedup gained due to multithreading is mitigated by the latency of remote accesses and traffic saturation of interconnection links. BOWTIE2 and BWA show a more regular and similar behavior. They reduce their execution time gradually to the point of using 32 cores. From there, their times are increased (slightly in the case of BWA and more significantly in the case of BOWTIE2). GEM and SNAP show a more unpredictable behavior since their execution times are not always reduced when the number of cores increases. It is worth mentioning that all aligners’ execution time when using the complete system is worse than using a smaller number of cores. From the two memory allocation policies, interleave is the one that performs best. For all the aligners, the interleaving policy reduces execution time as more processors are used (with the only exception being BOWTIE2 in the case of the real dataset). GEM and SNAP obtain their best execution times when interleave allocation is used. It is also worth noting that more stable and steady results are obtained with interleaving. The variability between executions is reduced drastically, mainly because interleave spreads data across the system, and the aligners
experience fair and balanced accesses.

From the two memory allocation policies, interleave is the one that performs best. For all the aligners, the interleaving policy reduces execution time as more processors are used (with the only exception being BOWTIE2 in the case of the real dataset). GEM and SNAP obtain their best execution times when interleave allocation is used. It is also worth noting that more stable and steady results are obtained with interleaving. The variability between executions is reduced drastically, mainly because interleave spreads data across the system, and the aligners experience fair and balanced accesses. As explained in Section 4.4, aligners share a common data structure -an index- among all threads. This structure is loaded in memory by the master thread (by default, Linux will place this data on its local memory bank).

Figure 5.4: Different memory allocation policies. The lower the better. Arch: AMD. Dataset: GCAT Synthetic Input
Figure 5.5: Different memory allocation policies. The lower the better. Arch: INTEL. Dataset: GCAT Synthetic Input

Therefore, as the number of threads increases, the memory bank that allocates the index becomes a bottleneck. Allocating data in an interleave way does not reduce remote accesses but guarantees a fair share of them between all memory banks and, therefore, prevents access contention, a phenomenon especially prone to happen in these architectures due to reduced memory bandwidth between NUMA nodes [17]. This reason explains why using an allocation policy that reinforces locality between processors and memory banks does not provide good results. Execution times are made worse by both increased latencies in memory accesses and increased congestion of banks containing the index. Execution times for a given aligner change depending on the system that is used. As mentioned in Chapter 2 systems are equivalent but not identical. It is also worth noting that the behavior of a given aligner will
change depending on the dataset. Aligners do not have a unique and uniform memory access pattern. Some queries are easier to map than others, and the capacity to map it among the regions of the human genome is not uniform \[66\]. Not all aligners process the queries in the same way, and the work done is irregular. The nature of the input data can significantly affect the behavior of an aligner. However, a memory placement strategy that distributes data evenly among all nodes seems to be the strategy that most consistently provides the best results.

Accessing remote data is not the only problem on NUMA architectures. The congestion is generated by multiple threads accessing a shared data structure increases. Executing the aligners with interleave policy does not reduce the number of remote accesses but diminishes the drawbacks of congestion in a sensible way. These outcome matches with the ones

\[\text{Figure 5.6: Different memory allocation policies. The lower the better. Arch: AMD. Dataset: NA12878 Real Input}\]
predicted by the *NUMA performance aspects* exposed in Chapter 3. When benchmarking both systems with STREAM, we observed and quantified how using an interleaving method to place the data improved the aggregated bandwidth of the entire system. However, after testing this theory, we validated that this also happens in a more complex application (than synthetic benchmark with sequential accesses).

### 5.3.2 Data partitioning and replication strategies

With the execution of multiple simultaneous instances, we aimed to attenuate the traffic of socket-interconnection buses and therefore reduce contention between threads. Latency will
EXPERIMENTATION

also improve because locality will increase and memory accesses will be available to local nodes. However, the way that instances are created is strongly conditioned by the system architecture that the aligner is running on. To generate the instances for each scenario, we have taken into account the aligners’ memory requirements (in particular, the size of the index generated by each aligner) and the amount of available space on the memory banks. Within the four chosen aligners for this study, there are two - BOWTIE2 and BWA - with indexes that are small enough to fit on one memory bank. For these aligners, we have been able to create as many instances as there where available NUMA domains. As the maximum number of threads is the same on both architectures, the same test can be executed on both systems. The maximum number of instances possible for AMD is eight instances of 8 threads. It constitutes the most desirable situation because all instances would have all needed data stored locally, and remote accesses to remote banks would be significantly reduced. Unfortunately, this is not the situation for GEM and SNAP. Their indexes require more than one bank. Therefore their instances are not entirely isolated and, although replication strategies are used, remote accesses and memory contention cannot be avoided altogether. For these two aligners, only two instances of 32 threads were possible. To complete the experimentation, we have designed tests with other possible combinations of instance x threads for BOWTIE2 and BWA: 4 instances of 16 threads and two instances of 32 threads. For all scenarios, the subset of NUMA nodes used was selected using the NUMA performance aspects. Once the number of instances is determined, a memory allocation policy is applied, thus generating the hybrid scenarios described in Section 5.2.2. In Table 5.2, we can see a list of all the hybrid scenarios that were tested in this part of the experimentation. The name of each case, as used in Figures 5.8, 5.9, 5.10 and 5.11, appears in the column Names.

Figures 5.8, 5.9, 5.10 and 5.11 show a complete speedup comparison of all strategies when the maximum number of cores are being used. It means that whole systems were used (with 64 threads and all memory banks). Speedup has been computed by using the execution time achieved by each aligner when it was executed on the whole system with its default setup (i.e., with no memory allocation policy and data partitioning strategy applied) as a baseline.

From these experiments, it is observed that all aligners benefit from execution through the creation of instances in all cases. However, some slight differences can be seen in the aligners’ behavior. On the one hand, aligners with small indexes (BOWTIE2 and BWA-MEM) improve their execution times in all scenarios in which instances are used, regardless of the memory allocation policy applied. The remarkable case is BOWTIE2, which presents speed-ups between 1.5x and 5x comparing to its original configuration. The best results are obtained
Table 5.2: Instances created for each aligner

<table>
<thead>
<tr>
<th>Aligner</th>
<th>#Inst x Threads</th>
<th>Policy</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>2x32t</td>
<td>Original</td>
<td>2Inst_Original</td>
</tr>
<tr>
<td></td>
<td>4x16t</td>
<td>Original</td>
<td>4Inst_Original</td>
</tr>
<tr>
<td></td>
<td>8x8t</td>
<td>Original</td>
<td>8Inst_Original</td>
</tr>
<tr>
<td>BOWTIE2</td>
<td>2x32t</td>
<td>Localalloc</td>
<td>2Inst_Localalloc</td>
</tr>
<tr>
<td>BWA-MEM</td>
<td>4x16t</td>
<td></td>
<td>4Inst_Localalloc</td>
</tr>
<tr>
<td></td>
<td>2x32t</td>
<td>Interleave</td>
<td>2Inst_Interleave</td>
</tr>
<tr>
<td></td>
<td>4x16t</td>
<td></td>
<td>4Inst_Interleave</td>
</tr>
<tr>
<td>SNAP</td>
<td>2x32t</td>
<td>Original</td>
<td>2Inst_Original</td>
</tr>
<tr>
<td>GEM</td>
<td>2x32t</td>
<td>Localalloc</td>
<td>2Inst_Localalloc</td>
</tr>
<tr>
<td></td>
<td>2x32t</td>
<td>Interleave</td>
<td>2Inst_Interleave</td>
</tr>
</tbody>
</table>
Figure 5.8: Different memory allocation policies. The lower the better. Arch: AMD. Dataset: GCAT Synthetic Input

when using a higher number of instances (8Inst Original in all Figures), which corresponds to the case that maximum locality is achieved, also reducing memory controller congestion. When using a smaller number of instances aligners also benefit from the combination with the memory allocation policies. The interleaving allocation provides slightly better results compared to the other two cases. That means that if BWA-MEM or BOWTIE2 are executed with four instances of 16 processors each, allocating memory in a round-robin fashion provides better results than allocating memory in any other way. This hybrid schema offers the best trade-off between locality increase and contention avoidance.

On the other hand, aligners with larger indexes (GEM and SNAP) always benefit from the creation of instances combined with interleaving allocation. Execution times did not remain
5.3.3 Summary results

Figure 5.12 summarizes the main results achieved by each aligner when they were executed with the maximum number of resources (i.e., using 64 threads). In each Figure, we can see four sets of tests: one for each input and one for each architecture, four combinations better than the default ones when the instances were combined using other allocation schemes. The strategy that combines multiple instances and memory interleaving improves execution time up to 5x in the case of GEM and 2.8X in the case of SNAP. However, it is worth noting that none of these results are better than using an interleave allocation policy alone.
of experiments in total. For each set, the first column represents the execution time of the aligner without any added memory policy or data partition. The second column is the best time achieved when the aligner was executed with a memory policy, and the last column is the best resulting execution time with data partitioning and independent instances. As said before, aligners exhibit random access patterns to memory and are also sensitive to the input data. However, as Figure 5.12 shows, significant improvements in execution times were achieved by all four aligners when memory interleaving or multiple instances were used. According to these results, we can deduce a rule of thumb that has shown to be valid for this set of representative aligners. For BOWTIE2 and BWA-MEM, where the size of the index is much smaller than the capacity of a memory bank, data partitioning arises as the

Figure 5.10: Different memory allocation policies. The lower the better. Arch: AMD. Dataset: NA12878 Real Input
best solution because it allows us to minimize the usage of socket interconnection links (QPI and HyperTransport). For aligners with larger indexes, such as SNAP and GEM, even when data partitioning is employed, more than one memory bank is required to store the index, and accesses through interconnection links cannot be avoided. For these aligners, execution time is mostly reduced when a pure interleave policy is employed, ensuring that accesses are equally distributed among all memory banks and, therefore, the contention is minimized.

Observed improvements in execution times were higher when using real input data. Significant improvements were also obtained on the AMD system. The NUMA architecture of this system has longer distances between processors and memory banks and shows latencies greater than those of the Intel system. Therefore, aligners suffer numerous penalties in AMD
Figure 5.12: Execution times: Summary results for all aligners. The lower the better

architectures regarding memory accesses in general; but, by applying NUMA-aware strategies, aligners also show more substantial improvements in such systems.
5.4 Annexed results

Table A3: Execution times for BOWTIE2 using different memory allocation policies

<table>
<thead>
<tr>
<th></th>
<th>BOWTIE2</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>AMD</td>
<td>GCAT</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
<td>Localalloc</td>
</tr>
<tr>
<td>Threads</td>
<td></td>
<td>Mean [s] Error [%]</td>
<td>Mean [s] Error [%]</td>
<td>Mean [s] Error [%]</td>
<td>Mean [s] Error [%]</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>620.37</td>
<td>0.80</td>
<td>752.28</td>
<td>0.57</td>
<td>749.88</td>
</tr>
<tr>
<td>16</td>
<td>339.21</td>
<td>2.77</td>
<td>480.94</td>
<td>6.47</td>
<td>455.81</td>
</tr>
<tr>
<td>32</td>
<td>260.75</td>
<td>15.73</td>
<td>573.00</td>
<td>6.91</td>
<td>326.25</td>
</tr>
<tr>
<td>48</td>
<td>385.47</td>
<td>13.52</td>
<td>369.64</td>
<td>14.60</td>
<td>383.85</td>
</tr>
<tr>
<td>64</td>
<td>529.92</td>
<td>10.51</td>
<td>520.64</td>
<td>11.82</td>
<td>480.47</td>
</tr>
</tbody>
</table>

|                | INTEL   | GCAT          |               |               |               |
|                | NA12878 | Original      | Localalloc    | Interleave    | Original      | Localalloc    | Interleave    |
| Threads        |         | Mean [s] Error [%] | Mean [s] Error [%] | Mean [s] Error [%] | Mean [s] Error [%] | Mean [s] Error [%] | Mean [s] Error [%] |
|                |         |               |               |               |               |               |               |
| 8              | 721.51  | 0.41          | 722.62        | 0.27          | 666.22        | 0.59          | 915.07        | 0.10          | 1174.50       | 0.51          | 1169.95       | 0.26          |
| 16             | 403.03  | 1.00          | 403.99        | 3.43          | 373.43        | 1.86          | 495.40        | 0.62          | 563.76        | 1.72          | 569.07        | 0.33          |
| 32             | 350.98  | 1.15          | 347.59        | 2.18          | 252.68        | 5.08          | 303.09        | 1.69          | 501.55        | 1.36          | 530.68        | 0.39          |
| 48             | 293.05  | 1.92          | 279.86        | 1.58          | 275.15        | 2.71          | 296.58        | 1.14          | 344.19        | 1.26          | 368.96        | 0.75          |
| 64             | 400.77  | 1.60          | 402.51        | 1.75          | 417.47        | 1.77          | 292.12        | 0.86          | 282.03        | 1.10          | 292.53        | 0.54          |
**Table A4:** Execution times for BOWTIE2 using data partitioning

<table>
<thead>
<tr>
<th>Name</th>
<th>Threads</th>
<th>AMD NA12878</th>
<th>AMD GCAT</th>
<th>INTEL NA12878</th>
<th>INTEL GCAT</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Mean [s]</td>
<td>Error [%]</td>
<td>Mean [s]</td>
<td>Error [%]</td>
<td>Mean [s]</td>
</tr>
<tr>
<td>2Inst Original</td>
<td>32</td>
<td>284.78</td>
<td>201.50</td>
<td>272.50</td>
<td>4.25</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>115.03</td>
<td>188.86</td>
<td>214.53</td>
<td>6.16</td>
</tr>
<tr>
<td>8Inst Original</td>
<td>8</td>
<td>104.52</td>
<td>175.36</td>
<td>191.58</td>
<td>2.42</td>
</tr>
<tr>
<td>2Inst Localalloc</td>
<td>32</td>
<td>398.46</td>
<td>202.33</td>
<td>278.52</td>
<td>2.77</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>127.08</td>
<td>175.09</td>
<td>211.12</td>
<td>2.31</td>
</tr>
<tr>
<td>2Inst Interleave</td>
<td>32</td>
<td>160.59</td>
<td>197.56</td>
<td>293.21</td>
<td>1.45</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>118.75</td>
<td>180.48</td>
<td>206.66</td>
<td>2.07</td>
</tr>
</tbody>
</table>
## Experimentation

Table B5: Execution times for BWA-MEM using different memory allocation policies

### BWA-MEM

#### AMD

<table>
<thead>
<tr>
<th>Threads</th>
<th>Original Mean [s]</th>
<th>Error [%]</th>
<th>Localalloc Mean [s]</th>
<th>Error [%]</th>
<th>Interleave Mean [s]</th>
<th>Error [%]</th>
<th>Original Mean [s]</th>
<th>Error [%]</th>
<th>Localalloc Mean [s]</th>
<th>Error [%]</th>
<th>Interleave Mean [s]</th>
<th>Error [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>621.40</td>
<td>0.85</td>
<td>618.78</td>
<td>0.41</td>
<td>620.07</td>
<td>0.33</td>
<td>1223.40</td>
<td>9.22</td>
<td>1224.56</td>
<td>0.40</td>
<td>1225.20</td>
<td>0.23</td>
</tr>
<tr>
<td>16</td>
<td>333.76</td>
<td>0.33</td>
<td>348.35</td>
<td>0.45</td>
<td>346.37</td>
<td>0.71</td>
<td>619.03</td>
<td>0.40</td>
<td>658.15</td>
<td>0.60</td>
<td>655.84</td>
<td>0.90</td>
</tr>
<tr>
<td>32</td>
<td>234.89</td>
<td>0.71</td>
<td>269.83</td>
<td>15.78</td>
<td>214.69</td>
<td>0.26</td>
<td>381.98</td>
<td>1.44</td>
<td>479.43</td>
<td>14.90</td>
<td>377.16</td>
<td>0.42</td>
</tr>
<tr>
<td>48</td>
<td>254.82</td>
<td>25.22</td>
<td>264.07</td>
<td>15.57</td>
<td>164.24</td>
<td>1.68</td>
<td>374.68</td>
<td>9.25</td>
<td>480.27</td>
<td>11.91</td>
<td>276.52</td>
<td>0.40</td>
</tr>
<tr>
<td>64</td>
<td>244.95</td>
<td>6.02</td>
<td>253.13</td>
<td>2.45</td>
<td>145.10</td>
<td>1.49</td>
<td>376.32</td>
<td>2.20</td>
<td>367.73</td>
<td>0.67</td>
<td>234.23</td>
<td>0.40</td>
</tr>
</tbody>
</table>

#### INTEL

<table>
<thead>
<tr>
<th>Threads</th>
<th>Original Mean [s]</th>
<th>Error [%]</th>
<th>Localalloc Mean [s]</th>
<th>Error [%]</th>
<th>Interleave Mean [s]</th>
<th>Error [%]</th>
<th>Original Mean [s]</th>
<th>Error [%]</th>
<th>Localalloc Mean [s]</th>
<th>Error [%]</th>
<th>Interleave Mean [s]</th>
<th>Error [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>671.26</td>
<td>0.25</td>
<td>598.68</td>
<td>0.39</td>
<td>591.44</td>
<td>1.21</td>
<td>1242.25</td>
<td>8.23</td>
<td>1181.80</td>
<td>0.32</td>
<td>1175.52</td>
<td>0.48</td>
</tr>
<tr>
<td>16</td>
<td>368.48</td>
<td>1.10</td>
<td>391.26</td>
<td>3.24</td>
<td>385.35</td>
<td>0.65</td>
<td>651.29</td>
<td>0.48</td>
<td>703.42</td>
<td>2.71</td>
<td>691.03</td>
<td>1.23</td>
</tr>
<tr>
<td>32</td>
<td>234.07</td>
<td>1.37</td>
<td>309.14</td>
<td>2.70</td>
<td>293.92</td>
<td>0.67</td>
<td>393.66</td>
<td>0.33</td>
<td>550.59</td>
<td>2.84</td>
<td>521.19</td>
<td>0.35</td>
</tr>
<tr>
<td>48</td>
<td>239.97</td>
<td>25.11</td>
<td>265.37</td>
<td>4.11</td>
<td>234.67</td>
<td>0.76</td>
<td>350.85</td>
<td>3.75</td>
<td>417.55</td>
<td>3.18</td>
<td>380.09</td>
<td>0.27</td>
</tr>
<tr>
<td>64</td>
<td>260.20</td>
<td>10.04</td>
<td>263.35</td>
<td>10.85</td>
<td>207.75</td>
<td>1.24</td>
<td>387.17</td>
<td>1.60</td>
<td>384.51</td>
<td>2.25</td>
<td>368.91</td>
<td>1.02</td>
</tr>
</tbody>
</table>

Table B6: Execution times for BWA-MEM using data partitioning

### BWA-MEM

#### AMD

<table>
<thead>
<tr>
<th>Name</th>
<th>Threads</th>
<th>NA12878 Original Mean [s]</th>
<th>Error [%]</th>
<th>GCAT Mean [s]</th>
<th>Error [%]</th>
<th>NA12878 Localalloc Mean [s]</th>
<th>Error [%]</th>
<th>GCAT Mean [s]</th>
<th>Error [%]</th>
<th>NA12878 Interleave Mean [s]</th>
<th>Error [%]</th>
<th>GCAT Mean [s]</th>
<th>Error [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>2Inst_Ori</td>
<td>32</td>
<td>315.47</td>
<td>17.32</td>
<td>254.19</td>
<td>10.03</td>
<td>186.52</td>
<td>3.61</td>
<td>285.88</td>
<td>2.50</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4Inst_Ori</td>
<td>16</td>
<td>120.62</td>
<td>26.46</td>
<td>187.19</td>
<td>11.30</td>
<td>128.07</td>
<td>4.82</td>
<td>273.23</td>
<td>3.74</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8Inst_Ori</td>
<td>8</td>
<td>95.18</td>
<td>14.60</td>
<td>174.18</td>
<td>8.13</td>
<td>113.95</td>
<td>18.42</td>
<td>240.59</td>
<td>3.04</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4Inst_Loc</td>
<td>16</td>
<td>143.99</td>
<td>9.20</td>
<td>204.88</td>
<td>10.59</td>
<td>144.75</td>
<td>6.67</td>
<td>224.44</td>
<td>9.41</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2Inst_Int</td>
<td>32</td>
<td>123.69</td>
<td>0.92</td>
<td>208.05</td>
<td>4.07</td>
<td>194.23</td>
<td>14.67</td>
<td>283.68</td>
<td>0.67</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4Inst_Int</td>
<td>16</td>
<td>122.814</td>
<td>9.51</td>
<td>173.85</td>
<td>0.11</td>
<td>131.57</td>
<td>8.24</td>
<td>196.14</td>
<td>1.14</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Table C7: Execution times for GEM using different memory allocation policies

<table>
<thead>
<tr>
<th>Threads</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
</tr>
<tr>
<td>16</td>
</tr>
<tr>
<td>32</td>
</tr>
<tr>
<td>48</td>
</tr>
<tr>
<td>64</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>GEM</th>
<th>AMD</th>
<th>GCAT</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>359.59</td>
<td>9.24</td>
<td>675.36</td>
<td>4.06</td>
<td>522.62</td>
<td>6.06</td>
<td>156.29</td>
<td>5.40</td>
<td>217.24</td>
<td>5.71</td>
</tr>
<tr>
<td>16</td>
<td>562.60</td>
<td>14.49</td>
<td>894.56</td>
<td>13.89</td>
<td>700.63</td>
<td>7.67</td>
<td>121.81</td>
<td>13.01</td>
<td>223.38</td>
<td>14.22</td>
</tr>
<tr>
<td>32</td>
<td>712.83</td>
<td>4.44</td>
<td>691.45</td>
<td>14.40</td>
<td>329.49</td>
<td>2.90</td>
<td>175.94</td>
<td>14.47</td>
<td>232.01</td>
<td>12.09</td>
</tr>
<tr>
<td>48</td>
<td>488.35</td>
<td>15.41</td>
<td>756.27</td>
<td>12.09</td>
<td>127.51</td>
<td>4.14</td>
<td>115.13</td>
<td>8.22</td>
<td>87.39</td>
<td>13.86</td>
</tr>
<tr>
<td>64</td>
<td>616.17</td>
<td>15.99</td>
<td>805.56</td>
<td>12.82</td>
<td>127.24</td>
<td>2.76</td>
<td>186.47</td>
<td>10.18</td>
<td>162.08</td>
<td>10.21</td>
</tr>
</tbody>
</table>

Table C8: Execution times for GEM using data partitioning

<table>
<thead>
<tr>
<th>GEM</th>
<th>AMD</th>
<th>GCAT</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
<th>MEAN [s]</th>
<th>ERROR [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Name</td>
<td>Threads</td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td></td>
</tr>
<tr>
<td>2Inst, Original</td>
<td>32</td>
<td>183.21</td>
<td>7.10</td>
<td>197.09</td>
<td>23.35</td>
<td>68.99</td>
<td>12.04</td>
<td></td>
</tr>
<tr>
<td>2Inst, Localalloc</td>
<td>32</td>
<td>457.95</td>
<td>6.57</td>
<td>277.87</td>
<td>7.92</td>
<td>362.08</td>
<td>22.92</td>
<td></td>
</tr>
<tr>
<td>2Inst, Interleave</td>
<td>32</td>
<td>90.68</td>
<td>14.34</td>
<td>36.39</td>
<td>10.91</td>
<td>104.25</td>
<td>9.63</td>
<td>38.77</td>
</tr>
</tbody>
</table>
Table D9: Execution times for SNAP using different memory allocation policies

<table>
<thead>
<tr>
<th></th>
<th>SNAP AMD</th>
<th></th>
<th>SNAP INTEL</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
</tr>
<tr>
<td></td>
<td>Mean [s]</td>
<td>Error [%]</td>
<td>Mean [s]</td>
<td>Error [%]</td>
</tr>
<tr>
<td>Threads</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>266.12</td>
<td>10.19</td>
<td>359.73</td>
<td>5.32</td>
</tr>
<tr>
<td>16</td>
<td>179.59</td>
<td>13.67</td>
<td>297.83</td>
<td>6.04</td>
</tr>
<tr>
<td>32</td>
<td>196.58</td>
<td>9.30</td>
<td>316.51</td>
<td>10.42</td>
</tr>
<tr>
<td>64</td>
<td>310.72</td>
<td>5.84</td>
<td>300.50</td>
<td>8.82</td>
</tr>
</tbody>
</table>

Table D10: Execution times for SNAP using data partitioning

<table>
<thead>
<tr>
<th></th>
<th>SNAPSHOT AMD</th>
<th></th>
<th>SNAPSHOT GCAT</th>
<th></th>
<th>SNAPSHOT INTEL</th>
<th></th>
<th>SNAPSHOT GCAT</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
<td>Localalloc</td>
<td>Interleave</td>
<td>Original</td>
</tr>
<tr>
<td></td>
<td>Mean [s]</td>
<td>Error [%]</td>
<td>Mean [s]</td>
<td>Error [%]</td>
<td>Mean [s]</td>
<td>Error [%]</td>
<td>Mean [s]</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Threads</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>238.91</td>
<td>10.97</td>
<td>303.45</td>
<td>6.86</td>
<td>451.87</td>
<td>7.10</td>
<td>456.73</td>
</tr>
<tr>
<td>16</td>
<td>226.77</td>
<td>10.96</td>
<td>287.53</td>
<td>10.75</td>
<td>255.52</td>
<td>8.15</td>
<td>355.90</td>
</tr>
<tr>
<td>32</td>
<td>172.94</td>
<td>15.69</td>
<td>299.46</td>
<td>11.00</td>
<td>199.86</td>
<td>15.00</td>
<td>301.85</td>
</tr>
<tr>
<td>48</td>
<td>266.62</td>
<td>5.76</td>
<td>339.21</td>
<td>16.89</td>
<td>102.13</td>
<td>6.21</td>
<td>168.38</td>
</tr>
<tr>
<td>64</td>
<td>209.95</td>
<td>8.57</td>
<td>284.43</td>
<td>2.53</td>
<td>77.62</td>
<td>0.72</td>
<td>344.04</td>
</tr>
</tbody>
</table>
Conclusions and Future lines

"Anyone who keeps the ability to see beauty never grows old."

- Franz Kafka
6.1 Conclusions

Knowing the underlying architecture where applications are running is a key aspect of achieving their optimal performance. If an application is memory-bound, it might suffer performance issues when executed in NUMA systems. We proposed a series of systematic steps to evaluate and characterize NUMA systems. These steps were carried out employing well-known memory benchmarks like STREAM and lmbench. We quantify the effects of poor locality, the sub-optimal combination of a subset of NUMA nodes and race condition of parallel applications through four dimensions called NUMA performance aspects. With this, we gained an understanding of the architecture employed when we design experimentation to validate if the behavior detected with the synthetic benchmarks could be extrapolated to real and complex applications.

We evaluated several genomic aligners, and we have seen that they exhibit poor scalability in modern NUMA systems because they are penalized by contention and/or remote memory bank accesses. Our experiments have shown that increasing data locality may not always produce the expected outcome. As the number of threads increases, all aligners show poor scalability. In our study, we have shown that this phenomenon is not only related to remote memory accesses taking place but also due to the memory contention generated by the race of multiple threads trying to access a single shared data structure. Minimizing memory contention is a key aspect in increasing the performance of aligners.

Our experiments have found that congestion causes the most severe NUMA problems for a representative set of genomic aligners. Congestion happens when the rate of requests to memory banks or the rate of traffic over interconnects is too high. As a consequence, memory accesses are delayed and execution time increases. We evaluated several strategies that can be applied to alleviate this problem so that the application can take advantage of all the available processors in existing NUMA systems. These strategies do not require changes to the original application code, and they don’t require either kernel modification or privilege permissions. We explored several solutions based on combining two concepts: congestion avoidance and increased locality. Congestion avoidance looks to balance the traffic among multiple memory banks. Genomic aligners with large reference indexes (GEM and SNAP) especially benefit from this strategy. Running aligners reinforced the increased locality in multiple instances. Aligners with small indexes (BWA-MEM and BOWTIE2) show significant improvements in execution time thanks to this strategy, which could also be combined with memory interleaving if the size of the memory banks is not large enough to hold genome indexes.
CONCLUSIONS AND FUTURE LINES

Improvements in execution times of 5x and 2.5x were obtained for BOWTIE2 and BWA-MEM, respectively, when the aligners were executed with the maximum number of threads (64). For other aligners with larger indexes (i.e., SNAP and GEM), the *interleave* technique proved to be a better choice because the index is distributed across the system memory banks and mitigates the contention produced when all threads try to access the same data structure. Improvements of up to 5x and 2.8x were obtained for GEM and SNAP, respectively.

Although no single strategy emerges as the best for all scenarios, the proposed approach of this study improved the performance of all the aligners. It is not a minor achievement taking into account that the behavior of the aligners is quite susceptible to variation depending on the nature of the input data and the system architecture they run on.

It is reasonable to assume that NUMA systems in the future will have more NUMA nodes and more complicated interconnection topologies. Thus implying that NUMA effects will continue to be a concern. Therefore, it will be necessary to apply techniques such as those presented in this work so that parallel applications can be optimized efficiently to take advantage of all the resources that will be available in those systems. We have evaluated several strategies that don’t require changes to the applications. However, we expect that higher improvements could be achieved if NUMA-awareness is integrated into the design of new aligners.

6.2 Future Lines

As future lines we see several possible open paths:

- **Implement a unified tool:** As a first step we want to implement the methodology as a tool. We would like to codify a program that groups all the benchmarks used in this research, allowing the user to characterize the architecture automatically. Currently, it is implemented through scripts, but we believe it would be a better fit to do it as an autonomous tool. Ideally, this tool would allow an inexperienced user to understand the limitations of the system better, and obtain hints on how to execute parallel applications, for example, if it is convenient to run her application using the interleave policy, or maybe instances replication, or nothing at all. Also, it would allow to interpret and quantify the *NUMA performance aspects* graphically. This information could be beneficial to parallel programmers too. They could design better allocation strategies, through the usage of the numalib library, identify the
stages of their code and select an allocation policy accordingly.

- **Create NUMA pragmas:** As a long-term goal, we would like to focus our guideline not only in characterizing the architecture but also the parallel applications. We want to create *NUMA pragmas* to instrument code to detect applications’ contention problems in NUMA systems. That could be reported by existing performance analysis tools that use our tool as an add-on. In a second step, specific pragmas could be interpreted by compiler an apply creation and execution of instances and the data replication directly.

- **Extend our model to all NGS workflow:** Some of the execution strategies introduced in this research can be extended to the following stages of the NGS workflow. For the variant calling stage, where the parallelization of the tools is limited, we have some preliminary results that show scalability problems on NUMA systems. It would be interesting to test our proposed guideline with several known applications of this stage, specially GATK [67]. When working with a set of stages sharing intermediate data, like genomic workflows, the spectrum of possible strategies for improving the usage of memory increases. One promising idea is to apply instance and data replication to the alignment and variant calling stages but passing the data between them using a NUMA-aware RAM drive instead of writing files on disk.
Bibliography


