
\input{packages.tex}
\newcounter{mypara}
\setcounter{mypara}{0}
\setcounter{secnumdepth}{4}
%Bibliography
\addbibresource{bibliographie.bib}
% Document
\begin{document}
\sloppy
\input{titlepage.tex}
\setcounter{tocdepth}{4}
\setcounter{secnumdepth}{4}
% Table of contents
\tableofcontents
\newpage
% First chapter
\chapter{Introduction}
This class focuses on the NVIDIA Jetson TX2, an embedded architecture from the same family as the Nintendo Switch. It is used in cars, satellites, etc.
This is not a toy but a devkit.
\chapter{Programmable architectures}
\section{Simplified CPU architecture}
In computer science, what we do is use instructions to manipulate memory. More precisely, there are:
\begin{itemize}
\item control instructions
\item arithmetic instructions
\item memory instructions
\end{itemize}
\section{CPU Architecture}
On a modern CPU, the core frequency is between 1 GHz and 4 GHz.
Memory throughput is around 50 GB/s (DDR5).
But the CPU consumes data much faster than the RAM can deliver it (roughly $\times$100), which is mitigated by caching.
\subsection{Memory hierarchy}
Most applications frequently reuse the same data.
The typical latencies are:
\begin{itemize}
\item L1: $\sim$2 cycles
\item L2: $\sim$10 cycles
\item L3: $\sim$30 cycles
\end{itemize}
Writing implies two policies:
\begin{itemize}
\item write-through: write to both the RAM and the cache, a choice made in some architectures; it keeps memory consistent but consumes memory bandwidth.
\item write-back: write to the cache only, and copy the modified (dirty) line back to RAM when it is evicted.
\end{itemize}
\subsection{Spatial locality}
A cache line is a fixed amount of data and is the smallest unit transferred between the RAM and the CPU, so it is interesting to access data contiguously.
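As an illustration (a hypothetical example, not from the slides), iterating over a 2D array in row-major order touches consecutive addresses, hence consecutive cache lines, while column-major order jumps by a whole row at every access:
\begin{verbatim}
/* C stores 2D arrays in row-major order, so the i-then-j
 * loop walks through memory contiguously. */
#define N 1024
float a[N][N];

float sum_rows(void) {          /* cache-friendly */
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];       /* consecutive addresses */
    return s;
}

float sum_cols(void) {          /* cache-hostile */
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];       /* stride of N floats */
    return s;
}
\end{verbatim}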
\subsection{SIMD}
With scalar instructions, in 1 cycle, an instruction combines two operands to produce the content of one register.
With SIMD, in 1 cycle you do that for 4 operations at the same time. The registers are larger than the regular ones and are called differently: vector registers.
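A minimal sketch of such a 4-wide addition with ARM NEON intrinsics (the TX2 cores support NEON; with the right flags the compiler often produces the same code from a plain scalar loop):
\begin{verbatim}
#include <arm_neon.h>

/* c[i] = a[i] + b[i], assuming n is a multiple of 4. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   /* load 4 floats  */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vaddq_f32(va, vb)); /* 4 adds at once */
    }
}
\end{verbatim}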
\section{Co-processors}
Hardware separate from the CPU, often connected via PCIe, with its own RAM.
\section{GPU Architecture}
Generally much more parallel than CPUs: fewer control parts, but many more compute units.
Memory is faster: around 500 GB/s.
It is often suitable for scientific computing, but not always!
Hardware acceleration for AI (tensor cores) is often present. It is not studied in this course.
\subsection{Nvidia Ampere}
We can see that there are a lot of cores, many more than in a CPU.
\section{Supercomputer architecture}
Here we describe it as 8 computers (nodes) connected via a switch. The theoretical peak performance is 8 times that of a node.
\chapter{Single-core CPU architecture}
\section{Non-pipelined processors}
[Reminder about non-pipelined execution]
\section{Pipelined processors}
[Reminder about pipelining]
Note: the pipeline makes the latency of a single instruction higher, but it increases the throughput.
Modern CPUs have 10- to 20-stage pipelines.
Conditional branching makes it difficult to guess which instruction comes next, which is bad for pipeline efficiency (it creates bubbles), so we have to predict the outcome, with branch prediction, to limit the effect.
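As a hypothetical illustration (not from the slides), a data-dependent branch like the one below mispredicts often on random input; the branchless form lets the compiler use a conditional select instead of a jump:
\begin{verbatim}
/* Branchy version: one hard-to-predict jump per element. */
int count_pos_branch(const int *x, int n) {
    int c = 0;
    for (int i = 0; i < n; i++)
        if (x[i] > 0) c++;
    return c;
}

/* Branchless version: the comparison result (0 or 1) is
 * added directly, no conditional jump in the loop body. */
int count_pos_branchless(const int *x, int n) {
    int c = 0;
    for (int i = 0; i < n; i++)
        c += (x[i] > 0);
    return c;
}
\end{verbatim}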
\section{Superscalar processors}
[Reminder about superscalar execution]
ILP (instruction-level parallelism) is what the superscalar fetcher/decoder exploits: several instructions are fetched and decoded per cycle.
\section{Out of order execution (OoO)}
The CPU can reorder instructions based on their data dependencies.
\section{Example: Apple Silicon M1}
This is an OoO CPU, since there is a dispatcher with a reorder buffer.
A big part of the core is dedicated to SIMD.
\section{Example: Intel Alder Lake}
A much larger decoding part (as expected for CISC).
\section{So... our new toy ?}
There are:
\begin{itemize}
\item 2 $\times$ powerful Nvidia Denver 2 cores @ 2.04 GHz
\item 4 $\times$ powerful ARM Cortex A57 (big) cores @ 2.0 GHz
\item Weird heterogeneous design
\item No energy-efficient cores
\end{itemize}
\subsection{Cortex A57 (2015)}
This is an 18-stage pipelined CPU with a 3-way decoder; it is 8-wide superscalar and an OoO CPU.
\subsection{Nvidia Denver 2 (2016)}
Not much info about it...
We have:
\begin{itemize}
\item 15-stage pipeline
\item 2-way decoder
\item 7-wide superscalar
\item in-order execution (the ARM code is translated to an internal format by a hardware translator, so it can be considered OoO, since that translator reorders!)
\end{itemize}
Side note: the project was to decode both ARM and x86 assembly on the same CPU, but Intel did not grant the license.
\section{Nvidia Jetson TX2 topology}
[See slides]
\chapter{Single-core CPU optimizations}
For the sake of readability, it is a good idea to rely on the compiler for optimizations.
Compilers provide optimizations with the '-O' options.
There is also '-march=native', which is sometimes the only way to enable code vectorization.
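For instance, a typical invocation (the file name is hypothetical):
\begin{verbatim}
# -O3 enables aggressive optimizations; -march=native targets
# the host CPU's instruction set, allowing vector instructions.
gcc -O3 -march=native -o stencil stencil.c
\end{verbatim}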
\section{Latency and throughput}
We have:
\begin{itemize}
\item add: 4-cycle latency, 0.5 CPI
\item sub: same
\item mul: same
\item div: 14-cycle latency, 4 CPI
\end{itemize}
For the first three, we can issue 2 instructions per cycle (CPI of 0.5)!
\subsection{Example: division}
Here we see that we should avoid divisions (and use the preprocessor to rewrite them as something cheaper).
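A sketch of the usual trick (not necessarily the one from the slides): when many values are divided by the same x, hoist one division out and multiply by the reciprocal:
\begin{verbatim}
/* Naive: one expensive (14-cycle) division per element. */
void scale_div(float *a, int n, float x) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] / x;
}

/* Hoisted: a single division, then cheap multiplications.
 * The result may differ in the last bits, since 1/x is
 * rounded once. */
void scale_mul(float *a, int n, float x) {
    float inv = 1.0f / x;
    for (int i = 0; i < n; i++)
        a[i] = a[i] * inv;
}
\end{verbatim}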
\section{Special functions}
Modern CPUs implement some common mathematical functions. These are often expensive, except for rsqrt ($1/\sqrt{x}$, used for 3D/2D calculations).
\section{Function calls}
Sometimes it is better to avoid calls, especially when the call sits in a hot spot, for example inside the inner loop of a stencil algorithm.
Inlining is a good idea.
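A minimal sketch (the stencil kernel is hypothetical): declaring the small helper static inline lets the compiler paste its body at each call site instead of paying call/return overhead in the hot loop:
\begin{verbatim}
/* Small helper the compiler can inline at each call site. */
static inline float avg3(float l, float c, float r) {
    return (l + c + r) / 3.0f;
}

/* 1D 3-point stencil: the helper is called n-2 times,
 * so removing the call overhead matters. */
void stencil(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = avg3(in[i - 1], in[i], in[i + 1]);
}
\end{verbatim}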
\section{Loop unrolling}
It reduces the time spent in loop control and, especially, the risk of branch prediction errors.
It also increases optimization opportunities, exposes more parallelism for ILP, and masks instruction latency.
You can force the compiler to unroll code with '\#pragma unroll 2'.
The only problem with unrolling is that the code gets much bigger. So, only unroll when it makes sense!
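A sketch of a 2-way unroll done by hand (note that the pragma spelling is compiler-specific: Clang accepts '\#pragma unroll 2', GCC uses '\#pragma GCC unroll 2'):
\begin{verbatim}
/* Manual 2-way unroll: half as many loop-control instructions.
 * Assumes n is even for simplicity. */
void axpy_unrolled(float *y, const float *x, float a, int n) {
    for (int i = 0; i < n; i += 2) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
}
\end{verbatim}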
\section{Unroll and jam}
It consists in unrolling and then interleaving the instructions of the unrolled iterations. It is especially useful for breaking data dependency chains.
However, it requires more registers.
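A sketch on a 2D loop nest (the kernel is hypothetical): the outer loop is unrolled by 2 and the two bodies are jammed into one inner loop, so two independent accumulation chains are in flight at once:
\begin{verbatim}
/* Plain version: one row at a time, one dependency chain. */
void rows_plain(float *s, float a[][512], int m) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < 512; j++)
            s[i] += a[i][j];
}

/* Unroll-and-jam: outer loop unrolled by 2, bodies interleaved.
 * s[i] and s[i+1] are independent chains, which hides the add
 * latency, but two partial sums must now live in registers.
 * Assumes m is even. */
void rows_jam(float *s, float a[][512], int m) {
    for (int i = 0; i < m; i += 2)
        for (int j = 0; j < 512; j++) {
            s[i]     += a[i][j];
            s[i + 1] += a[i + 1][j];
        }
}
\end{verbatim}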
\section{Variables rotation}
Here the compiler does not do a good job unless you use the 'restrict' keyword (which promises that the pointers do not alias).
\subsection{Reduction}
The compiler is able to do it if you help it (for example with several independent accumulators).
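A sketch of such help: splitting the sum into independent accumulators breaks the single dependency chain (at the price of a different rounding order for floats):
\begin{verbatim}
/* Single chain: each add waits on the previous one. */
float sum1(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two independent chains: the adds can overlap.
 * Assumes n is even; the float rounding order changes. */
float sum2(const float *x, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    return s0 + s1;
}
\end{verbatim}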
\section{Loop fusion}
This improves data reuse, taking advantage of temporal locality.
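A sketch (hypothetical loops): two passes over the same array can be fused so that each element is loaded once while it is still hot in cache:
\begin{verbatim}
/* Before fusion: a[] is traversed twice. */
void two_passes(const float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) b[i] = a[i] * 2.0f;
    for (int i = 0; i < n; i++) c[i] = a[i] + 1.0f;
}

/* After fusion: each a[i] is loaded once and reused. */
void fused(const float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] + 1.0f;
    }
}
\end{verbatim}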
\section{Loop splitting}
Useful when a loop uses more variables than there are registers available; it is the inverse of loop fusion.
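A sketch of that inverse transformation (hypothetical loops): when one body keeps too many values live at once, splitting it lowers the register pressure of each resulting loop:
\begin{verbatim}
/* One big body: all temporaries are live in the same loop. */
void big_body(float *b, float *c, const float *a, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] * a[i] + 3.0f;
    }
}

/* Split: each loop keeps fewer values live at once. */
void split_body(float *b, float *c, const float *a, int n) {
    for (int i = 0; i < n; i++) b[i] = a[i] * 2.0f;
    for (int i = 0; i < n; i++) c[i] = a[i] * a[i] + 3.0f;
}
\end{verbatim}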
\section{About conditional branching}
[...]
\section{Memory accesses}
To limit memory bandwidth overload, you can use prefetching: it consists in pre-loading data that is predicted to be useful later.
In all cases, consider the memory access pattern and take advantage of the cache!
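A sketch with the GCC/Clang builtin (the prefetch distance of 16 elements is an arbitrary guess to be tuned):
\begin{verbatim}
/* Software prefetch: ask the CPU to start loading x[i+16]
 * while we work on x[i]. Note that for simple strides the
 * hardware prefetcher often does this on its own. */
float sum_prefetch(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&x[i + 16]); /* hint, never faults */
        s += x[i];
    }
    return s;
}
\end{verbatim}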
\newpage
% Bibliography
\nocite{*}
\addcontentsline{toc}{chapter}{Bibliography}
\printbibliography
\end{document}