In the last post I explained how to install the NVIDIA CUDA Toolkit 10.1 on Ubuntu 18.04 LTS.
Details can be found on the following pages:
https://www.josip-pojatina.com/en/how-to-install-cuda-toolkit-on-ubuntu-18-04-lts/
or
https://www.performatune.com/en/how-to-install-cuda-toolkit-on-ubuntu-18-04-lts/
In this article I’ll explain the most important card details you need to know.
The prerequisite for this article (besides having the NVIDIA drivers and CUDA Toolkit 10.1 installed) is to have all samples compiled.
All you need to do is cd into the directory with the compiled sample binaries and run deviceQuery:
user@hostname:~/Downloads/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/> ./deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro M2200"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 4044 MBytes (4240179200 bytes)
( 8) Multiprocessors, (128) CUDA Cores/MP: 1024 CUDA Cores
GPU Max Clock rate: 1036 MHz (1.04 GHz)
Memory Clock rate: 2754 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
Here is a list, with short explanations, of the most important specs needed for CUDA tuning:
1.
Detected 1 CUDA Capable device(s)
In this case I have only one CUDA device. In reality you might have 2, 3 or more in one machine.
2.
Device 0: “Quadro M2200”
This is my NVIDIA device name. If I need to search for information about the card, I need to know its full name.
To get the full card name, you can execute the following:
user@hostname:~>lspci | grep -e VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GM206GLM [Quadro M2200 Mobile] (rev a1)
It’s also good to know the installed driver version of your card.
You can find that by executing the following:
jp@performatune.com:~>nvidia-smi
Mon Mar 25 17:28:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43 Driver Version: 418.43 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro M2200 Off | 00000000:01:00.0 On | N/A |
| N/A 44C P0 N/A / N/A | 1631MiB / 4043MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1833 G /usr/libexec/Xorg 89MiB |
| 0 2022 G /usr/bin/gnome-shell 44MiB |
| 0 2397 G /usr/libexec/Xorg 695MiB |
| 0 2522 G /usr/bin/gnome-shell 252MiB |
| 0 2754 G ...are/jetbrains-toolbox/jetbrains-toolbox 401MiB |
| 0 2762 G cairo-dock 9MiB |
| 0 3712 G ...-token=5C40E868867C4C7F98ABDA339E4D154C 129MiB |
+-----------------------------------------------------------------------------+
Or you can get fewer details by executing:
user@hostname:/usr/local/cuda-10.0/samples>cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.56 Fri Mar 15 12:59:26 CDT 2019
3.
CUDA Driver Version / Runtime Version 10.1 / 10.1
This is important to know in case you want to install GPU-related packages (such as packages from the NVIDIA RAPIDS project) that are not available in the default Anaconda repo.
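You can also verify the installed toolkit version from the command line with the nvcc compiler, which prints the CUDA release it belongs to:
user@hostname:~> nvcc --version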
4.
CUDA Capability Major/Minor version number: 5.2
With this information you can open the following link:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
and search for “Table 14 Technical Specifications per Compute Capability”.
In my case I can find that for my graphics card, which has CUDA Capability 5.2, the maximum number of resident blocks per multiprocessor is 32.
This is the first limitation you need to be aware of when tuning CUDA parameters.
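If you prefer to read this programmatically, here is a minimal Python sketch (assuming Numba with CUDA support is installed):
from numba import cuda

dev = cuda.get_current_device()
print(dev.name)                  # device name, e.g. the Quadro M2200 above
print(dev.compute_capability)    # e.g. (5, 2)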
5.
Total amount of global memory: 4044 MBytes (4240179200 bytes)
This information plays a role in properly sizing the largest arrays or matrices you want to load onto your graphics card.
You need to combine the memory size with the array data type (byte/int/float) and its precision (e.g. int8 or float64) to calculate your graphics card's memory limits, thus avoiding OOM (out of memory) errors in your application.
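As a back-of-the-envelope check, here is a minimal Python sketch (the memory size is taken from the deviceQuery output above; in practice leave headroom for the CUDA context and other processes using the card):
import numpy as np

GLOBAL_MEM_BYTES = 4_240_179_200          # total global memory from deviceQuery

def device_bytes_needed(n_elements, dtype):
    # Raw size of an array of n_elements of the given dtype.
    return n_elements * np.dtype(dtype).itemsize

print(device_bytes_needed(100_000_000, np.float64))    # 800 MB -> fits
print(device_bytes_needed(1_000_000_000, np.float64))  # 8 GB -> does not fit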
6.
( 8) Multiprocessors, (128) CUDA Cores/MP: 1024 CUDA Cores
There are two important pieces of data you can observe here.
My graphics card has 8 multiprocessors, and each multiprocessor has 128 CUDA cores.
Multiplying 8 × 128 gives 1024, which is the total number of CUDA cores available (see the sketch after the list below).
Each streaming multiprocessor contains:
- memory registers
- memory caches
- thread scheduler
- CUDA cores
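The multiprocessor count can also be read programmatically; a minimal sketch with Numba (the cores-per-SM value is hard-coded because it is fixed per architecture, 128 for Maxwell compute capability 5.x, and is not directly queryable):
from numba import cuda

dev = cuda.get_current_device()
cores_per_sm = 128                              # Maxwell (compute capability 5.x)
print(dev.MULTIPROCESSOR_COUNT)                 # 8 on this card
print(dev.MULTIPROCESSOR_COUNT * cores_per_sm)  # 1024 CUDA cores in total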
7.
GPU Max Clock rate: 1036 MHz (1.04 GHz)
This is similar to the CPU clock: a faster clock rate means faster GPU processing.
8.
Memory Clock rate: 2754 MHz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Memory parameters are important for the speed of data transfer from main RAM to the GPU device, and back from the GPU to the CPU.
As I'll explain in a future article, data transfer can take 50% of the total elapsed time of GPU processing.
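To see this on your own card, here is a minimal timing sketch with Numba (the array size is an arbitrary choice for illustration):
import time
import numpy as np
from numba import cuda

a = np.random.rand(10_000_000)            # ~80 MB of float64

start = time.perf_counter()
d_a = cuda.to_device(a)                   # host -> device copy
cuda.synchronize()
print(f"H2D: {time.perf_counter() - start:.4f} s")

start = time.perf_counter()
d_a.copy_to_host()                        # device -> host copy
cuda.synchronize()
print(f"D2H: {time.perf_counter() - start:.4f} s")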
9.
Maximum number of threads per multiprocessor: 2048
This is related to point 6 and imposes another limitation when tuning CUDA parameters (point 4 being the first limit).
In this case I have 8 multiprocessors, and each multiprocessor can have up to 2048 resident threads.
10.
Maximum number of threads per block: 1024
This is the third limit to keep in mind when tuning CUDA parameters.
11.
Warp size: 32
A group of 32 threads (in this case) is called a warp and represents the smallest unit that can be scheduled.
This means the size of a thread block should always be a multiple of 32 threads.
To summarize, we have 3 main limits (see the worked arithmetic after this list):
- from point 4 I can have at most 32 blocks per multiprocessor
- from point 6 I have 8 multiprocessors and 1024 CUDA cores in total (128 per multiprocessor)
- from point 9 I can have at most 2048 threads per multiprocessor
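Here is how those limits combine for one candidate block size (a sketch that ignores register and shared-memory pressure, which can lower occupancy further):
max_threads_per_sm = 2048
max_blocks_per_sm = 32
threads_per_block = 256                    # one candidate value

resident_blocks = min(max_blocks_per_sm, max_threads_per_sm // threads_per_block)
print(resident_blocks)                     # 8 blocks of 256 threads fit on one SM
print(resident_blocks * threads_per_block) # 2048 resident threads -> the SM is full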
To get optimal performance:
- try to use all streaming multiprocessors
- try to keep all cores within each streaming multiprocessor busy
- optimize shared memory and register usage
- optimize device memory usage
- optimize device memory access patterns
Threads are grouped into blocks, which are grouped into a grid.
A grid executes a single kernel function (the code that runs on the GPU), and all blocks within a grid have the same dimensions.
This is a very brief overview of the CUDA architecture; for a full explanation of how it works, you are better off with one of the dedicated books available.
The following example might help to describe what you need to know about CUDA as a developer.
Let's assume I have 1 million integers in an array.
When you call a kernel function, the main thing to find is the optimal number of threads per block.
You can start with the warp size (the smallest unit that can be scheduled), which in my case is 32, and then perform several tests with 64, 128, 256, 512 and 1024 (the maximum value) threads per block.
Common values for the number of threads per block are somewhere between 32 and 512 (depending on the GPU model, data types, array/matrix characteristics and so on).
Knowing the number of threads per block and the array size, you can calculate the number of blocks:
import math

block_cnt = int(math.ceil(float(N) / thread_cnt))
N is the number of elements in the array/matrix, while thread_cnt is the number of threads per block chosen in the previous step (e.g. 64).
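Putting it together, here is a minimal sketch of a kernel launch with Numba over 1 million integers (the kernel body, doubling each element, is just a placeholder for real work):
import math
import numpy as np
from numba import cuda

@cuda.jit
def double_elements(arr):
    i = cuda.grid(1)                       # global thread index
    if i < arr.size:                       # guard: the grid may overshoot N
        arr[i] *= 2

N = 1_000_000
a = np.arange(N, dtype=np.int32)

thread_cnt = 256                           # threads per block, to be tuned
block_cnt = int(math.ceil(float(N) / thread_cnt))

d_a = cuda.to_device(a)                    # copy input to the device
double_elements[block_cnt, thread_cnt](d_a)
result = d_a.copy_to_host()                # copy the result back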
Summary:
To get the best results, stay within all of the imposed limits and have enough threads and blocks to keep all CUDA cores busy.
In the next post I'll take a look at CUDA performance versus optimized Numba Python code execution.