In the last post I explained how to install the nVidia CUDA Toolkit 10.1 on Ubuntu 18.04 LTS.

Details can be found on the following pages:


In this article I’ll explain the most important card details you need to know.

A prerequisite for this article (besides having the nVidia drivers and Toolkit 10.1 installed) is to have all the samples compiled.

All you need to do is to cd into the Samples directory and execute the following:

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro M2200"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4044 MBytes (4240179200 bytes)
  ( 8) Multiprocessors, (128) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1036 MHz (1.04 GHz)
  Memory Clock rate:                             2754 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

Here is a list, with short explanations, of the most important specs for CUDA tuning:


Detected 1 CUDA Capable device(s)

In this case I have only one CUDA device. In reality you might have 2, 3 or more in one machine.


Device 0: “Quadro M2200”

This is the name of my nVidia device. If I need to search for information about it, I need to know the full card name.

To get the full card name, you can execute the following:

user@hostname:~>lspci | grep -e VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GM206GLM [Quadro M2200 Mobile] (rev a1)

It’s also good to know the installed driver version for your card.

You can find that by executing the following:

user@hostname:~>nvidia-smi
Mon Mar 25 17:28:46 2019       
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Quadro M2200        Off  | 00000000:01:00.0  On |                  N/A |
| N/A   44C    P0    N/A /  N/A |   1631MiB /  4043MiB |      0%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      1833      G   /usr/libexec/Xorg                             89MiB |
|    0      2022      G   /usr/bin/gnome-shell                          44MiB |
|    0      2397      G   /usr/libexec/Xorg                            695MiB |
|    0      2522      G   /usr/bin/gnome-shell                         252MiB |
|    0      2754      G   ...are/jetbrains-toolbox/jetbrains-toolbox   401MiB |
|    0      2762      G   cairo-dock                                     9MiB |
|    0      3712      G   ...-token=5C40E868867C4C7F98ABDA339E4D154C   129MiB |

# or you can get a shorter summary by executing:
user@hostname:/usr/local/cuda-10.0/samples>cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.56  Fri Mar 15 12:59:26 CDT 2019


CUDA Driver Version / Runtime Version 10.1 / 10.1

This is important to know if you want to install GPU-related packages (such as packages from the nVidia RAPIDS project) that are not available in the default Anaconda repo, since such packages are usually built against a specific CUDA version.


CUDA Capability Major/Minor version number: 5.2

With this information you can open the CUDA C Programming Guide and search for “Table 14 Technical Specifications per Compute Capability”.

In my case, for a graphics card with CUDA Capability 5.2, the table shows that the maximum number of resident blocks per multiprocessor is 32.

This is the first limit you need to be aware of when tuning CUDA parameters.


Total amount of global memory: 4044 MBytes (4240179200 bytes)

This information determines the maximum size of the arrays or matrices that you can load into your graphics card.

You need to combine the memory size with the array data type and its precision (e.g. int8 or float64) to calculate how many elements fit on your graphics card, thus avoiding OOM (out of memory) errors in your application.
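
As a rough sketch of that calculation, the snippet below divides the global memory reported by deviceQuery by the element size of a few common data types. The function name and the 80% safety margin are my own assumptions for illustration (part of the memory is already taken by the desktop and other processes, as the nvidia-smi output above shows), so treat the numbers as an upper bound rather than a guarantee.

```python
# Element sizes in bytes for a few common data types.
ITEMSIZE = {"int8": 1, "int32": 4, "float32": 4, "float64": 8}

# "Total amount of global memory" reported by deviceQuery for the Quadro M2200.
TOTAL_GLOBAL_MEM = 4240179200


def max_elements(dtype, mem_bytes=TOTAL_GLOBAL_MEM, usable_percent=80):
    """Upper bound on how many elements of `dtype` fit into device memory.

    `usable_percent` is an assumed safety margin, not a value reported
    by the driver.
    """
    usable_bytes = mem_bytes * usable_percent // 100
    return usable_bytes // ITEMSIZE[dtype]


for name in ITEMSIZE:
    print(f"{name:8s} {max_elements(name):>14,d} elements")
```

Remember that a kernel usually needs room for its inputs and its outputs at the same time, so the practical limit per array is lower still.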


( 8) Multiprocessors, (128) CUDA Cores/MP: 1024 CUDA Cores

There are two important values you can observe here.

My graphics card has 8 multiprocessors, and each multiprocessor has 128 CUDA cores.

Multiplying them gives 8 × 128 = 1024, which is the total number of CUDA cores available.

Each streaming multiprocessor contains:

  • memory registers
  • memory caches
  • thread scheduler
  • CUDA cores


GPU Max Clock rate: 1036 MHz (1.04 GHz)

This is similar to a CPU: all else being equal, a faster clock rate means faster GPU processing.


Memory Clock rate: 2754 Mhz

Memory Bus Width: 128-bit

L2 Cache Size: 1048576 bytes

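The memory clock rate and bus width determine how fast the GPU can read and write its own device memory, and the L2 cache reduces how often it has to. Transfers from the main RAM to the GPU device, and back from the GPU to the CPU, go over the PCIe bus and are a separate cost on top of that.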
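
The memory clock rate and bus width listed above can be combined into a theoretical peak bandwidth for the GPU's own memory. The sketch below uses the usual formula (clock rate × bus width in bytes × data rate factor); the factor of 2 is an assumption that the memory is double data rate, and the function name is just for illustration. Real, measured bandwidth will be lower.

```python
MEMORY_CLOCK_MHZ = 2754  # "Memory Clock rate" from deviceQuery
BUS_WIDTH_BITS = 128     # "Memory Bus Width" from deviceQuery
DATA_RATE_FACTOR = 2     # assumption: double data rate memory


def peak_bandwidth_gb_s(clock_mhz, bus_bits, factor=DATA_RATE_FACTOR):
    """Theoretical peak device-memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return clock_mhz * 1e6 * bytes_per_transfer * factor / 1e9


bandwidth = peak_bandwidth_gb_s(MEMORY_CLOCK_MHZ, BUS_WIDTH_BITS)
print(f"Theoretical peak bandwidth: {bandwidth:.1f} GB/s")
```

For this card the result is roughly 88 GB/s. It describes traffic between the GPU and its own memory only, not copies between the host and the device.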

As I’ll explain in one of the future articles, data transfer can take 50% of the total elapsed time for GPU processing.


Maximum number of threads per multiprocessor: 2048

This is related to point 6 (Multiprocessors) and imposes another limit when tuning CUDA parameters (point 4, CUDA Capability, gives the first limit).

In this case I have 8 multiprocessors, and each multiprocessor can have up to 2048 resident threads.


Maximum number of threads per block: 1024

This is the third limit when tuning CUDA parameters: a single block can never contain more than 1024 threads.


Warp size: 32

A group of 32 threads is called a warp, and it represents the smallest unit that can be scheduled.

It means that the size of a thread block should be a multiple of 32 threads; otherwise the last warp of each block is only partly filled and its unused lanes are wasted.
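
To make the effect of the warp size concrete, here is a small sketch (the helper name is hypothetical) that counts how many warps a block occupies and how many lanes sit idle when the block size is not a multiple of 32.

```python
WARP_SIZE = 32  # "Warp size" from deviceQuery


def warp_usage(threads_per_block, warp_size=WARP_SIZE):
    """Return (warps occupied by one block, idle lanes in its last warp)."""
    warps = -(-threads_per_block // warp_size)  # ceiling division
    idle_lanes = warps * warp_size - threads_per_block
    return warps, idle_lanes


for tpb in (32, 48, 100, 128):
    warps, idle = warp_usage(tpb)
    print(f"{tpb:4d} threads -> {warps} warp(s), {idle} idle lane(s)")
```

A block of 48 threads, for example, still occupies two full warps, so a third of the scheduled lanes do no work.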

To summarize, we have 3 main limits:

  • from point 4 (CUDA Capability), I can have at most 32 resident blocks per multiprocessor
  • from point 6 (Multiprocessors), I have 8 multiprocessors and 1024 CUDA cores in total (128 per multiprocessor)
  • from point 9 (Maximum number of threads per multiprocessor), I can have at most 2048 threads per multiprocessor
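
The limits above interact, and a short sketch can show how. For a given number of threads per block, the snippet below estimates how many blocks and threads can be resident on one multiprocessor of my card. It is a simplification: it deliberately ignores register and shared memory usage, which can lower the numbers further, and the function name is only illustrative.

```python
MAX_BLOCKS_PER_SM = 32        # from the compute capability 5.2 table
MAX_THREADS_PER_SM = 2048     # from deviceQuery
MAX_THREADS_PER_BLOCK = 1024  # from deviceQuery
NUM_SMS = 8                   # from deviceQuery


def resident_per_sm(threads_per_block):
    """Return (resident blocks, resident threads) for one multiprocessor.

    Only the block and thread limits are considered; registers and
    shared memory are ignored in this sketch.
    """
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)
    return blocks, blocks * threads_per_block


for tpb in (32, 64, 256, 768, 1024):
    blocks, threads = resident_per_sm(tpb)
    share = 100 * threads // MAX_THREADS_PER_SM
    print(f"{tpb:5d} threads/block: {blocks:2d} blocks, "
          f"{threads:4d} threads per SM ({share}%), "
          f"{threads * NUM_SMS} threads on the whole card")
```

Under these assumptions, 32 threads per block hits the 32-block limit first and fills only half of the 2048 thread slots, while 64 threads per block is the smallest size that can fill them all.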

To get the optimal performance:

  • try to use all streaming multiprocessors
  • try to keep all cores in each streaming multiprocessor busy
  • optimize the use of shared memory and registers
  • optimize device memory usage
  • optimize device memory access

Threads are grouped into blocks, which are grouped into a grid.

A grid executes only one kernel function (the code that will be executed on the GPU), and all blocks in a grid have the same dimensions.
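
The grid, block and thread hierarchy can be imitated on the CPU to see how each thread finds its own element. The sketch below is plain Python, not CUDA: the two loops stand in for the blocks and threads that the GPU would run in parallel, and the index expression mirrors the usual one-dimensional CUDA pattern (block index × block size + thread index). All names are illustrative.

```python
def simulate_launch(n, blocks, threads_per_block):
    """Count how many times each of n elements is touched by a 1D launch."""
    hits = [0] * n
    for block_idx in range(blocks):                # blocks in the grid
        for thread_idx in range(threads_per_block):  # threads in a block
            i = block_idx * threads_per_block + thread_idx  # global index
            if i < n:  # guard: the last block may reach past the array end
                hits[i] += 1
    return hits


hits = simulate_launch(n=1000, blocks=16, threads_per_block=64)
print("every element handled exactly once:", all(h == 1 for h in hits))
```

With 16 blocks of 64 threads there are 1024 threads for 1000 elements, so the bounds check is what keeps the last 24 threads from writing outside the array.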

This is a very brief explanation of the CUDA architecture; for a full explanation of how it works, you are better off with one of the books that are available.

The following example might help to describe what you need to know about CUDA as a developer.

Let’s assume I have 1 million integers in an array.

When you call a kernel function, you only need to find the optimal number of threads per block.

You can start with warp size (smallest unit that can be scheduled) which is in my case 32, and then you can perform several tests with 64, 128, 256, 512 and 1024 (maximum value) threads per block.

Common values for the number of threads per block are somewhere between 32 and 512 (depending on the GPU model, data types, array/matrix characteristics…).

Knowing the number of threads per block and the array size, you can calculate the number of blocks.

block_cnt = int(math.ceil(float(N) / thread_cnt))

N is the number of elements in the array/matrix, while thread_cnt is the number of threads per block chosen in the previous step (e.g. 64).
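
As a worked example for the 1 million element array, the sketch below applies the same ceiling division to each candidate thread count. It uses integer arithmetic instead of math.ceil, which gives the same result; the function name is my own.

```python
N = 1_000_000  # number of elements in the array


def block_count(n, thread_cnt):
    """Number of blocks needed so that blocks * thread_cnt >= n."""
    return (n + thread_cnt - 1) // thread_cnt  # integer ceiling division


for thread_cnt in (32, 64, 128, 256, 512, 1024):
    block_cnt = block_count(N, thread_cnt)
    spare = block_cnt * thread_cnt - N  # threads with no element to process
    print(f"{thread_cnt:5d} threads/block -> {block_cnt:6d} blocks "
          f"({spare} spare threads)")
```

Whenever N is not an exact multiple of thread_cnt, the last block has spare threads, so the kernel has to check that its index is below N before touching the array.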


To get the best results, stay within all of the limits above and launch enough threads and blocks to keep all CUDA cores busy.

In the next post I’ll compare CUDA performance with the execution of optimized Numba Python code.
