GPU Computing

Programmazione GPU: CUDA vs ROCm vs Vulkan Compute — Warp Scheduling, Shared Memory e Moltiplicazione di Matrici

March 20268 min lettura

Lucio Durán

Engineering Manager & AI Solutions Architect

Disponibile anche in: English, Español

Il Modello di Esecuzione: Warp, Wavefront e Subgroup

Partiamo da cosa succede realmente quando il tuo codice colpisce la GPU.

NVIDIA (CUDA): Warp da 32

L'unità di esecuzione fondamentale su NVIDIA (CUDA) è il warp — 32 thread che eseguono la stessa istruzione in lockstep (SIMT: Single Instruction, Multiple Threads).

// Sembra innocuo ma massacra le prestazioni
__global__ void bad_branch(float* data, int n) {
 int tid = threadIdx.x + blockIdx.x * blockDim.x;
 if (tid < n) {
 if (data[tid] > 0.0f) {
 data[tid] = sqrtf(data[tid]); // Percorso A
 } else {
 data[tid] = -sqrtf(-data[tid]); // Percorso B
 }
 }
}
// Se i thread 0-15 prendono il Percorso A e i thread 16-31 il Percorso B,
// il warp esegue ENTRAMBI i percorsi sequenzialmente. 50% utilizzo.

AMD (ROCm/HIP): Wavefront da 64

L'equivalente AMD è il wavefront — 64 thread sulle architetture CDNA/RDNA. Questa non è solo una differenza di nome; cambia fondamentalmente come pensi all'occupancy e alle primitive a livello warp.

__global__ void hip_reduce(float* data, float* result, int n) {
 __shared__ float sdata[256];
 int tid = threadIdx.x;

 sdata[tid] = (tid < n) ? data[tid] : 0.0f;
 __syncthreads();

 // PERICOLO: questa riduzione assume warp size di 32
 for (int s = blockDim.x / 2; s > 32; s >>= 1) {
 if (tid < s) sdata[tid] += sdata[tid + s];
 __syncthreads();
 }

 // Questo unrolling a livello warp è SBAGLIATO su AMD
 if (tid < 32) { // Dovrebbe essere < 64 su AMD!
 volatile float* vsmem = sdata;
 vsmem[tid] += vsmem[tid + 32];
 vsmem[tid] += vsmem[tid + 16];
 vsmem[tid] += vsmem[tid + 8];
 vsmem[tid] += vsmem[tid + 4];
 vsmem[tid] += vsmem[tid + 2];
 vsmem[tid] += vsmem[tid + 1];
 }
}

Questo bug esatto emerge quando si porta una codebase CUDA a ROCm per cluster MI250X. hipify-perl traduce la sintassi perfettamente ma non segnala l'assunzione sulla dimensione del warp. I risultati possono essere sbagliati del ~3% — abbastanza vicini da passare i test unitari con tolleranze larghe ma completamente sbagliati per simulazioni fisiche.

Vulkan Compute: Subgroup

Vulkan non specifica una dimensione del subgroup. Può essere 4, 8, 16, 32, 64 o anche 128:

#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

layout(local_size_x = 256) in;
layout(set = 0, binding = 0) buffer Data { float data[]; };
layout(set = 0, binding = 1) buffer Result { float result; };

void main() {
 uint tid = gl_LocalInvocationID.x;
 float val = data[tid];
 float subgroup_sum = subgroupAdd(val);

 if (subgroupElect()) {
 atomicAdd(result, subgroup_sum);
 }
}

Shared Memory: La Risorsa Decisiva

La shared memory è la SRAM veloce on-chip che rende pratica la computazione GPU. Su GPU NVIDIA moderne, è organizzata in 32 bank, ciascuno largo 4 byte.

Bank Conflict: L'Assassino Silenzioso

// Matrice in shared memory: 32 x 32 float
__shared__ float tile[32][32];

// Thread (tx, ty) legge per colonna:
// Thread 0 legge tile[0][0] -> bank 0
// Thread 1 legge tile[1][0] -> bank 0 (!!!)
// Tutti i 32 thread colpiscono bank 0 -> conflitto a 32 vie!

// Fix: padding di ogni riga con 1 elemento
__shared__ float tile[32][33]; // <-- il numero magico

// Adesso:
// Thread 0 legge tile[0][0] -> bank 0
// Thread 1 legge tile[1][0] -> bank 1 (offset di 33 % 32 = 1)
// Tutti i 32 thread colpiscono bank diversi -> nessun conflitto!

Quel [32][33] invece di [32][32] è il tipo di cosa che trasforma un kernel al 40% del picco in uno all'80%. Un elemento extra per riga. Questo singolo cambiamento può dare speedup di 2x su kernel memory-bound.

Moltiplicazione di Matrici: SGEMM Ottimizzato

Il triplo loop naif risulta insufficiente per l'uso in produzione. Il seguente kernel SGEMM a tile è stato ottimizzato iterativamente. Questa versione raggiunge ~75% delle prestazioni di cuBLAS su una A100:

#define BM 128
#define BN 128
#define BK 8
#define TM 8
#define TN 8

__global__ void sgemm_optimized(
 const float* __restrict__ A,
 const float* __restrict__ B,
 float* __restrict__ C,
 int M, int N, int K
) {
 const int bx = blockIdx.x;
 const int by = blockIdx.y;
 const int tx = threadIdx.x;
 const int ty = threadIdx.y;

 __shared__ float As[BK][BM + 1];
 __shared__ float Bs[BK][BN + 1];

 float accum[TM][TN] = {0.0f};
 float a_reg[TM];
 float b_reg[TN];

 const int row_base = by * BM;
 const int col_base = bx * BN;

 for (int k = 0; k < K; k += BK) {
 #pragma unroll
 for (int i = 0; i < BM; i += blockDim.y) {
 int row = row_base + ty + i;
 int col = k + tx;
 As[tx][ty + i] = (row < M && col < K) ? A[row * K + col] : 0.0f;
 }

 #pragma unroll
 for (int i = 0; i < BN; i += blockDim.y) {
 int row = k + ty;
 int col = col_base + tx + i;
 Bs[ty][tx + i] = (row < K && col < N) ? B[row * N + col] : 0.0f;
 }

 __syncthreads();

 #pragma unroll
 for (int kk = 0; kk < BK; kk++) {
 #pragma unroll
 for (int m = 0; m < TM; m++)
 a_reg[m] = As[kk][ty * TM + m];
 #pragma unroll
 for (int n = 0; n < TN; n++)
 b_reg[n] = Bs[kk][tx * TN + n];
 #pragma unroll
 for (int m = 0; m < TM; m++)
 #pragma unroll
 for (int n = 0; n < TN; n++)
 accum[m][n] += a_reg[m] * b_reg[n];
 }

 __syncthreads();
 }

 #pragma unroll
 for (int m = 0; m < TM; m++)
 #pragma unroll
 for (int n = 0; n < TN; n++) {
 int row = row_base + ty * TM + m;
 int col = col_base + tx * TN + n;
 if (row < M && col < N)
 C[row * N + col] = accum[m][n];
 }
}

Decisioni chiave di ottimizzazione: store trasposto in shared memory per A, register tiling (TM x TN) con ogni thread che computa un sub-tile 8x8, puntatori __restrict__, e #pragma unroll critico per gli inner loop.

Benchmark: Numeri Reali su Hardware Reale

I seguenti benchmark mostrano SGEMM 4096x4096 su tre setup:

Implementazione	Hardware	TFLOPS	% del Picco
cuBLAS	A100 80GB	17,8	91%
Kernel CUDA custom	A100 80GB	14,7	75%
rocBLAS	MI250X	15,2	88%
Kernel HIP custom	MI250X	11,9	69%
Vulkan Compute (GLSL)	RTX 4090	9,3	56%
Vulkan Compute (GLSL)	RX 7900 XTX	7,1	52%

Le librerie del vendor fanno trucchi a livello assembly che è impraticabile replicare in codice alto livello.

Profiling: Distribuzione del Tempo di Esecuzione

L'abilità più preziosa nella programmazione GPU è leggere l'output di un profiler:

Kernel: sgemm_optimized
Duration: 2.31 ms
Theoretical Occupancy: 75%
Achieved Occupancy: 71.2%
SM Throughput: 87.3%
Memory Throughput: 62.1% <-- Non è memory bound
Compute Throughput: 87.3% <-- È compute bound

Shared Memory Bank Conflicts: 0 <-- Il padding ha funzionato

Warp Stall Reasons:
 stall_mio_throttle: 23.1% <-- Bandwidth shared memory
 stall_math_pipe_throttle: 18.7% <-- Bene! L'ALU è saturata
 stall_short_scoreboard: 15.2%
 stall_not_selected: 12.4%

Quel stall_mio_throttle al 23.1% indica che la bandwidth della shared memory è il prossimo collo di bottiglia. Il fix sarebbe aumentare la dimensione del register tile (TM/TN) per fare più computazione per ogni load dalla shared memory, al costo di maggiore pressione sui registri e potenzialmente minore occupancy.

Questo è il gioco. Ogni ottimizzazione crea un nuovo collo di bottiglia. Hai finito quando il profiler ti mostra che sei limitato da qualcosa di fondamentale.

Framework Decisionale

Una valutazione onesta dei tre ecosistemi dopo uso estensivo:

CUDA ha il miglior tooling di gran lunga. Nsight Compute, Nsight Systems — la storia di debugging e profiling è imbattibile. Se sei su hardware NVIDIA, non c'è ragione di usare altro per il compute.

ROCm è migliorato drasticamente ma si sente ancora 2 anni indietro rispetto a CUDA nel tooling. Il linguaggio HIP è 98% source-compatible con CUDA, il che rende il porting diretto.

Vulkan Compute è la scelta giusta quando hai bisogno di una codebase che gira su Intel, NVIDIA, AMD e anche GPU mobile. Il trade-off è meno controllo sulle specificità dell'hardware e un'API più verbosa.

Scegli in base ai tuoi vincoli, non al tribalismo. E per favore, profila prima di ottimizzare.

Tips

Tip: Profila con ncu --set full (Nsight Compute) prima di toccare il codice del tuo kernel. Capita spesso di vedere team passare giorni a 'ottimizzare' un kernel che era memory-bound, aggiungendo istruzioni aritmetiche che lo scheduler nascondeva dietro la latenza di memoria comunque. Prima conosci il tuo collo di bottiglia.

Tip: Su GPU AMD, la dimensione del wavefront è 64 (non 32 come i warp CUDA). Se porti codice CUDA a ROCm con hipify-perl, audita ogni __shfl_sync e primitiva a livello warp manualmente — la traduzione automatica azzecca la sintassi ma sbaglia la semantica per algoritmi che dipendono dalla dimensione del warp.

Tip: Per Vulkan Compute, imposta sempre VK_KHR_shader_non_semantic_info nelle estensioni del device e compila GLSL con il flag -g. Senza, RenderDoc mostra il tuo compute shader come un blob opaco e il debugging diventa tirare a indovinare.

cudarocmvulkan-computeprogrammazione-gpumoltiplicazione-matriciwarp-schedulingshared-memory

Strumenti menzionati in questo articolo

AWSProva AWS

ReplicateProva Replicate

Divulgazione: Alcuni link in questo articolo sono link di affiliazione. Se ti registri tramite questi, potrei guadagnare una commissione senza costi aggiuntivi per te. Raccomando solo strumenti che uso e di cui mi fido personalmente.

X / Twitter LinkedIn WhatsApp

Seguime

Torna al Blog