GPU, CUDA, PyTorch, Performance

20x GPU Speedup on Multimedia Indexing: Cache Locality, Batch Shape, and Where I Stopped

The problem

The Multimedia File Indexer was the winning entry for Smart India Hackathon 2022 — a Government of India problem statement from MP Police who needed to index seized digital evidence (documents, images with OCR-extracted text, audio transcripts) fast enough to be useful in a live investigation. It was later adopted into the Samsung PRISM 2023 program, where the target got tighter: 1,000+ files, TF-IDF + downstream NLP features, indexed in under two seconds on a single consumer GPU.

The CPU baseline for our workload was ~30 seconds for 1,000 files. That number is the whole post. Everything else is a consequence of trying to close the gap between it and the two-second target.


The naive first port

The first GPU port was the version anyone would write in an afternoon. Tokenize on CPU, stack the token-ID tensors into one big batch, .cuda(), run the TF-IDF math and the transformer feature extractor, .cpu(), write the index. On our workload this got us to about 8 seconds — a 3.75x over the CPU baseline.

That is the version people quote as a "GPU speedup" before they stop. I got suspicious of it almost immediately, because the GPU utilization graph looked like the heart monitor of a mostly-dead patient: spikes to 90% for 40 ms at a time, flat green in between. The kernel was fine. Everything around the kernel was the problem.


The 6x that came from batch shape, not batch size

The reflex when a GPU is underutilized is to increase batch size. I did that first, and it helped a little, and then it stopped helping, and then it started hurting. The real win came from reshaping the batch tensor.

Our tokenized inputs were [num_files, max_seq_len, embedding_dim], the shape a Hugging Face example would hand you. max_seq_len in our corpus was long-tailed: a handful of transcript-heavy files dragged the padded length up to ~4,096 tokens, while the median file was under 400, so more than 90% of a median file's row was padding. The batch tensor was mostly zeros, and worse, the mostly-zero rows were laid out so that the inner dimension a warp reads on each iteration crossed cache lines it didn't need to touch.

The fix was to sort files by sequence length, bucket into fixed-length groups (256, 512, 1024, 2048, 4096), and reshape each bucket so the contiguous dimension in memory was the one the kernel walked hottest. Same GPU, same total FLOPs, same batch size in aggregate. Different memory access pattern.
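The bucketing step can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the function names and the toy length distribution are made up, it only counts padded tokens rather than building tensors, and clamping over-length files into the last bucket is an assumption the post does not confirm.

```python
from bisect import bisect_left
from collections import defaultdict

# Bucket boundaries from the text. Files longer than the last boundary are
# clamped into it here (an assumption; the post does not say how they were handled).
BUCKETS = (256, 512, 1024, 2048, 4096)


def bucket_by_length(lengths, buckets=BUCKETS):
    """Map each bucket size to the file indices whose token count fits in it."""
    groups = defaultdict(list)
    # Sort by length first so each bucket's members come out in length order.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        pos = min(bisect_left(buckets, lengths[idx]), len(buckets) - 1)
        groups[buckets[pos]].append(idx)
    return dict(groups)


def padded_tokens(lengths, buckets=BUCKETS):
    """Return (naive, bucketed) token counts after padding."""
    naive = len(lengths) * max(lengths)  # one batch padded to the longest file
    groups = bucket_by_length(lengths, buckets)
    bucketed = sum(size * len(members) for size, members in groups.items())
    return naive, bucketed


# Hypothetical long-tailed corpus: mostly short files, a few long transcripts.
lengths = [300] * 900 + [900] * 90 + [4096] * 10
naive, bucketed = padded_tokens(lengths)
print(f"padded tokens: naive={naive}, bucketed={bucketed}")
```

For this made-up distribution the padded token count drops from 4,096,000 to 593,920. That ratio describes padding only and is not the measured speedup; the real figure depends on the corpus.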

The kernel now touched L1 and L2 the way the hardware was built to serve. Throughput went up 6x on this change alone.

There's a lesson in there that I keep re-learning: on a GPU, "batch size" is a proxy for "am I feeding the SMs enough parallel work," but the actual bottleneck is almost never arithmetic — it's memory. Batch shape is where the memory access pattern lives.
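To make "memory access pattern" concrete, here is a small back-of-the-envelope model. It counts how many cache lines one warp touches when each thread reads a single element at a given stride. The 32-thread warp, 4-byte elements, 128-byte lines, and aligned base address are assumptions chosen for illustration, and the function name is hypothetical; real hardware behavior has more moving parts.

```python
def cache_lines_touched(stride_elems, warp_size=32, elem_bytes=4, line_bytes=128):
    """Count distinct cache lines hit when each thread in a warp reads one element.

    Thread t reads the element at index t * stride_elems, starting from an
    aligned base address of 0.
    """
    lines = {(t * stride_elems * elem_bytes) // line_bytes for t in range(warp_size)}
    return len(lines)


for stride in (1, 2, 8, 32, 4096):
    print(f"stride={stride:5d} -> {cache_lines_touched(stride)} cache line(s)")
```

Under these assumptions a unit-stride read is served by one line, while any stride of 32 elements or more costs one line per thread for the same 32 useful values. Making the hot dimension contiguous is what moves a kernel from the second case toward the first.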


torch.compile, and the win I didn't expect

I turned on torch.compile next expecting kernel fusion to be the story. It wasn't.

# The hot loop of the indexer. compute_tfidf_features is called once per bucket.
# The @torch.compile decorator picked up the bucket loop after a warmup pass.
import torch

# token_frequency is a helper defined elsewhere in the indexer (not shown here).

@torch.compile(mode="reduce-overhead", fullgraph=True)
def compute_tfidf_features(token_ids: torch.Tensor,
                           attention_mask: torch.Tensor,
                           idf_weights: torch.Tensor) -> torch.Tensor:
    # fusion win: term_freq -> tfidf -> normalize collapsed into one kernel.
    # the bigger win: Python overhead on the per-bucket call went from ~200us
    # to ~20us because the whole thing became a single CUDA graph replay.
    tf = token_frequency(token_ids, attention_mask)   # [B, V]
    tfidf = tf * idf_weights                          # [B, V]
    return torch.nn.functional.normalize(tfidf, dim=-1)

Kernel fusion did help — the term_freq → multiply → normalize chain collapsed from three kernel launches into one, and that shows up on the timeline. But the surprise was that the biggest chunk of wall-clock savings came from mode="reduce-overhead" cutting the Python-side dispatch cost on the hot loop. Per-bucket call time dropped from ~200μs to ~20μs. When you're calling that function a few thousand times per index run, an order of magnitude off the per-call overhead is a bigger win than the kernel-level fusion the marketing pages talk about.
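The arithmetic behind that claim is simple enough to write down. The sketch below uses the ~200μs and ~20μs per-call figures from the text; the call counts are placeholders, since the post only gives a rough figure, and the function name is illustrative.

```python
def dispatch_savings_s(n_calls, before_us=200.0, after_us=20.0):
    """Wall-clock seconds saved by cutting per-call overhead, ignoring kernel time."""
    return n_calls * (before_us - after_us) / 1e6


# Hypothetical call counts; substitute the number of hot-loop calls in a real run.
for n_calls in (1_000, 3_000):
    print(f"{n_calls} calls -> {dispatch_savings_s(n_calls):.2f} s saved")
```

With a total budget of two seconds, savings of a few tenths of a second from overhead alone are a large share of the run, which is why the per-call cost matters more here than it would in a workload with a few long kernels.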

torch.compile also silently regressed a small path. Our tokenizer output had a dynamic vocabulary reduction pass — filtering out tokens with document frequency below a threshold — and the shape of the resulting tensor depended on the corpus, not the model. That triggered a recompile on every run. I moved that step out of the compiled region and pinned it to eager mode. The tell was TORCH_LOGS=recompiles — worth turning on before you trust any torch.compile speedup number.
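A toy version of the vocabulary reduction pass shows why its output shape is data-dependent. This is a plain-Python sketch with a hypothetical function name and made-up corpora, not the indexer's implementation.

```python
from collections import Counter


def reduce_vocabulary(docs, min_df):
    """Keep token IDs that appear in at least `min_df` documents."""
    # set(doc) so a token repeated inside one document counts once.
    df = Counter(tok for doc in docs for tok in set(doc))
    return sorted(tok for tok, count in df.items() if count >= min_df)


# Two made-up corpora with the same number of documents.
corpus_a = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
corpus_b = [[1, 2], [1, 2], [1, 9]]

print(reduce_vocabulary(corpus_a, min_df=2))  # [2, 3, 4]
print(reduce_vocabulary(corpus_b, min_df=2))  # [1, 2]
```

Same threshold, same document count, different output length. Any tensor sized by that result changes shape from corpus to corpus, so it is a natural candidate to keep outside the compiled region.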


Pinned memory + async H2D killed the last stall

The CPU-to-GPU copy was the last visible flat line on the utilization graph. Two changes fixed it:

  1. Allocate the CPU-side batch tensors in pinned memory with torch.empty(..., pin_memory=True). Pinned pages can't be paged out, so the DMA engine can copy from them directly instead of going through an intermediate staging buffer first.
  2. Kick the H2D copy on a separate CUDA stream with non_blocking=True, so the copy for bucket N+1 overlaps with the compute for bucket N.

This is a classic double-buffering pattern. The reason it wasn't in the first port is the same reason it's never in the first port: it doesn't matter until it does. Once the compute got fast enough (from batch shape + torch.compile), the copy stall became a visible fraction of the wall-clock, and only then was it worth the code complexity of managing two streams and a producer-consumer buffer.
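A minimal sketch of the two-stream pattern described above, assuming PyTorch with CUDA. The names run_buckets and compute are hypothetical, and this is one way to structure the overlap rather than the indexer's actual code. It falls back to a plain loop when no GPU is present, so the overlap only happens on CUDA.

```python
import torch


def run_buckets(cpu_batches, compute):
    """Overlap the H2D copy of bucket N+1 with the compute of bucket N."""
    if not torch.cuda.is_available():
        # CPU fallback so the sketch runs anywhere; there is nothing to overlap.
        return [compute(b) for b in cpu_batches]

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    def start_copy(batch):
        pinned = batch.pin_memory()  # page-locked host copy of the batch
        with torch.cuda.stream(copy_stream):
            gpu = pinned.to(device, non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)  # marks the point where this copy finishes
        return pinned, gpu, done

    results = []
    pending = start_copy(cpu_batches[0]) if cpu_batches else None
    for i in range(len(cpu_batches)):
        pinned, gpu, done = pending
        # Queue the next copy before computing on the current bucket.
        has_next = i + 1 < len(cpu_batches)
        pending = start_copy(cpu_batches[i + 1]) if has_next else None
        compute_stream.wait_event(done)    # compute waits only for its own copy
        gpu.record_stream(compute_stream)  # tensor was allocated on copy_stream
        results.append(compute(gpu).cpu())
    return results


# Toy usage: three small batches and a stand-in for the real compute step.
batches = [torch.randn(4, 8) for _ in range(3)]
outputs = run_buckets(batches, lambda x: x * 2)
print(len(outputs), outputs[0].shape)
```

The event per copy keeps the synchronization narrow: the compute stream waits for the one transfer it needs instead of draining the whole copy stream.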


The speedup breakdown

Numbers below are directional, from our workload — 1,000 mixed multimedia files on a single consumer GPU (RTX 3060 class). Your mileage will vary with corpus, sequence length distribution, and hardware generation.

| Stage | Wall clock | Speedup vs CPU baseline | Notes |
| --- | --- | --- | --- |
| CPU baseline | ~30.0 s | 1.0x | multi-threaded, cold-cache |
| Naive GPU port | ~8.0 s | 3.75x | one big batch, .cuda(), done |
| + batch shape reshape | ~1.35 s | 22x | length-bucketed, cache-line aligned |
| + torch.compile (reduce-overhead) | ~1.15 s | 26x | Python overhead dropped, not fusion |
| + pinned memory + async H2D | ~1.5 s* | ~20x* | slight regression on tiny corpora; see below |

The pinned-memory row is where the honest reporting matters. On the target 1,000-file workload, the async copy overlap was neutral to slightly negative — the compute was already so fast that the copy setup overhead cost more than the parallelism bought. On 10,000-file test runs, the same code paid off cleanly. I shipped it because the eval that mattered was the larger corpus, and I'd rather have a solution that scales up than one that wins on the demo size.

The steady-state I actually landed on was ~20x on the target workload.
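The speedup column is just the CPU baseline divided by each stage's wall clock, and the per-step gain is the previous stage divided by the current one. A short script, using only the wall-clock figures reported in the breakdown, reproduces both (the variable and function names are illustrative):

```python
CPU_BASELINE_S = 30.0

# Wall-clock figures as reported in the breakdown.
stages = [
    ("Naive GPU port", 8.0),
    ("+ batch shape reshape", 1.35),
    ("+ torch.compile (reduce-overhead)", 1.15),
    ("+ pinned memory + async H2D", 1.5),
]


def speedups(stages, baseline_s=CPU_BASELINE_S):
    """Return (name, cumulative speedup vs CPU, step speedup vs previous stage)."""
    rows, previous = [], baseline_s
    for name, seconds in stages:
        rows.append((name, baseline_s / seconds, previous / seconds))
        previous = seconds
    return rows


for name, cumulative, step in speedups(stages):
    print(f"{name:36s} {cumulative:5.2f}x cumulative, {step:4.2f}x step")
```

The batch-shape step works out to 8.0 / 1.35, about 5.9x, which is the "6x" in the earlier section heading. The last step comes out below 1x on this workload, matching the regression noted in the table, and the two-second target corresponds to 30 / 2 = 15x.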


The three days I spent chasing 22x

After landing at 20x, I spent three engineer-days trying to squeeze the next ~10%. Things I tried:

  • Custom CUDA kernel for the TF weighting step. Wrote it in Triton. Got it working. It was 3% faster than the compiled PyTorch version and 300% harder to maintain. Threw it away.
  • Half precision (fp16) for the transformer feature pass. Fine on throughput, but the downstream cosine-similarity search on the resulting features had a distribution shift I couldn't quickly bound the impact of on retrieval quality. Rolled back.
  • CUDA Graphs (manual, not via torch.compile). Marginal gain over what reduce-overhead was already doing under the hood, and it locked us to fixed batch shapes in a way that would have broken the length-bucketing.
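One cheap first check for the fp16 question is to measure how far cosine similarity moves when feature vectors are rounded to half precision. The sketch below does only that, on random stand-in vectors. It models storage rounding, not fp16 arithmetic inside the transformer pass, so it gives a floor on the effect rather than the bound the experiment actually needed. All names and data are hypothetical.

```python
import math
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def max_cosine_drift(queries, docs):
    """Largest change in query-document cosine similarity after fp16 rounding."""
    docs16 = [[to_fp16(v) for v in d] for d in docs]
    return max(
        abs(cosine(q, d) - cosine(q, d16))
        for q in queries
        for d, d16 in zip(docs, docs16)
    )


# Random stand-ins for real feature vectors (fixed seed for repeatability).
rng = random.Random(0)
docs = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(50)]
queries = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(5)]
print(f"max cosine drift: {max_cosine_drift(queries, docs):.2e}")
```

A fuller evaluation would run the same comparison on real features produced by the fp16 model and check whether the top-k retrieval results change, which is the part that takes longer than a hackathon timeline allows.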

None of these were dead ends in principle. Any of them might have paid off with another week of engineering. But this was a hackathon-to-PRISM project on a shoestring, and the throughput/engineer-cost curve had bent sharply. I stopped.

The general rule I have now: past the first 10-20x on GPU work, the engineer-hours needed for each further increment of speedup grow roughly exponentially. If the SLO is met, you stop.


[Figure: horizontal cumulative bar chart of the speedup steps for the multimedia-indexing pipeline, from a 1.0x baseline to 20.0x, on an axis with ticks at 1x, 5x, 10x, 15x, and 20x.]
Every gain came from memory access, not from more FLOPs. Same batch size, same model, same corpus — the kernel was never the bottleneck; the path from RAM to L2 was.

What I'd do at 10x scale

At 10,000 files per batch this pipeline is fine. At 100,000, or in a streaming setting where files arrive continuously and need to appear in the index within seconds, the trade flips.

The path I would take:

  1. Feature store, not a monolithic pass. Precompute per-file features asynchronously as files land in the ingest queue; the "index build" step just aggregates a materialized view. The 20x on the hot loop stops being interesting when the wall-clock is dominated by the queue depth.
  2. Async prefetch with a proper producer-consumer buffer. The pinned-memory + async H2D pattern was one worker deep. At scale, you want two or three workers feeding a bounded queue, and the tokenizer running in a separate, pre-warmed process pool: cold Python VMs on each ingest event were 40% of small-corpus latency in a rough profile I did later.
  3. Move to a proper embedding server. At that point the transformer pass belongs behind a Triton Inference Server or a small vLLM instance with continuous batching. Squeezing more out of a single-process PyTorch loop isn't the right investment when the batch dynamics are being driven by queue traffic, not by a fixed corpus.
  4. Sparse TF-IDF on CPU, dense features on GPU. Half of what I was pushing through the GPU was sparse arithmetic that a good CPU SIMD path (e.g. via scipy.sparse on a many-core machine) handles just as well. Split the pipeline by which arithmetic actually wants the hardware.
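The bounded producer-consumer buffer from item 2 can be sketched with the standard library. This is a shape-of-the-idea example under stated assumptions: threads stand in for the warm process pool, tokenize and consume are hypothetical callables, and the queue size and worker count are arbitrary.

```python
import queue
import threading

SENTINEL = object()


def run_pipeline(files, tokenize, consume, n_workers=3, max_queue=8):
    """Feed `consume` from `n_workers` tokenizer threads through a bounded queue."""
    tasks = queue.Queue()
    ready = queue.Queue(maxsize=max_queue)  # the bound gives backpressure
    for f in files:
        tasks.put(f)

    def producer():
        while True:
            try:
                f = tasks.get_nowait()
            except queue.Empty:
                break
            ready.put(tokenize(f))  # blocks while the consumer is behind
        ready.put(SENTINEL)  # one sentinel per worker signals completion

    workers = [threading.Thread(target=producer) for _ in range(n_workers)]
    for w in workers:
        w.start()

    results, finished = [], 0
    while finished < n_workers:
        item = ready.get()
        if item is SENTINEL:
            finished += 1
        else:
            results.append(consume(item))
    for w in workers:
        w.join()
    return results


# Toy usage: "tokenize" doubles the input, "consume" adds one.
out = run_pipeline(list(range(20)), lambda f: f * 2, lambda b: b + 1)
print(sorted(out)[:5])
```

Results arrive in completion order, not submission order, so a real index build would carry a file ID alongside each batch. The bounded queue is the important part: it stops fast producers from filling host memory when the GPU side falls behind.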

The meta-lesson mirrors the one from the HNSW/IVF-PQ post: optimize for the regime you're in, not the regime you might be in. A 20x on a single-GPU, fixed-corpus, hackathon-timeline pipeline is the right answer. A 20x on the same code path at streaming scale would be the wrong answer — because the wall-clock isn't in the kernel anymore.


What surprised me

The thing I quote most from this project isn't the 20x. It's that torch.compile's biggest win, on this workload, wasn't fusion — it was cutting the Python-side dispatch overhead on the hot loop by an order of magnitude. Every writeup I'd read about torch.compile framed it as a fusion story. On a small-kernel, high-iteration-count workload like ours, the fusion helped and the dispatch reduction helped ten times more.

The other one: I spent more time reasoning about memory layout than about arithmetic. The GPU wanted to do the math. The GPU was ready to do the math. My job, most of the time, was to hand it the math in a shape it could read without leaving cache.


See also


More on the Multimedia File Indexer — Samsung PRISM 2023 Excellence Award, Smart India Hackathon 2022 winner adopted by MP Police — is on the projects page.

Cite as: Saravanan, K. (2026). 20x GPU Speedup on Multimedia Indexing: Cache Locality, Batch Shape, and Where I Stopped. Kaushik Saravanan. https://www.kaushik.cv/blog/gpu-multimedia-indexing-20x