B2B Engineering Insights & Architectural Teardowns

Virtual Tensors Eliminate Data Movement in DNNs

DNN performance is increasingly constrained by data movement rather than computation. VTC introduces virtual tensors to eliminate unnecessary data transfers and reduce latency.

The problem is not obvious at first; it surfaces once accelerators compute faster than memory can supply data. In modern GPUs the gap between compute throughput and memory latency keeps widening, and more DNNs are becoming memory-bound. This is especially visible in large-model inference, where global-memory access determines latency more than the computation itself. Classic optimizations, such as operator fusion and layout transformation, cover only parts of the graph and miss significant sources of unnecessary movement.
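The memory-bound regime is easy to see with a back-of-envelope arithmetic-intensity check. The sketch below uses illustrative numbers (a 4096-wide fp16 GEMV and an assumed machine balance of ~300 FLOP/byte), not measurements from the paper:

```python
# Hypothetical back-of-envelope check for memory-bound behavior.
# All numbers are illustrative, not measurements from the paper.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred from global memory."""
    return flops / bytes_moved

# GEMV in LLM decoding: y = W @ x with W of shape (n, n), fp16 weights.
n = 4096
flops = 2 * n * n        # one multiply-add per weight
bytes_moved = 2 * n * n  # each 2-byte weight is read once
ai = arithmetic_intensity(flops, bytes_moved)  # ~1 FLOP/byte

# A GPU whose machine balance is ~300 FLOP/byte leaves such a kernel
# memory-bound by two orders of magnitude: latency is set by memory traffic.
machine_balance = 300.0
print(ai < machine_balance)  # True
```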

VTC (Virtual Tensor Compiler) attacks the problem from a different angle. Instead of physically moving data between operators, it introduces virtual tensors — a representation where data is not copied but described through a mapping function (indexing). Essentially, the tensor becomes a function for accessing other tensors. This allows the compiler to eliminate entire chains of data movement operators while maintaining computational correctness. Architecturally, the solution relies on two mechanisms: a virtual tensor opportunity graph for exploring options and a greedy algorithm that selects the strategy with the maximum reduction in latency.
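The "tensor as an access function" idea can be sketched in a few lines. This is a minimal illustration of the concept, not the paper's implementation; the class and function names are invented for the example:

```python
# Minimal sketch of a "virtual tensor": a tensor described by an
# index-mapping function over another tensor's storage, so a data-movement
# operator becomes address arithmetic instead of a copy.
class VirtualTensor:
    def __init__(self, storage, shape, index_map):
        self.storage = storage      # flat physical buffer (never copied)
        self.shape = shape
        self.index_map = index_map  # logical index (i, j) -> flat offset

    def __getitem__(self, idx):
        return self.storage[self.index_map(idx)]

def row_major(shape):
    # Standard contiguous layout: offset = i * ncols + j.
    return lambda idx: idx[0] * shape[1] + idx[1]

def transpose(t):
    # A "Transpose" that moves no data: only the index map changes.
    return VirtualTensor(
        t.storage,
        (t.shape[1], t.shape[0]),
        lambda idx, inner=t.index_map: inner((idx[1], idx[0])),
    )

buf = list(range(6))                        # a 2x3 tensor, row-major
v = VirtualTensor(buf, (2, 3), row_major((2, 3)))
vt = transpose(v)                           # logical shape (3, 2)
assert vt[1, 0] == v[0, 1]                  # same storage, remapped address
```

Composing such maps is how an entire chain of movement operators collapses into a single address computation inside the consuming kernel.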

A key insight is that modern compute cores require contiguity only in local memory, not in global memory. This allows for relaxed layout requirements and replaces copying with address computation. As a result, VTC eliminates unnecessary operations like Transpose, ScatterND, or Expand without additional kernel calls. In GPU experiments, this yields up to 1.93× speedup (averaging 1.28×) and up to 60% memory savings (averaging 17.5%). An analysis of an LLM decoder layer shows that data movement can take more time than the compute operations themselves, which is where this optimization layer provides the greatest effect.

From a practical standpoint, this appears as an evolutionary extension of the compiler rather than a replacement of existing approaches. VTC does not conflict with operator fusion or layout optimization; it complements them. The trade-off is more complex addressing logic and potential overhead when a mapping turns out to be unprofitable. Therefore, the compiler evaluates profitability: fully contiguous cases are always beneficial, while partially contiguous cases depend on size and access patterns. For engineering teams, this is a signal: further performance growth in DNNs will depend not on FLOPS, but on how aggressively we eliminate data movement at the compilation level.
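The profitability-driven greedy selection described above can be sketched as follows. The `Candidate` fields, cost numbers, and operator names are illustrative assumptions, not the paper's actual cost model:

```python
# Hypothetical sketch of a profitability-driven greedy pass: from a set of
# candidate virtualizations, repeatedly apply the one with the largest
# estimated latency saving, and stop when the rest would add latency.
# Fields and cost numbers are illustrative, not the paper's cost model.
from dataclasses import dataclass

@dataclass
class Candidate:
    op: str                 # data-movement op to virtualize
    saved_copy_us: float    # latency of the eliminated copy kernel
    addressing_us: float    # extra address-computation cost in consumers
    fully_contiguous: bool  # fully contiguous cases are always beneficial

    def profit(self) -> float:
        return self.saved_copy_us - self.addressing_us

def greedy_virtualize(candidates):
    plan, pool = [], list(candidates)
    while pool:
        best = max(pool, key=Candidate.profit)
        if not best.fully_contiguous and best.profit() <= 0:
            break           # remaining options would add latency
        plan.append(best.op)
        pool.remove(best)
    return plan

cands = [
    Candidate("Transpose", 40.0, 5.0, True),
    Candidate("ScatterND", 25.0, 30.0, False),  # rejected: net loss
    Candidate("Expand",    15.0, 2.0, True),
]
print(greedy_virtualize(cands))  # ['Transpose', 'Expand']
```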


Information source

arXiv is the largest open preprint repository (run since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org

View the original research PDF
