[Sync] Merge mainstream TileLang TVM-FFI features into TileScale #47

uv-xiao · 2026-01-15T07:39:37Z

Summary

This PR synchronizes TileScale with mainstream TileLang, integrating 577 commits of new features while preserving all TileScale distributed computing capabilities.

Key highlights:

- TVM-FFI API modernization
- All distributed TileOperators preserved and updated to the new API

Related Issue: #46

Background

TileScale diverged from mainstream TileLang on July 21, 2025 (commit 8205791d). Since then:

- Mainstream accumulated 577 commits with significant API changes
- TileScale accumulated 135 commits with distributed features

This merge integrates both, ensuring no functionality is lost.

Documentation

Comprehensive documentation added in docs/sync_with_tilelang/:

| Document | Description |
| --- | --- |
| `BUILD_AND_RUN.md` | Build instructions and test commands |
| `MERGE_RATIONALE.md` | Why this merge is needed |
| `MERGE_ANALYSIS.md` | Detailed feature status |
| `compare_to_tilang/` | Detailed study of TileScale contributions |

Breaking Changes

- C++ TileOperators now use TVM-FFI macros (already updated)
- Some internal APIs changed (documented in merge analysis)

Commits

- 1cd95ce - Main merge commit
- 5e368a8 - Documentation updates

…on (tile-ai#1230)

* Added new type mappings for int8, uint8, int16, uint16, int64, uint64, float64, bool, and uchar to the TLCPUSourceWrapper class.
* Updated the initialization function to use a common format for the CPU backend, ensuring consistency and improved error handling with the addition of get_last_error().
* Refactored the get_cpu_init_func method to return the updated initialization function, enhancing clarity and maintainability.

…gement (tile-ai#1231)

* Introduced a new issue template for planning releases, including fields for version, milestone, scope, tasks, readiness checks, and additional notes.
* This template aims to streamline the release planning process and ensure all necessary information is captured for each release.

…tings (tile-ai#1200)

* Add kernel selection option for GEMM v1 in environment settings
  - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
  - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
  - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.
* bug fix
* Add kernel selection option for GEMM in environment settings
  - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
  - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
  - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.
* Refactor GEMM macro generator to use BufferRegion instead of Buffer
  - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
  - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
  - Simplified buffer access logic for better clarity and maintainability.
* Refactor GEMM functions to utilize BufferRegion for improved memory handling
  - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
  - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
  - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
  - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.
* Refactor GEMM code for improved readability and consistency
  - Cleaned up formatting and spacing in GEMM-related files for better readability.
  - Standardized comments and code structure across various GEMM functions and macros.
  - Enhanced error messages for clarity in buffer region checks.
  - Removed redundant lines and improved overall code maintainability.
* Update GEMM correctness evaluation and macro generator for improved functionality
  - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
  - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
  - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
  - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.
* Refactor GEMM and intrinsic files for improved clarity and functionality
  - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
  - Adjusted function signature formatting in `swizzle.py` for better readability.
  - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
  - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
  - Improved function signature formatting in `language.py` for consistency.
* Enhance GEMM and MMA functionality for FP64 support
  - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
  - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
  - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
  - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
  - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.
* lint fix
* Refactor GEMM correctness evaluation and shared memory alignment handling
  - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
  - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
  - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
  - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.
* Enhance GEMM and MMA implementations with region-based memory handling
  - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
  - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
  - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.
* Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls
  - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
  - Modified function calls to use the new module structure for better organization and maintainability in testing examples.
* Enhance OnArrayDeclaration method to handle repeated buffer declarations
  - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
  - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.
* Add abbreviation for bfloat16 data type in mfma_macro_generator.py
  - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.
* Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation
  - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
  - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
  - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.
* lint fix
* Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP
  - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
  - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
  - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
  - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.
* Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py
  - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
  - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.
* Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.
* Refactor attention sink examples to simplify index calculations
  - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
  - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.
* Refactor attention sink examples to streamline index calculations
  - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
  - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.
* lint fix
* bugfix
* Refactor reduce operation handling in CUDA and Python
  - Removed outdated shared memory reduction logic from `reduce.cc`.
  - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
  - Updated CUDA header to define a wider accumulator type for better numerical accuracy.
  - Enhanced error handling for buffer scope validation in the reduction process.
* Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs
* Enhance unit loop handling by refining annotation checks
  - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
  - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.
* clean code
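The environment-variable switch described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual `Environment` class: the set of accepted truthy strings and the `select_gemm` helper are assumptions made for the example; only the variable name `TILELANG_USE_GEMM_V1`, the `use_gemm_v1` name, and the "default to v2" behaviour come from the text above.

```python
import os

# Assumption: which strings count as "truthy" is not stated in the commit message.
_TRUTHY = {"1", "true", "yes", "on"}


def use_gemm_v1(environ=None):
    """Return True when TILELANG_USE_GEMM_V1 is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("TILELANG_USE_GEMM_V1", "0").strip().lower() in _TRUTHY


def select_gemm(gemm_v1, gemm_v2, environ=None):
    """Hypothetical helper: default to v2, use v1 only when the variable forces it."""
    return gemm_v1 if use_gemm_v1(environ) else gemm_v2
```

Passing a plain dict as `environ` keeps the selection logic testable without touching the real process environment.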


* Update layout handling and introduce reshape functionality
  - Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
  - Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
  - Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
  - Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.
* lint fix
* Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged
* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
* remove useless prove
* remove comment

Co-authored-by: tilelang-bot <bot@tilelang>

* Add correctness evaluation script for GEMM v2
  - Introduced a new Python script `correctness_evaluation_tcgen05.py` for testing the correctness of GEMM v2 implementations using pytest.
  - Implemented matrix multiplication and compilation checks, along with parameterized tests for various input configurations.
  - Enhanced the testing framework to validate GEMM operations with different data types and configurations, ensuring robustness in the implementation.
  - Updated logging in `legalize_negative_index.cc` to reduce verbosity by changing from WARNING to DLOG.
  - Adjusted assertions in `tcgen05_macro_generator.py` to accommodate new warp size requirements for improved performance.
  - Removed unused variable in `gemm_tcgen05.py` to streamline the codebase.
* lint fix

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>


* Deleted the LoopVectorizeDynamic implementation from the transform module.
* Removed associated references in the phase and initialization files to streamline the codebase.
* This change simplifies the transformation pipeline by eliminating unused functionality.

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [Enhancement] Add FP8 support and reproducibility in lighting indexer
* Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
* Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.
* Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`
* Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
* Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.
* test fix
* bug fix

…during Layout Reshape (tile-ai#1248)

* fix
* Refactor tensor reshaping in fp8_lighting_indexer.py
  - Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
  - Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.
* Refactor analyzer usage in Layout and Fragment reshaping
  - Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
  - Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.

…-ai#1247)

* [Refactor] Update buffer handling in copy and atomic operations
* Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
* Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
* Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
* Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.
* Remove obsolete .git_commit.txt file
* Add unit test for dynamic copy extent handling in TileLang
* Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
* The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
* The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.
* lint fix
* fix
* Revert "fix" This reverts commit 828b4c1.
* Update TVM submodule and refactor atomic and copy functions
* Updated the TVM submodule to a dirty state.
* Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
* Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.
* Enhance extent handling in atomic and copy functions
* Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
* Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
* Added comments to clarify the extent handling logic.
* Enhance `legalize_pairwise_extents` function with early-exit rule
* Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
* Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.
* lint fix
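As a rough illustration of the "element-wise minimum for defining copy extents" idea mentioned above, the sketch below works on plain integer extents of equal rank. The function name `copy_extents` is hypothetical; the real code operates on symbolic extents, and this does not attempt to reproduce `legalize_pairwise_extents` or its broadcasting and early-exit rules.

```python
def copy_extents(src_extents, dst_extents):
    """Element-wise minimum of two equal-rank lists of static integer extents.

    Hypothetical helper for illustration only: it clamps the copied region to
    what both the source and the destination can hold in every dimension.
    """
    if len(src_extents) != len(dst_extents):
        raise ValueError("this sketch only handles extents of equal rank")
    return [min(s, d) for s, d in zip(src_extents, dst_extents)]
```

For example, copying a `[128, 64]` source region into a `[96, 64]` destination would be limited to `[96, 64]` under this rule.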


… with `-inf` instead of clearing accumulators. (tile-ai#1222)

* Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.
* lint
* pre-commit
* Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.
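To see why filling out-of-bounds (OOB) score positions with `-inf` works, consider the plain-Python softmax below. It is a sketch of the general masking technique, not the kernel from this commit; the function name and the `valid_len` parameter are assumptions for illustration, and `valid_len` is assumed to be at least 1.

```python
import math


def masked_softmax(scores, valid_len):
    """Softmax over scores where positions >= valid_len are treated as OOB.

    OOB entries are replaced by -inf, so exp() maps them to exactly 0.0 and
    they contribute nothing to the normalisation sum.
    """
    filled = [s if i < valid_len else -math.inf for i, s in enumerate(scores)]
    row_max = max(filled)  # finite as long as valid_len >= 1
    exps = [math.exp(s - row_max) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the masked entries become exact zeros after the exponential, the accumulators themselves do not need to be cleared separately.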

* [fix] NVRTC execution backend
* [fmt] run pre-commit
* [fix] coderabbit reviews
* [test] add cuda-python to test dep
* [fix] coderabbit reviews
* [fix] CUDA 13 compatibility
* [fix] sm90
* [fix] CUDA 13 compatibility
* [fix] pre-commit
* [fix] always use cuda::std::__atomic_ref_impl
* [fix] restore to external API
* Revert "[fix] restore to external API" This reverts commit 49bd875.
* [fmt] use space instead tabs for py codegen
* [fix] im2col API
* [fix] revert atomic.h
* [fix] dynamic shape
* [refactor] extract common utils
* [feat] support L2 persistent map
* [fix] l2 persistent map
* [fix] pre-commit
* [fix] restore _TYPE_MAP
* [fix] pre-commit
* [fix] avoid duplicate TMA descs
* [docs] add docstring
* [fix] coderabbit
* [fix] coderabbit
* [fix] coderabbit
* [fix] coderabbit

…e-ai#1260)

* fix nsa bwd and atomic
* [Lint]
* [BugFix]
  - New implementation for atomicMax and atomicMin using atomicCAS
  - PTX version atomicAdd for single 16-byte data
  - Modify the test cases
* [Lint]

Co-authored-by: tzj-fxz <tzjfxz@gmail.com>


* add typing stub for tir.ir
* remove idents
* minor update
* [Refactor] add numpy conversion for dtype
* fix lint error
* remove unused np.float_ in dtype conversion
* fix type in np.int_
* fix typo
* minor fix
* remove debug files


…n for better numerical stability (tile-ai#1269)

* Implement max score retention across blocks in FlashAttention for improved stability
* fix manual pipeline parameters
* Update examples/flash_attention/example_gqa_fwd_varlen.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* fix typo
* more
* fix a previous typo

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
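The "keep the max of all blocks seen" idea is the running-max form of online softmax. The sketch below shows it on plain Python lists, assuming non-empty blocks; it is an illustration of the general technique rather than the code in `example_gqa_fwd_varlen.py`, and the function name is hypothetical.

```python
import math


def online_softmax_sum(blocks):
    """Return (running_max, running_sum) with running_sum = sum(exp(x - running_max)).

    The running max only ever grows: each new block's max is combined with the
    max of all previously seen blocks, and the partial sum is rescaled to the
    new reference point before the block's own terms are added.
    """
    running_max = -math.inf
    running_sum = 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # exp(-inf) is 0.0, so the first iteration rescales the empty sum safely.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in block)
        running_max = new_max
    return running_max, running_sum
```

Using the block-local max alone instead of the running max would let later, smaller blocks lower the reference point, which is the situation the stability fix avoids.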


Rachmanino · 2026-01-28T08:59:45Z

cc @chengyupku

LeiWang1999 and others added 30 commits November 11, 2025 22:10

[Fix] Fix a type that make wrong T.macro backtrace (tile-ai#1234)

2b1f599

[Bugfix] Minor fix in builder.py (tile-ai#1235)

30d8ded

[Language] Add type stubs for tir op (tile-ai#1239)

02cfc2a

* add typing stub for tir.ir
* remove idents
* minor update

RMSNorm epsilon refine in the example (tile-ai#1243)

468b1b7

* Fix division by zero in RMS normalization
* Fix rsqrt calculation to avoid division by zero
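The usual way to avoid a division by zero in RMS normalization is to add a small epsilon under the square root before taking the reciprocal. The sketch below shows that pattern in plain Python; the default `eps=1e-6` and the function name are assumptions, not values taken from the example in this commit.

```python
import math


def rms_norm(x, eps=1e-6):
    """Normalize x by its root mean square, with eps guarding the all-zero case."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # rsqrt(mean(x^2) + eps)
    return [v * inv_rms for v in x]
```

With an all-zero input, `mean_sq` is 0 and the epsilon keeps the reciprocal finite, so the output is simply all zeros.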

[AMD] enable amd ci test & fix bug & fix dockerfile (tile-ai#1244)

b10d49b

[Minor] Remove git_commit.txt (tile-ai#1249)

c139855

[Language] Add missing while statement (tile-ai#1254)

5eb30a4

* add typing stub for tir.ir
* remove idents
* minor update
* [Language] Add missing while statement
* add test

[BugFix] Add autotune and exp2 for GDN kernel (tile-ai#1258)

eac96cd

* [BugFix] Add autotune and exp2 for GDN kernel
* [Lint]
* [Lint]

[AMD] Update CK for ROCm7 (tile-ai#1262)

729e66c

[Example] Add GQA decoding kernel with varlen page table (tile-ai#1265)

716dbef

* [Example] Add page table for gqa decode
* [Example] Page table for varlen decoding
* [Lint]
* [Refactor] Remove redundant code
* [Lint]
* [Lint]
* [Lint]

[EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max numerical stability (tile-ai#1148)

a2a2781

* Keep the max of all blocks seen in scores_max for stability
* ruff formatting

[Docs] Improve Installation Guide (tile-ai#1270)

b3d6f03

* [Docs] Improve installation guide
* address comments

[Bugfix] Fix multiple cg definition when using T.sync_grid (tile-ai#1272)

220c323

[Minor] Remove from __future__ import annotations for python 3.8 (tile-ai#1273)

b192251

[BugFix] Adding extra parameters into autotune hashkey (tile-ai#1274)

e805f8e

* [BugFix] Adding extra parameters into autotune hashkey
* lint
* None check
* check serializable
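A cache key that also covers extra parameters might be built along these lines. This is a hedged sketch of the general approach only: the function name, the payload layout, and the `repr` fallback for non-serializable values are assumptions, not the autotuner's actual implementation.

```python
import hashlib
import json


def autotune_cache_key(func_source, config, extra_params=None):
    """Build a stable hex digest from the kernel source, config and extra parameters.

    Hypothetical helper: extra_params may be None ("None check"), and values that
    json cannot serialize fall back to repr() ("check serializable").
    """
    payload = {
        "source": func_source,
        "config": config,
        "extra": extra_params if extra_params is not None else {},
    }
    blob = json.dumps(payload, sort_keys=True, default=repr)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Including the extra parameters in the hashed payload means two tuning runs that differ only in those parameters no longer collide on the same cache entry.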

Fix various issues under int64_t static and dynamic shape. (tile-ai#1218)

49c8571

* Fix various issues under int64_t static and dynamic shape.
* Resolve reviewed issues.
* Add unit test.
* fix

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

Rachmanino requested a review from chengyupku January 16, 2026 02:44

fix a typo

9894b5e

Rachmanino removed the request for review from chengyupku January 16, 2026 03:13

Rachmanino added 10 commits January 16, 2026 11:26

Remove symbols created by Claude's hallucination

ecdd597

fix include logic in cuda codegen

1dec815

fix ldst.h

de0692b

fix more files

e2b9ea0

migrate from TIR_REGISTER_TL_OP to TIR_REGISTER_TL_TILE_OP

4cbcfb6

let all distributed examples pass

c10bf02

fix deepep regression via applying vectorization

4344f79

fix lint and remove Claude's merge doc

9124334

fix sdist

2173854

disable arm and macos

f6c13e9

Rachmanino force-pushed the uv/tilescale_tvmffi branch from 53acc76 to 75a1f38 Compare January 19, 2026 07:47

fix dist.yml

874e050

Rachmanino force-pushed the uv/tilescale_tvmffi branch from 75a1f38 to 874e050 Compare January 19, 2026 07:54

Rachmanino added 5 commits January 19, 2026 17:29

disable ci for arm and metal

6ea6d96

fix ts_ext

de8d36d

use sdist for ci

141e4d4

use tilelang's new ci

109e67a

use cmake rather than pyproject dependency for tilescale extension

af1906e

Rachmanino force-pushed the uv/tilescale_tvmffi branch from 35a8c0f to 0d38d16 Compare January 19, 2026 14:02

install torch before ts_ext

b51575d

Rachmanino force-pushed the uv/tilescale_tvmffi branch from 0d38d16 to b51575d Compare January 19, 2026 14:14

Rachmanino added 3 commits January 28, 2026 15:12

fix torch lib link bug

78fc8fe

add missing codegen

add3089

disable ci test for deepep

b241d88

Rachmanino requested a review from chengyupku January 28, 2026 08:59

fix gitignore bug

4ce401d


20 participants
