feat(treesitter): treesitter based markdown parsing #201
ten3roberts wants to merge 17 commits into YousefHadder:main
|
Thank you for initiating this change; I really appreciate it. One note is that currently, the TreeSitter utilities are in ... I'd like to refactor the shared TreeSitter utilities into a dedicated module before extending them to other features.
Proposed Structure
This way:
I noticed you plan to open an issue to track this. I would really appreciate it if you could consider a shared TS module as part of this refactoring.
- Add centralized debug logging utility (ts.log) for treesitter operations
- Fix infinite loop caused by aggressive parser:invalidate() on every call
- Query tree directly with named_descendant_for_range instead of vim.treesitter.get_node
- Add proper type annotations (TSNode, vim.treesitter.LanguageTree)
- Add centralized node type constants (M.nodes) to avoid hardcoded strings
- Update documentation with treesitter integration section and troubleshooting
- Improve logging to distinguish expected fallbacks (letter lists) from unexpected ones
Debug logging can be enabled with:
require('markdown-plus.format.treesitter').debug = true
Pull request overview
Refactors markdown parsing to use a shared Tree-sitter utility module for more accurate detection (with regex fallback), and updates list/header/footnote logic to benefit from Tree-sitter where available.
Changes:
- Added a shared markdown-plus.treesitter module and migrated code-block/header detection to use it with regex fallback.
- Updated list parsing API to accept a row number and prefer Tree-sitter parsing when possible.
- Adjusted specs and multiple list handlers/utilities to pass row numbers for Tree-sitter-aware parsing.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| spec/markdown-plus/list_spec.lua | Updates tests to parse via buffer+row to exercise the new row-aware list parsing path. |
| lua/markdown-plus/utils.lua | Switches code-block detection to the new shared Tree-sitter module. |
| lua/markdown-plus/treesitter/init.lua | Introduces shared Tree-sitter helpers (parser access, node lookup, fenced code block helpers). |
| lua/markdown-plus/table/conversion.lua | Treats ~~~ as a code fence in CSV/table heuristics. |
| lua/markdown-plus/list/shared.lua | Passes row numbers into list parsing for Tree-sitter support. |
| lua/markdown-plus/list/renumber.lua | Passes row numbers into list parsing for Tree-sitter support. |
| lua/markdown-plus/list/parser.lua | Adds Tree-sitter-driven list parsing (with regex fallback) and changes the parse API to accept row. |
| lua/markdown-plus/list/handlers.lua | Updates handlers to pass cursor row into list parsing and clarifies code-block skipping wrapper docs. |
| lua/markdown-plus/list/checkbox.lua | Passes line_num into list parsing for Tree-sitter support. |
| lua/markdown-plus/headers/parser.lua | Uses Tree-sitter line sets to exclude fenced code blocks when collecting headers (regex fallback retained). |
| lua/markdown-plus/format/treesitter.lua | Delegates Tree-sitter node retrieval and fenced code block detection to the shared module. |
| lua/markdown-plus/format/patterns.lua | Updates comments around Tree-sitter node types (but contains an outdated reference). |
| lua/markdown-plus/footnotes/parser.lua | Uses Tree-sitter line sets to exclude fenced code blocks (regex fallback retained). |
Performance: hybrid TS + regex strategy
I benchmarked treesitter vs regex across different buffer sizes (100–2000 lines, 1000 iterations each) and found that TS is not universally faster.
The cost comes from ...
Based on these numbers, I've switched to a hybrid approach in the latest commits.
The value of treesitter in this PR is accuracy (correct code block detection that regex can't match), not speed. The hybrid approach gives us both: correctness from TS where it matters, and performance from regex where TS is slower.
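A comparison like the one described above can be approximated with a small timing harness. The following is a minimal sketch in plain Lua, not the benchmark that was actually run for this PR: `make_lines`, `count_list_lines`, and `bench` are hypothetical names, and the regex-style path is a stand-in pattern, not the plugin's parser. A treesitter path would be timed the same way by passing a different function to `bench` inside Neovim.

```lua
-- Hypothetical timing harness; names are illustrative, not from the plugin.

-- Build a synthetic markdown buffer of `n` lines (every third line is a list item).
local function make_lines(n)
  local lines = {}
  for i = 1, n do
    if i % 3 == 0 then
      lines[i] = "- item " .. i
    else
      lines[i] = "plain text " .. i
    end
  end
  return lines
end

-- Stand-in for a regex-style parser: counts lines that look like bullet items.
local function count_list_lines(lines)
  local count = 0
  for _, line in ipairs(lines) do
    if line:match("^%s*[%-*+]%s+") then
      count = count + 1
    end
  end
  return count
end

-- Run `fn` `iterations` times and return the average CPU seconds per call.
local function bench(fn, iterations)
  local start = os.clock()
  for _ = 1, iterations do
    fn()
  end
  return (os.clock() - start) / iterations
end

-- Same sizes and iteration count as quoted in the comment above.
for _, size in ipairs({ 100, 500, 2000 }) do
  local lines = make_lines(size)
  local avg = bench(function()
    count_list_lines(lines)
  end, 1000)
  print(string.format("%5d lines: %.6f s/iter", size, avg))
end
```

`os.clock` measures CPU time, which is usually adequate for comparing two in-process code paths; absolute numbers will vary by machine, so only the relative difference between paths at each buffer size is meaningful.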
|
@ten3roberts let me know what you think of the recent changes and the above comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
lua/markdown-plus/list/parser.lua:277
parse_list_line falls back to regex whenever the treesitter attempt returns nil. That means if treesitter is available but determines the row is not a list item (e.g., inside fenced code blocks or other constructs TS parses differently), regex can still misclassify the line as a list, undermining the stated goal of using TS for more accurate parsing. Consider making TS authoritative when available (e.g., only fall back to regex when TS is unavailable, or when TS positively identifies a list_item but marker parsing needs regex such as letter markers).
---Parse a line to detect list information
---Uses treesitter when row is provided and available, falls back to regex
---@param line string Line to parse
---@param row number 1-indexed row for treesitter
---@return markdown-plus.ListInfo|nil List info or nil if not a list
function M.parse_list_line(line, row)
  if not line then
    return nil
  end
  -- Try treesitter first (if row provided)
  local ts_result = row and parse_list_line_ts(row) or nil
  if ts_result then
    return ts_result
  end
  -- Fall through to regex if ts returns nil
  -- (handles letter lists, ts unavailable, continuation lines, etc.)
  -- Fallback to regex
  return parse_list_line_regex(line)
end
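One way to act on the suggestion above is to have the treesitter path report why it produced no result, so the caller can tell "not a list" apart from "could not decide". The following is a minimal sketch under stated assumptions: the second return value (`status`) and the `STATUS` table are hypothetical and not part of the PR, and both parsers are passed in as stubs so that only the dispatch logic is shown.

```lua
-- Hypothetical status values; the PR's parse_list_line_ts does not return these.
local STATUS = {
  UNAVAILABLE = "unavailable", -- no parser for the buffer
  NOT_LIST = "not_list",       -- TS parsed the row and it is not a list item
  UNDECIDED = "undecided",     -- TS cannot decide; defer to regex
}

-- Build a parse function from a TS parser and a regex parser (both injected).
-- parse_ts(row) is assumed to return (info_or_nil, status).
local function make_parse_list_line(parse_ts, parse_regex)
  return function(line, row)
    if not line then
      return nil
    end
    if row then
      local info, status = parse_ts(row)
      if info then
        return info
      end
      -- TS is authoritative when it positively rules the row out.
      if status == STATUS.NOT_LIST then
        return nil
      end
    end
    -- TS unavailable or undecided, or no row given: fall back to regex.
    return parse_regex(line)
  end
end

-- Stubs for illustration only.
local function ts_stub(row)
  if row == 1 then
    return { source = "ts" }, nil
  elseif row == 2 then
    return nil, STATUS.NOT_LIST -- e.g. a "- foo" line inside a fenced code block
  end
  return nil, STATUS.UNDECIDED
end

local function regex_stub(line)
  if line:match("^%-%s") or line:match("^%a%.%s") then
    return { source = "regex" }
  end
  return nil
end

local parse = make_parse_list_line(ts_stub, regex_stub)
```

With these stubs, `parse("- foo", 2)` yields nil even though the regex stub would match the line, while `parse("a. letter", 3)` still reaches the regex fallback. Whether the extra status is worth the added API surface is a design choice for the PR authors.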
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
lua/markdown-plus/treesitter/init.lua:165
- The docstring says this “efficiently queries all nodes”, but the implementation recursively walks the entire syntax tree. Either adjust the comment to match the behavior, or switch to a TS query-based approach if you want the “query” efficiency claim to be accurate.
---Get set of line numbers inside nodes of a specific type
---Efficiently queries all nodes of the type and collects their line ranges
---@param node_type string Node type to find (e.g. M.nodes.FENCED_CODE_BLOCK)
---@return table<number, boolean>|nil Line number set (1-indexed), or nil if ts unavailable
function M.get_lines_in_node_type(node_type)
  local parser = M.get_parser()
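If the query-based route mentioned in the review comment were taken, it might look roughly like the sketch below. This is an assumption-laden illustration, not the PR's implementation: `lines_from_ranges` and `get_lines_in_node_type_query` are hypothetical names, the second function only works inside Neovim with a markdown parser installed, and whether a query is actually faster than a recursive walk would need to be measured. The range-to-line conversion is kept as a separate pure function so it can be checked without Neovim.

```lua
-- Convert treesitter ranges into a 1-indexed line set.
-- Each range is { start_row, start_col, end_row, end_col }, 0-indexed,
-- with an exclusive end position, as returned by TSNode:range().
local function lines_from_ranges(ranges)
  local set = {}
  for _, r in ipairs(ranges) do
    local start_row, end_row, end_col = r[1], r[3], r[4]
    -- A node ending at column 0 does not include any text on its end row.
    local last = (end_col == 0 and end_row > start_row) and end_row - 1 or end_row
    for row = start_row, last do
      set[row + 1] = true
    end
  end
  return set
end

-- Neovim-only part (hypothetical helper): collect ranges with a query
-- instead of walking the whole tree by hand.
local function get_lines_in_node_type_query(bufnr, node_type)
  local ok, parser = pcall(vim.treesitter.get_parser, bufnr, "markdown")
  if not ok or not parser then
    return nil -- treesitter unavailable; the caller falls back to regex
  end
  local query = vim.treesitter.query.parse("markdown", "(" .. node_type .. ") @node")
  local root = parser:parse()[1]:root()
  local ranges = {}
  for _, node in query:iter_captures(root, bufnr, 0, -1) do
    ranges[#ranges + 1] = { node:range() }
  end
  return lines_from_ranges(ranges)
end
```

For example, a fenced code block reported as `{ 2, 0, 5, 0 }` covers buffer lines 3 through 5 in 1-indexed terms, because the end position at column 0 of row 5 is exclusive.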
|
Thank you. I think they make sense. What did you use for testing the perf? And I agree, the TS advantage is accuracy (especially for, e.g., C++, whose syntax cannot be parsed with regex, function signature declarations for instance) and its iterative nature. Startup (fixed) costs are higher, so a hybrid approach is best.
|
@ten3roberts I used a performance test written by Copilot to measure and compare the performance of both.
Description
This PR supplements the existing markdown structure parsing, which uses string.match and regex-like patterns, with treesitter for more accurate and less error-prone parsing of, e.g., commented lists, nested code block syntax, and similar constructs that can throw off regex-based parsing, which can't always handle complex escaping.
If TS parsing is not available, it falls back to the previous regex parsing.
Type of Change
Related Issues
Fixes [... TODO: open issue]
Testing
Checklist
Remaining Items before finished