feat(treesitter): treesitter based markdown parsing #201

Open
ten3roberts wants to merge 17 commits into YousefHadder:main from ten3roberts:refactor/ts-parse-utils

Conversation

@ten3roberts
Contributor

Description

This PR supplements the existing markdown structure parsing, which uses string.match and regex-like patterns, with treesitter-based parsing. Treesitter is more accurate and less error-prone for constructs such as commented lists and nested code block syntax, which can throw off regex-based parsing because regexes can't always handle complex escaping.

If TS parsing is not available, it falls back to the previous regex parsing.
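To make the nested code block problem concrete, here is a minimal sketch in plain Lua. `naive_code_lines` is a hypothetical scanner written for this illustration, not the plugin's actual regex fallback. It toggles an "inside code" flag on every line that starts with three backticks, which goes wrong when a four-backtick fence contains a three-backtick fence as literal text.

```lua
-- Hypothetical naive scanner (illustration only, not the plugin's code):
-- every line starting with ``` flips the "inside a code block" flag.
local function naive_code_lines(lines)
  local inside, result = false, {}
  for i, line in ipairs(lines) do
    if line:match("^%s*```") then
      inside = not inside
      result[i] = true -- fence lines are counted as part of the block
    else
      result[i] = inside
    end
  end
  return result
end

-- A four-backtick fence whose content contains a three-backtick fence.
local lines = {
  "````markdown",
  "```lua",
  "- looks like a list item, but is code",
  "```",
  "still inside the outer block",
  "````",
  "- a real list item",
}

local flags = naive_code_lines(lines)
-- flags[3] is false: the scanner believes row 3 is outside any code block,
-- so a list parser running after it would treat that row as a list item.
```

A parser that follows the CommonMark rule (a closing fence must be at least as long as the opening one) keeps rows 2 to 5 inside the outer block.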

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Code refactoring
  • Performance improvement (treesitter is faster than lua string utils generally)

Related Issues

Fixes [... TODO: open issue]

Testing

  • Tested manually (by using it myself, and with the test suite and an example md file)
  • [-] Added/updated tests (if applicable)

Checklist

  • Code follows project style
  • Self-reviewed my code
  • Commented complex logic
  • Updated documentation (if needed)
  • No new warnings generated

Remaining Items before finished

  • Test on other non-md filetypes
  • Create documenting issue
  • Update repo docs
  • Daily-drive for 1 day
  • Improve code comment quality

@ten3roberts changed the title from "Feat: Treesitter based markdown parsing" to "[WIP] Feat: Treesitter based markdown parsing" on Jan 4, 2026
@YousefHadder
Owner

Thank you for initiating this change; I really appreciate it.

One note is that currently, TreeSitter utilities are in lua/markdown-plus/format/treesitter.lua, but non-format modules (lists, headers, footnotes) are importing from it. This creates confusing dependencies - why is the list parser importing from the format module?

I'd like to refactor the shared TreeSitter utilities into a dedicated module before extending to other features.

Proposed Structure

lua/markdown-plus/
├── treesitter/
│   └── init.lua              # Shared TS utilities (new)
│       ├── M.nodes           # Node type constants
│       ├── is_available()
│       ├── get_parser()
│       ├── get_node_at_cursor()
│       ├── get_node_at_position()
│       ├── find_ancestor()
│       ├── is_row_in_node_type()
│       ├── get_lines_in_node_type()
│       └── is_in_fenced_code_block()
│
├── format/
│   └── treesitter.lua        # Format-specific TS functions
│       ├── get_formatting_node_at_cursor()
│       ├── get_any_format_at_cursor()
│       └── remove_formatting_from_node()
│
├── list/parser.lua           # imports "markdown-plus.treesitter"
├── headers/parser.lua        # imports "markdown-plus.treesitter"
├── footnotes/parser.lua      # imports "markdown-plus.treesitter"
└── utils.lua                 # imports "markdown-plus.treesitter"

This way:

  • Shared TS utilities live in treesitter/init.lua
  • Format-specific TS functions stay in format/treesitter.lua
  • All modules import from the appropriate location
  • Easier to extend to other features (links, images, tables, quotes)
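As a rough sketch of what the shared module could look like, the following shows `M.nodes` and `find_ancestor()` only. The function names come from the proposed structure above; the bodies, the two node type strings (taken from the tree-sitter markdown grammar), and the `mock_node` helper are assumptions made for illustration. The helper only stands in for real `TSNode` objects, which expose `:type()` and `:parent()`.

```lua
-- Sketch of lua/markdown-plus/treesitter/init.lua (bodies are assumptions).
local M = {}

-- Node type constants, so callers avoid hardcoded strings.
M.nodes = {
  FENCED_CODE_BLOCK = "fenced_code_block",
  LIST_ITEM = "list_item",
}

---Walk up from `node` until the node itself or an ancestor has `node_type`.
---@param node table|nil Any object with :type() and :parent() (e.g. a TSNode)
---@param node_type string
---@return table|nil
function M.find_ancestor(node, node_type)
  local current = node
  while current do
    if current:type() == node_type then
      return current
    end
    current = current:parent()
  end
  return nil
end

-- Hypothetical stand-in for a TSNode, used to try the function outside Neovim.
local function mock_node(type_name, parent)
  return {
    type = function() return type_name end,
    parent = function() return parent end,
  }
end

local document = mock_node("document", nil)
local block = mock_node(M.nodes.FENCED_CODE_BLOCK, document)
local content = mock_node("code_fence_content", block)

local found = M.find_ancestor(content, M.nodes.FENCED_CODE_BLOCK)
-- A real module file would end with `return M`.
```

With this layout, `list/parser.lua` and the other parsers would `require("markdown-plus.treesitter")` instead of reaching into the format module.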

I noticed you plan to open an issue for tracking this. I would really appreciate it if you could consider the idea of a shared TS module for this refactoring.

- Add centralized debug logging utility (ts.log) for treesitter operations
- Fix infinite loop caused by aggressive parser:invalidate() on every call
- Query tree directly with named_descendant_for_range instead of vim.treesitter.get_node
- Add proper type annotations (TSNode, vim.treesitter.LanguageTree)
- Add centralized node type constants (M.nodes) to avoid hardcoded strings
- Update documentation with treesitter integration section and troubleshooting
- Improve logging to distinguish expected fallbacks (letter lists) from unexpected ones

Debug logging can be enabled with:
  require('markdown-plus.format.treesitter').debug = true
@YousefHadder YousefHadder marked this pull request as ready for review February 8, 2026 18:41
Copilot AI review requested due to automatic review settings February 8, 2026 18:42
@YousefHadder changed the title from "[WIP] Feat: Treesitter based markdown parsing" to "feat(treesitter): Treesitter based markdown parsing" on Feb 8, 2026
@YousefHadder changed the title from "feat(treesitter): Treesitter based markdown parsing" to "feat(treesitter): treesitter based markdown parsing" on Feb 8, 2026
Contributor

Copilot AI left a comment

Pull request overview

Refactors markdown parsing to use a shared Tree-sitter utility module for more accurate detection (with regex fallback), and updates list/header/footnote logic to benefit from Tree-sitter where available.

Changes:

  • Added a shared markdown-plus.treesitter module and migrated code-block/header detection to use it with regex fallback.
  • Updated list parsing API to accept a row number and prefer Tree-sitter parsing when possible.
  • Adjusted specs and multiple list handlers/utilities to pass row numbers for Tree-sitter-aware parsing.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| spec/markdown-plus/list_spec.lua | Updates tests to parse via buffer+row to exercise the new row-aware list parsing path. |
| lua/markdown-plus/utils.lua | Switches code-block detection to the new shared Tree-sitter module. |
| lua/markdown-plus/treesitter/init.lua | Introduces shared Tree-sitter helpers (parser access, node lookup, fenced code block helpers). |
| lua/markdown-plus/table/conversion.lua | Treats ~~~ as a code fence in CSV/table heuristics. |
| lua/markdown-plus/list/shared.lua | Passes row numbers into list parsing for Tree-sitter support. |
| lua/markdown-plus/list/renumber.lua | Passes row numbers into list parsing for Tree-sitter support. |
| lua/markdown-plus/list/parser.lua | Adds Tree-sitter-driven list parsing (with regex fallback) and changes the parse API to accept row. |
| lua/markdown-plus/list/handlers.lua | Updates handlers to pass cursor row into list parsing and clarifies code-block skipping wrapper docs. |
| lua/markdown-plus/list/checkbox.lua | Passes line_num into list parsing for Tree-sitter support. |
| lua/markdown-plus/headers/parser.lua | Uses Tree-sitter line sets to exclude fenced code blocks when collecting headers (regex fallback retained). |
| lua/markdown-plus/format/treesitter.lua | Delegates Tree-sitter node retrieval and fenced code block detection to the shared module. |
| lua/markdown-plus/format/patterns.lua | Updates comments around Tree-sitter node types (but contains an outdated reference). |
| lua/markdown-plus/footnotes/parser.lua | Uses Tree-sitter line sets to exclude fenced code blocks (regex fallback retained). |

Copilot AI review requested due to automatic review settings February 8, 2026 18:53
@YousefHadder
Owner

Performance: hybrid TS + regex strategy

I benchmarked treesitter vs regex across different buffer sizes (100–2000 lines, 1000 iterations each) and found that TS is not universally faster:

| Operation | Winner | Why |
| --- | --- | --- |
| Code block detection at cursor | TS wins (7-11x faster at 500+ lines) | TS does a constant-time ancestor walk; regex must scan every line from top of file to cursor |
| Full-buffer code block collection | Regex wins (55-60x faster) | parse(true) + recursive Lua tree walk is expensive; regex is a simple string.match loop |
| Single-line list parsing | Regex wins (10-15x faster) | One string.match call vs TS node lookup + ancestor walk |
| Header scanning (full buffer) | Regex wins (38-44x faster) | Same bottleneck as full-buffer code block collection |

The cost comes from get_parser() calling parse(true) on every invocation and the Lua-side recursive tree walk. For operations that scan the entire buffer, regex is dramatically faster because it's just iterating lines with string.match.
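For readers who want to reproduce this kind of comparison, here is a minimal timing harness in plain Lua. It is not the benchmark used for the numbers above; `bench`, `make_lines`, and `in_code_block_regex` are hypothetical helpers, and the synthetic buffer layout is an arbitrary choice. It only illustrates why a regex point query gets slower as the cursor moves down the file: each call scans from the top of the buffer to the row.

```lua
-- Time `fn` over `iterations` calls; returns average CPU seconds per call.
-- os.clock() is used so the sketch runs anywhere; inside Neovim a
-- high-resolution timer such as vim.uv.hrtime() would be the usual choice.
local function bench(fn, iterations)
  local start = os.clock()
  for _ = 1, iterations do
    fn()
  end
  return (os.clock() - start) / iterations
end

-- Synthetic buffer: in every group of 20 lines, rows 1 and 5 are fences.
local function make_lines(n)
  local lines = {}
  for i = 1, n do
    local r = i % 20
    if r == 1 or r == 5 then
      lines[i] = "```"
    else
      lines[i] = "- item " .. i
    end
  end
  return lines
end

-- Regex-style point query: scan every line from the top down to `row`.
local function in_code_block_regex(lines, row)
  local inside = false
  for i = 1, row do
    if lines[i]:match("^%s*```") then
      inside = not inside
    end
  end
  return inside
end

local lines = make_lines(2000)
local near_top = bench(function() in_code_block_regex(lines, 10) end, 1000)
local near_end = bench(function() in_code_block_regex(lines, 2000) end, 1000)
print(string.format("row 10: %.3g s/call, row 2000: %.3g s/call", near_top, near_end))
```

A treesitter variant would be timed the same way by passing a closure that calls the TS-based check, which is how the per-operation winners in the table can be compared on equal terms.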

Based on these numbers, I've switched to a hybrid approach in the latest commits:

  • TS for point queries: is_in_fenced_code_block() (called on every Enter/Tab keypress), format node detection, is_row_in_node_type(). These are the cases where TS genuinely outperforms regex.
  • Regex for full-buffer scans: get_all_headers() and footnote code block exclusion now use regex directly instead of routing through ts.get_lines_in_node_type().

The value of treesitter in this PR is accuracy (correct code block detection that regex can't match), not speed. The hybrid approach gives us both: correctness from TS where it matters, and performance from regex where TS is slower.
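A single-pass regex scan of the kind described for full-buffer work can be sketched as follows. This is an illustration under stated assumptions, not the PR's `get_all_headers()`: the function below takes a plain table of lines, handles only ATX headers, and ignores some CommonMark details (for example, backticks in an info string). It does track the opening fence's character and length, so a shorter fence or a `~~~` line inside a backtick block does not close it.

```lua
-- Hypothetical single-pass header scan that skips fenced code blocks.
local function get_all_headers(lines)
  local headers = {}
  local fence_char, fence_len = nil, 0 -- state of the currently open fence

  for row, line in ipairs(lines) do
    if fence_char then
      -- Closing fence: same character, at least as long, nothing else on the line.
      local close = line:match("^ ? ? ?(" .. fence_char .. "+)%s*$")
      if close and #close >= fence_len then
        fence_char, fence_len = nil, 0
      end
    else
      -- Opening fence: three or more backticks or tildes, indented at most 3 spaces.
      local open = line:match("^ ? ? ?(```+)") or line:match("^ ? ? ?(~~~+)")
      if open then
        fence_char, fence_len = open:sub(1, 1), #open
      else
        local hashes, text = line:match("^(#+)%s+(.-)%s*$")
        if hashes and #hashes <= 6 then
          headers[#headers + 1] = { row = row, level = #hashes, text = text }
        end
      end
    end
  end

  return headers
end

local headers = get_all_headers({
  "# Title",
  "````md",
  "# not a header",
  "```",
  "## also not a header",
  "````",
  "## Section",
  "~~~",
  "# hidden",
  "~~~",
})
```

The loop does one or two `string.match` calls per line and no tree traversal, which is consistent with the benchmark observation that full-buffer scans favor regex.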

@YousefHadder
Owner

@ten3roberts let me know what you think of the recent changes and the above comment.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

lua/markdown-plus/list/parser.lua:277

  • parse_list_line falls back to regex whenever the treesitter attempt returns nil. That means if treesitter is available but determines the row is not a list item (e.g., inside fenced code blocks or other constructs TS parses differently), regex can still misclassify the line as a list—undermining the stated goal of using TS for more accurate parsing. Consider making TS authoritative when available (e.g., only fall back to regex when TS is unavailable, or when TS positively identifies a list_item but marker parsing needs regex such as letter markers).
---Parse a line to detect list information
---Uses treesitter when row is provided and available, falls back to regex
---@param line string Line to parse
---@param row number 1-indexed row for treesitter
---@return markdown-plus.ListInfo|nil List info or nil if not a list
function M.parse_list_line(line, row)
  if not line then
    return nil
  end

  -- Try treesitter first (if row provided)
  local ts_result = row and parse_list_line_ts(row) or nil
  if ts_result then
    return ts_result
  end

  -- Fall through to regex if ts returns nil
  -- (handles letter lists, ts unavailable, continuation lines, etc.)

  -- Fallback to regex
  return parse_list_line_regex(line)
end
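
One way the suggestion above could be realized is to have the treesitter path report why it returned no result, so the caller can tell "TS says this is not a list" apart from "TS could not decide". The sketch below is hypothetical: the status strings and the `make_parse_list_line` factory are invented for illustration, and the two parsers are injected so the dispatch logic can be tried without Neovim.

```lua
-- Sketch: treesitter is authoritative when it is available.
-- `parse_ts(row)` is assumed to return (result, status), with status one of:
--   "ok"           result is a complete list info table
--   "not_list"     TS parsed the row and it is not a list item
--   "needs_regex"  TS found a list item but the marker needs regex (e.g. letter lists)
--   "unavailable"  no parser; regex has to decide
local function make_parse_list_line(parse_ts, parse_regex)
  return function(line, row)
    if not line then
      return nil
    end
    if not row then
      return parse_regex(line) -- no row given: regex is the only option
    end

    local result, status = parse_ts(row)
    if status == "ok" then
      return result
    elseif status == "not_list" then
      return nil -- trust TS; do not let regex reclassify the line
    end
    -- "needs_regex", "unavailable", or anything unexpected
    return parse_regex(line)
  end
end
```

With this shape, a line such as `- item` inside a fenced code block yields nil when TS is available, instead of falling through to the regex parser.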

Copilot AI review requested due to automatic review settings February 8, 2026 19:09
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 9, 2026 00:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

lua/markdown-plus/treesitter/init.lua:165

  • The docstring says this “efficiently queries all nodes”, but the implementation recursively walks the entire syntax tree. Either adjust the comment to match the behavior, or switch to a TS query-based approach if you want the “query” efficiency claim to be accurate.
---Get set of line numbers inside nodes of a specific type
---Efficiently queries all nodes of the type and collects their line ranges
---@param node_type string Node type to find (e.g. M.nodes.FENCED_CODE_BLOCK)
---@return table<number, boolean>|nil Line number set (1-indexed), or nil if ts unavailable
function M.get_lines_in_node_type(node_type)
  local parser = M.get_parser()
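
For context on what a recursive walk involves, here is a minimal sketch. It is not the PR's implementation (the snippet above is truncated); `collect_lines` is a hypothetical helper that relies only on the `TSNode` methods `:type()`, `:range()`, `:named_child_count()`, and `:named_child(i)`. The handling of a node that ends at column 0 is an assumption about how block ranges are reported.

```lua
-- Recursive walk: visits named nodes under `root` and records the rows covered
-- by nodes of `node_type`. Work grows with the size of the syntax tree.
local function collect_lines(root, node_type, lines)
  lines = lines or {}
  if root:type() == node_type then
    local start_row, _, end_row, end_col = root:range()
    -- Ranges are 0-indexed with an exclusive end position; a node that ends
    -- at column 0 is assumed not to cover that final row.
    if end_col == 0 and end_row > start_row then
      end_row = end_row - 1
    end
    for row = start_row, end_row do
      lines[row + 1] = true -- store as 1-indexed line numbers
    end
    return lines -- children lie within this range, so no need to descend
  end
  for i = 0, root:named_child_count() - 1 do
    collect_lines(root:named_child(i), node_type, lines)
  end
  return lines
end
```

In Neovim the root would typically come from something like `parser:parse()[1]:root()`. A query-based variant would instead compile a pattern for the node type once and iterate its captures, leaving the traversal to the treesitter library rather than to Lua.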

Copilot AI review requested due to automatic review settings February 10, 2026 09:57
@ten3roberts
Contributor Author

Thank you. I think the changes make sense.

What did you use for testing the perf?

And I agree: the TS advantage is accuracy (especially for languages like C++, whose syntax, e.g. function signature declarations, can't be parsed with regex) and its iterative nature. Its startup/fixed costs are higher.

So a hybrid approach is best.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

@YousefHadder
Owner

@ten3roberts I used a performance test written by Copilot to measure and compare the performance of both approaches.

Copilot AI review requested due to automatic review settings February 12, 2026 03:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
