feat(treesitter): treesitter based markdown parsing by ten3roberts · Pull Request #201 · YousefHadder/markdown-plus.nvim

ten3roberts · 2026-01-04T10:34:26Z

Description

This PR supplements the existing markdown structure parsing, which relies on string.match and regex-like patterns, with treesitter-based parsing. Treesitter is more accurate and less error-prone for constructs such as commented lists and nested code block syntax, which can throw off regex-based parsing because patterns can't always handle complex escaping.

If TS parsing is not available, it falls back to the previous regex parsing.

Type of Change

Related Issues

Fixes [... TODO: open issue]

Testing

Tested manually (in my own usage, with the test suite, and with an example md file)
[-] Added/updated tests (if applicable)

Checklist

Remaining Items before finished

YousefHadder · 2026-01-04T21:16:38Z

Thank you for initiating this change, I really appreciate it.

One note is that currently, TreeSitter utilities are in lua/markdown-plus/format/treesitter.lua, but non-format modules (lists, headers, footnotes) are importing from it. This creates confusing dependencies - why is the list parser importing from the format module?

I'd like to refactor the shared TreeSitter utilities into a dedicated module before extending to other features.

Proposed Structure

lua/markdown-plus/
├── treesitter/
│   └── init.lua              # Shared TS utilities (new)
│       ├── M.nodes           # Node type constants
│       ├── is_available()
│       ├── get_parser()
│       ├── get_node_at_cursor()
│       ├── get_node_at_position()
│       ├── find_ancestor()
│       ├── is_row_in_node_type()
│       ├── get_lines_in_node_type()
│       └── is_in_fenced_code_block()
│
├── format/
│   └── treesitter.lua        # Format-specific TS functions
│       ├── get_formatting_node_at_cursor()
│       ├── get_any_format_at_cursor()
│       └── remove_formatting_from_node()
│
├── list/parser.lua           # imports "markdown-plus.treesitter"
├── headers/parser.lua        # imports "markdown-plus.treesitter"
├── footnotes/parser.lua      # imports "markdown-plus.treesitter"
└── utils.lua                 # imports "markdown-plus.treesitter"

This way:

- Shared TS utilities live in treesitter/init.lua
- Format-specific TS functions stay in format/treesitter.lua
- All modules import from the appropriate location
- Easier to extend to other features (links, images, tables, quotes)

I noticed you plan to open an issue for tracking this. I would really appreciate it if you could consider the idea of a shared TS module as part of this refactoring.
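As a rough illustration of what the shared module could expose, here is a minimal sketch of `M.nodes` and `find_ancestor()`. The function body and the choice of constants are assumptions, not the PR's actual code. The sketch only relies on a node offering `:type()` and `:parent()`, as Neovim's `TSNode` does, so it can also be exercised with plain mock tables outside Neovim.

```lua
-- Hypothetical sketch of part of lua/markdown-plus/treesitter/init.lua.
local M = {}

-- Node type constants. The strings are node names from the tree-sitter
-- markdown grammar; which ones the plugin really defines is an assumption.
M.nodes = {
  FENCED_CODE_BLOCK = "fenced_code_block",
  LIST_ITEM = "list_item",
}

---Walk up from `node` until a node of type `node_type` is found.
---@param node table|nil Anything exposing :type() and :parent()
---@param node_type string
---@return table|nil The matching node (possibly `node` itself), or nil
function M.find_ancestor(node, node_type)
  local current = node
  while current do
    if current:type() == node_type then
      return current
    end
    current = current:parent()
  end
  return nil
end

-- Mock node used only to exercise the function outside Neovim.
local function mock_node(type_name, parent)
  return {
    type = function() return type_name end,
    parent = function() return parent end,
  }
end

local document = mock_node("document", nil)
local block = mock_node(M.nodes.FENCED_CODE_BLOCK, document)
local content = mock_node("code_fence_content", block)

-- A node inside the block resolves to the block; the root has no such ancestor.
local found = M.find_ancestor(content, M.nodes.FENCED_CODE_BLOCK)
local missing = M.find_ancestor(document, M.nodes.LIST_ITEM)
```

Keeping the type names in one table means a parser module such as list/parser.lua would refer to `M.nodes.LIST_ITEM` rather than repeating the string.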

- Add centralized debug logging utility (ts.log) for treesitter operations
- Fix infinite loop caused by aggressive parser:invalidate() on every call
- Query tree directly with named_descendant_for_range instead of vim.treesitter.get_node
- Add proper type annotations (TSNode, vim.treesitter.LanguageTree)
- Add centralized node type constants (M.nodes) to avoid hardcoded strings
- Update documentation with treesitter integration section and troubleshooting
- Improve logging to distinguish expected fallbacks (letter lists) from unexpected ones

Debug logging can be enabled with: require('markdown-plus.format.treesitter').debug = true

Copilot

Pull request overview

Refactors markdown parsing to use a shared Tree-sitter utility module for more accurate detection (with regex fallback), and updates list/header/footnote logic to benefit from Tree-sitter where available.

Changes:

- Added a shared markdown-plus.treesitter module and migrated code-block/header detection to use it with regex fallback.
- Updated list parsing API to accept a row number and prefer Tree-sitter parsing when possible.
- Adjusted specs and multiple list handlers/utilities to pass row numbers for Tree-sitter-aware parsing.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| spec/markdown-plus/list_spec.lua | Updates tests to parse via buffer+row to exercise the new row-aware list parsing path. |
| lua/markdown-plus/utils.lua | Switches code-block detection to the new shared Tree-sitter module. |
| lua/markdown-plus/treesitter/init.lua | Introduces shared Tree-sitter helpers (parser access, node lookup, fenced code block helpers). |
| lua/markdown-plus/table/conversion.lua | Treats `~~~` as a code fence in CSV/table heuristics. |
| lua/markdown-plus/list/shared.lua | Passes row numbers into list parsing for Tree-sitter support. |
| lua/markdown-plus/list/renumber.lua | Passes row numbers into list parsing for Tree-sitter support. |
| lua/markdown-plus/list/parser.lua | Adds Tree-sitter-driven list parsing (with regex fallback) and changes the parse API to accept row. |
| lua/markdown-plus/list/handlers.lua | Updates handlers to pass cursor row into list parsing and clarifies code-block skipping wrapper docs. |
| lua/markdown-plus/list/checkbox.lua | Passes `line_num` into list parsing for Tree-sitter support. |
| lua/markdown-plus/headers/parser.lua | Uses Tree-sitter line sets to exclude fenced code blocks when collecting headers (regex fallback retained). |
| lua/markdown-plus/format/treesitter.lua | Delegates Tree-sitter node retrieval and fenced code block detection to the shared module. |
| lua/markdown-plus/format/patterns.lua | Updates comments around Tree-sitter node types (but contains an outdated reference). |
| lua/markdown-plus/footnotes/parser.lua | Uses Tree-sitter line sets to exclude fenced code blocks (regex fallback retained). |

lua/markdown-plus/list/parser.lua

lua/markdown-plus/treesitter/init.lua

lua/markdown-plus/format/patterns.lua

lua/markdown-plus/list/parser.lua

YousefHadder · 2026-02-08T18:55:00Z

Performance: hybrid TS + regex strategy

I benchmarked treesitter vs regex across different buffer sizes (100–2000 lines, 1000 iterations each) and found that TS is not universally faster:

| Operation | Winner | Why |
| --- | --- | --- |
| Code block detection at cursor | TS wins (7-11x faster at 500+ lines) | TS does a constant-time ancestor walk; regex must scan every line from top of file to cursor |
| Full-buffer code block collection | Regex wins (55-60x faster) | `parse(true)` + recursive Lua tree walk is expensive; regex is a simple `string.match` loop |
| Single-line list parsing | Regex wins (10-15x faster) | One `string.match` call vs TS node lookup + ancestor walk |
| Header scanning (full buffer) | Regex wins (38-44x faster) | Same bottleneck as full-buffer code block collection |

The cost comes from get_parser() calling parse(true) on every invocation and the Lua-side recursive tree walk. For operations that scan the entire buffer, regex is dramatically faster because it's just iterating lines with string.match.

Based on these numbers, I've switched to a hybrid approach in the latest commits:

- TS for point queries: is_in_fenced_code_block() (called on every Enter/Tab keypress), format node detection, is_row_in_node_type(). These are the cases where TS genuinely outperforms regex.
- Regex for full-buffer scans: get_all_headers() and footnote code block exclusion now use regex directly instead of routing through ts.get_lines_in_node_type().

The value of treesitter in this PR is accuracy (correct code block detection that regex can't match), not speed. The hybrid approach gives us both: correctness from TS where it matters, and performance from regex where TS is slower.
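For readers unfamiliar with the regex side of this trade-off, a full-buffer fence scan can be as small as the sketch below. It is a simplified illustration, not the plugin's implementation: the function name `fenced_code_lines` is hypothetical, and the fence rules are reduced to "three or more backticks or tildes after at most three spaces, closed by a fence of the same character that is at least as long".

```lua
---Collect the 1-indexed line numbers that belong to fenced code blocks,
---including the fence lines themselves.
---@param lines string[] Buffer contents, one entry per line
---@return table<number, boolean>
local function fenced_code_lines(lines)
  local inside = {}
  local fence_char, fence_len = nil, 0

  for i, line in ipairs(lines) do
    -- Up to three leading spaces, then at least three backticks or tildes.
    local fence = line:match("^ ? ? ?(```+)") or line:match("^ ? ? ?(~~~+)")

    if fence_char then
      inside[i] = true
      -- A closing fence uses the same character, is at least as long as the
      -- opening fence, and carries no info string.
      local closes = fence
        and fence:sub(1, 1) == fence_char
        and #fence >= fence_len
        and line:match("^%s*[`~]+%s*$")
      if closes then
        fence_char, fence_len = nil, 0
      end
    elseif fence then
      fence_char, fence_len = fence:sub(1, 1), #fence
      inside[i] = true
    end
  end

  return inside
end

local sample = {
  "# Title",
  "```lua",
  "# not a header",
  "```",
  "- item",
}
local in_code = fenced_code_lines(sample)
```

A header scan can then skip any row where `in_code[row]` is set. The loop does one or two `string.match` calls per line and never builds a syntax tree, which is consistent with the regex path winning on whole-buffer operations, at the cost of missing edge cases that the grammar handles.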

YousefHadder · 2026-02-08T18:55:58Z

@ten3roberts let me know what you think of the recent changes and the above comment.

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

lua/markdown-plus/list/parser.lua:277

parse_list_line falls back to regex whenever the treesitter attempt returns nil. That means if treesitter is available but determines the row is not a list item (e.g., inside fenced code blocks or other constructs TS parses differently), regex can still misclassify the line as a list—undermining the stated goal of using TS for more accurate parsing. Consider making TS authoritative when available (e.g., only fall back to regex when TS is unavailable, or when TS positively identifies a list_item but marker parsing needs regex such as letter markers).

---Parse a line to detect list information
---Uses treesitter when row is provided and available, falls back to regex
---@param line string Line to parse
---@param row number 1-indexed row for treesitter
---@return markdown-plus.ListInfo|nil List info or nil if not a list
function M.parse_list_line(line, row)
  if not line then
    return nil
  end

  -- Try treesitter first (if row provided)
  local ts_result = row and parse_list_line_ts(row) or nil
  if ts_result then
    return ts_result
  end

  -- Fall through to regex if ts returns nil
  -- (handles letter lists, ts unavailable, continuation lines, etc.)

  -- Fallback to regex
  return parse_list_line_regex(line)
end
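
One way to act on the suggestion above is to have the treesitter path report why it produced no result, so the caller can tell "TS unavailable" apart from "TS says this is not a list". The sketch below is an assumption-laden illustration: the status strings and the injected `ts_parse` / `regex_parse` parameters are hypothetical and exist only so the example runs standalone; they are not the plugin's API.

```lua
---Parse a line, treating treesitter as authoritative when it gives a verdict.
---@param line string|nil Line to parse
---@param row number|nil 1-indexed row for treesitter
---@param ts_parse fun(row: number): string, table|nil Hypothetical: returns status, info
---@param regex_parse fun(line: string): table|nil
---@return table|nil List info or nil if not a list
local function parse_list_line(line, row, ts_parse, regex_parse)
  if not line then
    return nil
  end
  if not row then
    return regex_parse(line)
  end

  local status, info = ts_parse(row)
  if status == "list_item" and info then
    return info -- TS parsed the item completely
  elseif status == "not_list" then
    return nil -- TS positively ruled the row out (e.g. inside a code block)
  end

  -- Any other status (TS unavailable, marker TS cannot interpret, ...)
  -- falls back to regex.
  return regex_parse(line)
end

-- Stubs standing in for the real parsers.
local function regex_stub(line)
  if line:match("^%- ") then
    return { marker = "-", source = "regex" }
  end
  return nil
end

local in_code_block = parse_list_line("- x", 3, function() return "not_list" end, regex_stub)
local ts_missing = parse_list_line("- x", 3, function() return "unavailable" end, regex_stub)
local ts_hit = parse_list_line("- x", 3, function()
  return "list_item", { marker = "-", source = "ts" }
end, regex_stub)
```

How letter lists map onto these statuses depends on what the grammar reports for such rows, so the existing fallback for them would need to be kept in mind before making TS fully authoritative.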

lua/markdown-plus/footnotes/parser.lua

lua/markdown-plus/list/parser.lua

spec/markdown-plus/list_spec.lua

lua/markdown-plus/headers/parser.lua

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

lua/markdown-plus/list/renumber.lua

lua/markdown-plus/treesitter/init.lua

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

lua/markdown-plus/treesitter/init.lua:165

The docstring says this “efficiently queries all nodes”, but the implementation recursively walks the entire syntax tree. Either adjust the comment to match the behavior, or switch to a TS query-based approach if you want the “query” efficiency claim to be accurate.

---Get set of line numbers inside nodes of a specific type
---Efficiently queries all nodes of the type and collects their line ranges
---@param node_type string Node type to find (e.g. M.nodes.FENCED_CODE_BLOCK)
---@return table<number, boolean>|nil Line number set (1-indexed), or nil if ts unavailable
function M.get_lines_in_node_type(node_type)
  local parser = M.get_parser()
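
To make the point of this comment concrete, the sketch below shows what a recursive walk that collects line ranges amounts to. It is not the PR's code. It assumes only that nodes expose `:type()`, `:range()` (0-indexed start row, start column, end row, end column) and `:iter_children()`, as Neovim's `TSNode` does, and it uses mock nodes so it can run outside Neovim. A query-based variant would replace the recursion with a compiled tree-sitter query.

```lua
---Recursively collect the 1-indexed lines covered by nodes of `node_type`.
---Every node in the tree is visited, so cost grows with tree size.
---@param node table Anything exposing :type(), :range(), :iter_children()
---@param node_type string
---@param lines table<number, boolean>|nil Accumulator (created on first call)
---@return table<number, boolean>
local function collect_lines(node, node_type, lines)
  lines = lines or {}
  if node:type() == node_type then
    local start_row, _, end_row, end_col = node:range()
    -- The end position is exclusive: column 0 means the node actually
    -- stopped at the end of the previous row.
    if end_col == 0 and end_row > start_row then
      end_row = end_row - 1
    end
    for row = start_row, end_row do
      lines[row + 1] = true -- convert 0-indexed rows to 1-indexed lines
    end
  end
  for child in node:iter_children() do
    collect_lines(child, node_type, lines)
  end
  return lines
end

-- Mock node used only to exercise the walk outside Neovim.
local function mock_node(type_name, range, children)
  children = children or {}
  return {
    type = function() return type_name end,
    range = function() return range[1], range[2], range[3], range[4] end,
    iter_children = function()
      local i = 0
      return function()
        i = i + 1
        return children[i]
      end
    end,
  }
end

local root = mock_node("document", { 0, 0, 5, 0 }, {
  mock_node("paragraph", { 0, 0, 1, 0 }),
  mock_node("fenced_code_block", { 1, 0, 4, 0 }),
  mock_node("paragraph", { 4, 0, 5, 0 }),
})
local code_lines = collect_lines(root, "fenced_code_block")
```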

lua/markdown-plus/list/parser.lua

lua/markdown-plus/headers/parser.lua

ten3roberts · 2026-02-10T10:01:04Z

Thank you. I think the changes make sense.

What did you use for testing the perf?

And I agree: the advantage of TS is accuracy (especially for languages such as C++, whose syntax, for instance function signature declarations, cannot be parsed with regex) and its iterative nature. Its startup (fixed) costs are higher.

So a hybrid approach is best.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

doc/markdown-plus.txt

lua/markdown-plus/list/parser.lua

lua/markdown-plus/treesitter/init.lua

lua/markdown-plus/format/treesitter.lua

lua/markdown-plus/list/parser.lua

YousefHadder · 2026-02-10T17:24:59Z

@ten3roberts I used a performance test written by Copilot to measure and compare the performance of both approaches.
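
The benchmark script itself is not shown in this thread. As a hedged illustration of the general shape such a test could take, the sketch below times a regex header scan over a synthetic buffer. The helper names (`bench`, `make_lines`, `count_headers`) and the synthetic content are assumptions, and `os.clock()` is used only so the example runs in plain Lua; inside Neovim a higher-resolution timer such as `vim.loop.hrtime()` would likely be a better choice.

```lua
---Run `fn` `iterations` times and return the mean CPU seconds per call.
local function bench(fn, iterations)
  local start = os.clock()
  for _ = 1, iterations do
    fn()
  end
  return (os.clock() - start) / iterations
end

---Build a synthetic buffer: every tenth line is a header, the rest list items.
local function make_lines(n)
  local lines = {}
  for i = 1, n do
    if i % 10 == 0 then
      lines[i] = "## Header " .. i
    else
      lines[i] = "- item " .. i
    end
  end
  return lines
end

---Regex-style header scan: one string.match per line.
local function count_headers(lines)
  local count = 0
  for _, line in ipairs(lines) do
    if line:match("^#+%s") then
      count = count + 1
    end
  end
  return count
end

local lines = make_lines(2000)
local per_call = bench(function()
  count_headers(lines)
end, 1000)
print(string.format("regex header scan: %.4f ms per call", per_call * 1000))
```

To compare against treesitter, the same `bench` call would wrap the TS-based function on a real buffer of the same size, ideally after a warm-up run so that one-time parser setup is not counted against it.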

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

doc/markdown-plus.txt

lua/markdown-plus/format/treesitter.lua

lua/markdown-plus/treesitter/init.lua

lua/markdown-plus/list/parser.lua

ten3roberts added 2 commits January 4, 2026 11:24

feat: treesitter parsing

b6a1dbc

refactor: constants definition

f35c7ba

ten3roberts changed the title ~~Feat: Treesitter based markdown parsing~~ [WIP] Feat: Treesitter based markdown parsing Jan 4, 2026

ten3roberts added 2 commits January 23, 2026 21:01

refactor: extract shared treesitter utilities into dedicated module

43a8ebf

YousefHadder marked this pull request as ready for review February 8, 2026 18:41

Copilot AI review requested due to automatic review settings February 8, 2026 18:42

YousefHadder added 2 commits February 8, 2026 12:42

refactor(treesitter): extract shared module from format/treesitter

2ba0340

fix(treesitter): use ignore_injections for code block detection

3a2a967

YousefHadder changed the title ~~[WIP] Feat: Treesitter based markdown parsing~~ feat(treesitter): Treesitter based markdown parsing Feb 8, 2026

YousefHadder changed the title ~~feat(treesitter): Treesitter based markdown parsing~~ feat(treesitter): treesitter based markdown parsing Feb 8, 2026

Merge branch 'main' into refactor/ts-parse-utils

66ef589

Copilot AI reviewed Feb 8, 2026

YousefHadder added 2 commits February 8, 2026 12:47

fix(list): guard against nil row and non-numeric markers in TS path

8ee3c43

perf: use regex for full-buffer code block scanning

56f7787

Copilot AI review requested due to automatic review settings February 8, 2026 18:53

Copilot started reviewing on behalf of YousefHadder February 8, 2026 18:54

Copilot AI reviewed Feb 8, 2026

YousefHadder added 2 commits February 8, 2026 13:01

docs: fix optional param annotation and remove stale benchmark references

753e58d

Merge branch 'main' into refactor/ts-parse-utils

2eb1e2b

Copilot AI review requested due to automatic review settings February 8, 2026 19:09

Copilot started reviewing on behalf of YousefHadder February 8, 2026 19:09

Copilot AI reviewed Feb 8, 2026

lua/markdown-plus/list/renumber.lua (outdated, resolved)

lua/markdown-plus/treesitter/init.lua (resolved)

lua/markdown-plus/treesitter/init.lua (resolved)

copilot suggestion

d34da3b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings February 9, 2026 00:31

Copilot started reviewing on behalf of YousefHadder February 9, 2026 00:32

Copilot AI reviewed Feb 9, 2026

lua/markdown-plus/list/parser.lua (resolved)

lua/markdown-plus/headers/parser.lua (resolved)

ten3roberts added 2 commits February 10, 2026 10:24

Merge remote-tracking branch 'origin/refactor/ts-parse-utils' into refactor/ts-parse-utils

6dfc195

chore: continue refactor after merge and cleanup

3c8a99d

Copilot AI review requested due to automatic review settings February 10, 2026 09:57

Copilot started reviewing on behalf of ten3roberts February 10, 2026 09:58

Copilot AI reviewed Feb 10, 2026

YousefHadder added 2 commits February 11, 2026 21:21

fix(treesitter): address review comments on PR YousefHadder#201

3ebbbeb

remove tags file

e46347f

Copilot AI review requested due to automatic review settings February 12, 2026 03:27

Copilot started reviewing on behalf of YousefHadder February 12, 2026 03:27

Copilot AI reviewed Feb 12, 2026

doc/markdown-plus.txt (outdated, resolved)

lua/markdown-plus/format/treesitter.lua (resolved)

lua/markdown-plus/treesitter/init.lua (resolved)

lua/markdown-plus/list/parser.lua (resolved)

fix(treesitter): correct neovim version in docs and forward debug flag

abbf555
