klei22 (Collaborator) commented Feb 1, 2026

Summary

Adds a complete preprocessing pipeline for the PyraNet-Verilog dataset, including dataset extraction, local CSV caching, Tree-sitter–based syntax highlighting, and batch-validated generation of byte-aligned highlight masks.

Key changes

  • Cache PyraNet-Verilog locally as CSV to avoid repeated downloads
  • Auto-download highlights.scm if missing
  • Add a robust Tree-sitter Verilog highlighter that filters incompatible query rules and guarantees byte-exact output
  • Add batch conversion script to generate highlighted/ts_.v, validate lengths, and report success/failure stats
  • Update README to document the full workflow

This enables reproducible, alignment-safe preprocessing for downstream analysis and ML training.
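
To make "alignment-safe" concrete, here is a minimal sketch of the kind of length check the batch script is described as performing. The function names `validate_mask` and `summarize` are hypothetical and are not taken from this PR; the only assumption is that a highlight mask carries exactly one byte per byte of source.

```python
from typing import Dict, List, Tuple


def validate_mask(source: bytes, mask: bytes) -> bool:
    """Return True when the mask has exactly one byte per source byte."""
    return len(source) == len(mask)


def summarize(pairs: List[Tuple[bytes, bytes]]) -> Dict[str, int]:
    """Count how many (source, mask) pairs pass the length check."""
    ok = sum(1 for src, mask in pairs if validate_mask(src, mask))
    return {"ok": ok, "failed": len(pairs) - ok}


# Lengths are compared on encoded bytes, not on str characters,
# because a non-ASCII character occupies more than one byte in UTF-8.
comment = "// é".encode("utf-8")
```

Comparing encoded bytes rather than decoded strings matters for files with non-ASCII comments, where the character count and the byte count differ.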

Copilot AI left a comment

Pull request overview

This pull request adds utilities for working with the PyraNet-Verilog dataset from Hugging Face. The PR introduces a complete workflow for downloading Verilog code samples and generating tree-sitter syntax highlighting annotations aligned byte-for-byte with the original source files.

Changes:

  • Adds dataset download and extraction scripts with CSV caching support
  • Implements tree-sitter-based syntax highlighting with robust query filtering
  • Provides batch processing utilities with length validation and error handling
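
As a rough illustration of the CSV caching idea, the sketch below reads from a local cache when it exists and otherwise calls a download function once and writes the result. `load_rows`, the `fetch` callable, and the column names used in the usage note are assumptions for illustration; they are not the actual interface of organize_datasets.py.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def load_rows(cache: Path, fetch: Callable[[], Iterable[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Read rows from the CSV cache, calling fetch() only when the cache is missing."""
    if cache.exists():
        with cache.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))

    rows = list(fetch())
    if rows:
        # Write to a temporary file first so an interrupted run
        # never leaves a partially written cache behind.
        tmp = cache.with_name(cache.name + ".tmp")
        with tmp.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        tmp.replace(cache)
    return rows
```

A caller might pass a function that wraps the Hugging Face download and yields dictionaries such as `{"code": "...", "description": "..."}` (hypothetical column names); the second and later runs then never touch the network.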

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.

| File | Description |
|---|---|
| data/pyrranet-verilog/README.md | Documentation for dataset utilities and tree-sitter highlighting workflow |
| data/pyrranet-verilog/get_dataset.sh | Bash wrapper script to download highlights.scm and invoke organize_datasets.py |
| data/pyrranet-verilog/organize_datasets.py | Python script to download PyraNet-Verilog dataset with CSV caching |
| data/pyrranet-verilog/verilog_ts_colorize.py | Core utility for tree-sitter parsing and per-byte syntax highlighting |
| data/pyrranet-verilog/make_highlighted.py | Batch processor for converting Verilog files to highlighted versions |
| data/pyrranet-verilog/highlights.scm | Tree-sitter query file for Verilog syntax highlighting rules |

    return names


def language_field_names(lang: Language) -> set[str]:

Copilot AI Feb 1, 2026

The function uses Python 3.9+ syntax (set[str]) in its type hints. The file does contain 'from __future__ import annotations' at line 27, but it appears after the other imports. Either move that import above every other import (a __future__ import must be the first statement in a module, after the docstring) so the annotations are handled consistently, or use 'Set[str]' from the typing module.
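
The placement rule can be checked without touching the PR's files. The sketch below compiles two small source strings, which are made up for illustration and are not the contents of verilog_ts_colorize.py, and shows that Python rejects a `__future__` import that follows another import.

```python
# Source with the __future__ import first: set[str] in annotations is fine,
# because annotations are kept as strings and not evaluated at definition time.
GOOD = '''
"""Module docstring may come first."""
from __future__ import annotations
import os

def names(items: list[str]) -> set[str]:
    return set(items)
'''

# Source with the __future__ import after another import: rejected by the compiler.
BAD = '''
import os
from __future__ import annotations
'''


def compiles(source: str) -> bool:
    """Return True if the source compiles, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False
```

Here `compiles(GOOD)` succeeds and `compiles(BAD)` fails, which is why the import has to sit directly after the module docstring.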

    return sexps


def sexp_mentions_unknown_syntax(sexp: str, known_nodes: set[str], known_fields: set[str]) -> bool:

Copilot AI Feb 1, 2026

The function uses Python 3.9+ syntax (set[str]) in its type hints. To ensure compatibility with Python 3.8 and earlier, either make sure 'from __future__ import annotations' is in effect at the top of the file, or use 'Set[str]' and import Set from typing.

# QueryCursor in tree_sitter 0.25.x exists, but constructor/signatures vary.
try:
    from tree_sitter import QueryCursor
except Exception:

Copilot AI Feb 1, 2026

Catching 'Exception' here is overly broad. It would be better to catch the specific exceptions that can occur when QueryCursor is unavailable in older versions of tree-sitter: ImportError, or its subclass ModuleNotFoundError.

Suggested change
- except Exception:
+ except (ImportError, ModuleNotFoundError):
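
A minimal sketch of the narrowed fallback follows. Because ModuleNotFoundError is a subclass of ImportError, catching ImportError alone covers both a missing tree_sitter package and an older release that lacks QueryCursor. The `has_query_cursor` helper is hypothetical and only illustrates how later code could branch on the result.

```python
try:
    from tree_sitter import QueryCursor
except ImportError:
    # Raised when tree_sitter is not installed (ModuleNotFoundError)
    # or when the installed version does not export QueryCursor.
    QueryCursor = None


def has_query_cursor() -> bool:
    """Report whether the QueryCursor-based code path can be used."""
    return QueryCursor is not None
```

Any other exception raised during the import, such as a broken native extension surfacing as OSError, would now propagate instead of being silently treated as "QueryCursor not available".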

try:
    _ = Query(lang, sexp)
    kept.append(sexp)
except Exception as e:

Copilot AI Feb 1, 2026

Catching 'Exception' here is overly broad. Tree-sitter Query compilation failures raise specific exception types; consider catching only those so that unexpected errors are not masked.

Suggested change
- except Exception as e:
+ except (ValueError, TypeError, RuntimeError) as e:
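
The filtering loop with a narrowed exception list could look roughly like the sketch below. To keep it runnable without tree-sitter installed, the compile step is passed in as a callable; `filter_compilable` and its parameters are hypothetical names, and the default exception tuple simply mirrors the suggestion above rather than a verified list of what `Query` raises.

```python
from typing import Callable, List, Tuple, Type


def filter_compilable(
    sexps: List[str],
    compile_fn: Callable[[str], object],
    expected: Tuple[Type[BaseException], ...] = (ValueError, TypeError, RuntimeError),
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split query patterns into those that compile and those that fail as expected."""
    kept: List[str] = []
    dropped: List[Tuple[str, str]] = []
    for sexp in sexps:
        try:
            compile_fn(sexp)
        except expected as e:
            # Record why the pattern was dropped so the filter stays auditable.
            dropped.append((sexp, f"{type(e).__name__}: {e}"))
        else:
            kept.append(sexp)
    return kept, dropped
```

With tree-sitter available, the callable would be something like `lambda s: Query(lang, s)`. Exceptions outside the expected tuple propagate to the caller, so a genuine bug is not mistaken for an incompatible query rule.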

Comment on lines +1 to +4
Here’s a rewritten **README.md** that matches what you actually have now: `get_dataset.sh`, `organize_datasets.py`, `verilog_ts_colorize.py`, and the Tree-sitter highlight workflow (including the “filter the Neovim query file to what the Python grammar supports” behavior).

You can paste this over your current `README.md`.

Copilot AI Feb 1, 2026

The README begins with meta-commentary that should not be included in the final documentation. These lines (1-4) appear to be leftover from a conversation or draft and should be removed. The actual README content should start from line 5 with the markdown header.

if tmp.exists():
    try:
        tmp.unlink()
    except OSError:

Copilot AI Feb 1, 2026

'except' clause does nothing but pass and there is no explanatory comment.
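
One way to address this, sketched under the assumption that the cleanup really is best-effort, is to state the intent in a comment and use `contextlib.suppress` instead of an empty handler. The `remove_tmp` helper is a hypothetical name, not a function from this PR.

```python
import contextlib
from pathlib import Path


def remove_tmp(tmp: Path) -> None:
    """Best-effort removal of a leftover temporary file."""
    # Cleanup is optional: a failure here must not mask the original error,
    # so OSError (permissions, races with other processes) is deliberately ignored.
    # FileNotFoundError is a subclass of OSError, so no exists() check is needed.
    with contextlib.suppress(OSError):
        tmp.unlink()
```

On Python 3.8 and later, `tmp.unlink(missing_ok=True)` is an alternative when only the "file already gone" case should be ignored and other failures should still surface.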

if tmp.exists():
    try:
        tmp.unlink()
    except OSError:

Copilot AI Feb 1, 2026

'except' clause does nothing but pass and there is no explanatory comment.

if tmp.exists():
    try:
        tmp.unlink()
    except OSError:

Copilot AI Feb 1, 2026

'except' clause does nothing but pass and there is no explanatory comment.

if tmp.exists():
    try:
        tmp.unlink()
    except OSError:

Copilot AI Feb 1, 2026

'except' clause does nothing but pass and there is no explanatory comment.

if tmp.exists():
    try:
        tmp.unlink()
    except OSError:

Copilot AI Feb 1, 2026

'except' clause does nothing but pass and there is no explanatory comment.
