Add verilog parallel datasets #732
base: master
Pull request overview
This pull request adds utilities for working with the PyraNet-Verilog dataset from Hugging Face. The PR introduces a complete workflow for downloading Verilog code samples and generating tree-sitter syntax highlighting annotations aligned byte-for-byte with the original source files.
Changes:
- Adds dataset download and extraction scripts with CSV caching support
- Implements tree-sitter-based syntax highlighting with robust query filtering
- Provides batch processing utilities with length validation and error handling
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| data/pyrranet-verilog/README.md | Documentation for dataset utilities and tree-sitter highlighting workflow |
| data/pyrranet-verilog/get_dataset.sh | Bash wrapper script to download highlights.scm and invoke organize_datasets.py |
| data/pyrranet-verilog/organize_datasets.py | Python script to download PyraNet-Verilog dataset with CSV caching |
| data/pyrranet-verilog/verilog_ts_colorize.py | Core utility for tree-sitter parsing and per-byte syntax highlighting |
| data/pyrranet-verilog/make_highlighted.py | Batch processor for converting Verilog files to highlighted versions |
| data/pyrranet-verilog/highlights.scm | Tree-sitter query file for Verilog syntax highlighting rules |

        return names


    def language_field_names(lang: Language) -> set[str]:

Copilot AI (Feb 1, 2026):
The function uses Python 3.9+ syntax (`set[str]`) in type hints without a proper `__future__` import. The file does have `from __future__ import annotations` at line 27, but it comes after the other imports. Either make the annotations usage consistent or use `Set[str]` from the `typing` module.
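The placement issue raised here can be checked in isolation. The sketch below is not taken from the PR: it compiles two small hypothetical source strings to show that a `from __future__` import is only accepted ahead of other statements, and that once it is in place `set[str]` is stored as a string rather than evaluated.

```python
# Hypothetical module sources, compiled from strings so the example is self-contained.
GOOD = """
from __future__ import annotations

def language_field_names(names) -> set[str]:
    return set(names)
"""

# The same import placed after another import statement.
BAD = "import os\nfrom __future__ import annotations\n"


def compiles(source: str) -> bool:
    """Return True if the source compiles, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


namespace = {}
exec(compile(GOOD, "<good>", "exec"), namespace)

# With deferred annotations the return annotation is kept as text.
return_annotation = namespace["language_field_names"].__annotations__["return"]

print("good compiles:", compiles(GOOD))
print("late __future__ import compiles:", compiles(BAD))
print("stored annotation:", repr(return_annotation))
```

If the late import compiled at all in the reviewed file, line 27 is presumably preceded only by comments or a docstring, which would make this comment moot; that is worth confirming against the actual file.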

        return sexps


    def sexp_mentions_unknown_syntax(sexp: str, known_nodes: set[str], known_fields: set[str]) -> bool:

Copilot AI (Feb 1, 2026):
The function uses Python 3.9+ syntax (`set[str]`) in type hints without deferring annotation evaluation. To stay compatible with Python 3.8 and earlier, either add `from __future__ import annotations` at the top of the file, or use `Set[str]` and import `Set` from `typing`.

    # QueryCursor in tree_sitter 0.25.x exists, but constructor/signatures vary.
    try:
        from tree_sitter import QueryCursor
    except Exception:

Copilot AI (Feb 1, 2026):
Catching `Exception` here is overly broad. It would be better to catch the specific exceptions, such as `ImportError` or `ModuleNotFoundError`, that might occur when `QueryCursor` is unavailable in older versions of tree-sitter.
Suggested change:

    - except Exception:
    + except (ImportError, ModuleNotFoundError):
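A minimal sketch of the narrowed optional import follows. The helper name `optional_attr` is hypothetical and not part of the PR. Since `ModuleNotFoundError` is a subclass of `ImportError`, catching `ImportError` alone covers both cases named in the suggestion.

```python
import importlib


def optional_attr(module_name: str, attr: str):
    """Return `module.attr`, or None if the module or the attribute is missing.

    Hypothetical helper: only a failed import is swallowed; any other
    error raised while importing the module still propagates.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:  # also catches ModuleNotFoundError, its subclass
        return None
    return getattr(module, attr, None)


# None when tree_sitter is not installed or predates QueryCursor.
QueryCursor = optional_attr("tree_sitter", "QueryCursor")

if QueryCursor is None:
    print("QueryCursor unavailable; a fallback code path would be used here")
```

Callers can then branch on `QueryCursor is None` instead of relying on a broad handler to hide the missing name.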

    try:
        _ = Query(lang, sexp)
        kept.append(sexp)
    except Exception as e:

Copilot AI (Feb 1, 2026):
Catching `Exception` here is overly broad. Tree-sitter `Query` compilation errors are typically specific exceptions. Consider catching more specific exception types to avoid masking unexpected errors.
Suggested change:

    - except Exception as e:
    + except (ValueError, TypeError, RuntimeError) as e:
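The filtering loop with a narrowed handler might look like the sketch below. It is illustrative only: `filter_compilable` and `fake_compile` are hypothetical stand-ins, and the exception types that tree-sitter's `Query` constructor actually raises depend on the binding version, so the tuple here simply mirrors the suggestion above and is an assumption.

```python
from typing import Callable, Iterable, List, Tuple, Type


def filter_compilable(
    sexps: Iterable[str],
    compile_query: Callable[[str], object],
    expected: Tuple[Type[BaseException], ...] = (ValueError, TypeError, RuntimeError),
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split query rules into those that compile and those that are rejected.

    Only the `expected` exception types count as "rule not supported";
    anything else propagates so real bugs are not masked.
    """
    kept: List[str] = []
    dropped: List[Tuple[str, str]] = []
    for sexp in sexps:
        try:
            compile_query(sexp)
        except expected as exc:
            dropped.append((sexp, repr(exc)))
        else:
            kept.append(sexp)
    return kept, dropped


def fake_compile(sexp: str) -> str:
    """Stand-in for Query(lang, sexp): rejects unbalanced parentheses."""
    if sexp.count("(") != sexp.count(")"):
        raise ValueError(f"unbalanced parentheses in {sexp!r}")
    return sexp


kept, dropped = filter_compilable(["(identifier) @variable", "(broken"], fake_compile)
print("kept:", kept)
print("dropped:", dropped)
```

Recording the dropped rules together with the error text keeps the filtering auditable, which matters when a grammar update silently changes which rules compile.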

    Here’s a rewritten **README.md** that matches what you actually have now: `get_dataset.sh`, `organize_datasets.py`, `verilog_ts_colorize.py`, and the Tree-sitter highlight workflow (including the “filter the Neovim query file to what the Python grammar supports” behavior).

    You can paste this over your current `README.md`.

Copilot AI (Feb 1, 2026):
The README begins with meta-commentary that should not be included in the final documentation. These lines (1-4) appear to be leftover from a conversation or draft and should be removed. The actual README content should start from line 5 with the markdown header.

    if tmp.exists():
        try:
            tmp.unlink()
        except OSError:

Copilot AI (Feb 1, 2026):
The `except` clause does nothing but `pass`, and there is no explanatory comment.
Copilot AI left the same comment on four more identical `tmp.unlink()` cleanup blocks (Feb 1, 2026).
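One way to address the silent `except` is to state the intent in a comment and surface unexpected failures. The following is a sketch, not the PR's code: the helper name `remove_temp_file` and the choice to warn on stderr are assumptions.

```python
import sys
import tempfile
from pathlib import Path


def remove_temp_file(tmp: Path) -> bool:
    """Best-effort removal of a temporary file.

    Returns True if the file is gone afterwards, False if removal failed.
    """
    try:
        tmp.unlink()
    except FileNotFoundError:
        # Already removed (for example by a concurrent run): nothing to do.
        return True
    except OSError as exc:
        # Cleanup must never abort the main job, but the failure is reported
        # so stale temporary files do not accumulate unnoticed.
        print(f"warning: could not remove {tmp}: {exc}", file=sys.stderr)
        return False
    return True


# Usage: create a scratch file, then clean it up twice to show idempotence.
with tempfile.TemporaryDirectory() as scratch_dir:
    scratch = Path(scratch_dir) / "example.tmp"
    scratch.write_bytes(b"partial download")
    first = remove_temp_file(scratch)
    second = remove_temp_file(scratch)
    still_exists = scratch.exists()
```

Handling `FileNotFoundError` separately also removes the need for the preceding `tmp.exists()` check, which is racy on its own.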
Summary
Adds a complete preprocessing pipeline for the PyraNet-Verilog dataset, including dataset extraction, local CSV caching, Tree-sitter–based syntax highlighting, and batch-validated generation of byte-aligned highlight masks.
Key changes
- Cache PyraNet-Verilog locally as CSV to avoid repeated downloads
- Auto-download highlights.scm if missing
- Add a robust Tree-sitter Verilog highlighter that filters incompatible query rules and guarantees byte-exact output
- Add batch conversion script to generate highlighted/ts_.v, validate lengths, and report success/failure stats
- Update README to document the full workflow
This enables reproducible, alignment-safe preprocessing for downstream analysis and ML training.
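To illustrate what a byte-aligned highlight mask with length validation means, here is a minimal sketch. It does not reproduce `verilog_ts_colorize.py`: the function `build_mask`, the `(start_byte, end_byte, label)` span format, and the single-character labels are all assumptions chosen for the example.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, int]  # (start_byte, end_byte, label byte), hypothetical format


def build_mask(source: bytes, spans: Iterable[Span], default: int = ord(" ")) -> bytes:
    """Return one label byte per source byte, checking the lengths match."""
    mask = bytearray([default]) * len(source)
    for start, end, label in spans:
        # Clamp to the source so a bad span can never change the mask length.
        start = max(0, min(start, len(source)))
        end = max(start, min(end, len(source)))
        mask[start:end] = bytes([label]) * (end - start)
    if len(mask) != len(source):
        raise ValueError(f"mask length {len(mask)} != source length {len(source)}")
    return bytes(mask)


# Offsets are in bytes, not characters: the comment below contains a 2-byte character.
source = "wire a; // é".encode("utf-8")
spans = [
    (0, 4, ord("k")),             # keyword `wire`
    (5, 6, ord("v")),             # identifier `a`
    (8, len(source), ord("c")),   # line comment to end of input
]
mask = build_mask(source, spans)
print(source)
print(mask)
```

Working in bytes rather than characters is what keeps the mask aligned when a file contains non-ASCII text, since Tree-sitter reports node positions as byte offsets.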