Skip to content

Feature: Add comprehensive parsers for new citation types#292

Open
BuffaloJames wants to merge 26 commits intofreelawproject:mainfrom
BuffaloJames:main
Open

Feature: Add comprehensive parsers for new citation types#292
BuffaloJames wants to merge 26 commits intofreelawproject:mainfrom
BuffaloJames:main

Conversation

@BuffaloJames
Copy link

  • Add 8 new citation types: constitutions, regulations, court rules, bills, session laws, journals, scientific IDs, AG opinions
  • Support all 50 states with state-specific regex patterns
  • Maintain 100% backward compatibility
  • Add comprehensive test coverage

Adds functionality to detect and classify constitutional citations, administrative regulations, court rules, legislative documents, journal articles, scientific identifiers, and state attorney general opinions across all U.S. jurisdictions.

Files tested successfully:

  • case_Democracy.txt: 235 citations detected
  • case_NC.txt: 103 citations detected
  • opinion.txt: 168 citations detected
  • Alabama and Kentucky AG opinion files processed

- Add 8 new citation types: constitutions, regulations, court rules, bills, session laws, journals, scientific IDs, AG opinions
- Support all 50 states with state-specific regex patterns
- Maintain 100% backward compatibility
- Add comprehensive test coverage

Adds functionality to detect and classify constitutional citations, administrative regulations, court rules, legislative documents, journal articles, scientific identifiers, and state attorney general opinions across all U.S. jurisdictions.

Files tested successfully:
- case_Democracy.txt: 235 citations detected
- case_NC.txt: 103 citations detected
- opinion.txt: 168 citations detected
- Alabama and Kentucky AG opinion files processed
@CLAassistant
Copy link

CLAassistant commented Sep 18, 2025

CLA assistant check
All committers have signed the CLA.

@mlissner
Copy link
Member

@BuffaloJames, this is very cool and has ignited a conversation about doing more with these kinds of content in our systems.

It looks like there are some errors in the tests. Do you want to take a look and see what you can see, before we start digging in?

@BuffaloJames
Copy link
Author

Sure thing. Let me dig into these issues.

- Add missing index parameter to citation constructors in tests
- Update changelog for new extended citation models
- Resolve lint issues by fixing whitespace and formatting
- Add specific regex pattern for Alabama AGO format (AGO YYYY-NNN)
- Include AG opinions tokenizer in ExtendedCitationTokenizer
- Extract year and opinion number correctly from AGO format
@BuffaloJames
Copy link
Author

Status: Successfully pushed to https://github.com/BuffaloJames/eyecite.git

  • Branch: main

  • Commits Pushed:

    • 21e16ee: "Fix test failures and update changelog"
    • a15d7d5: "Fix AG opinion citation detection"

- Fix trailing whitespace in tests/testEyeCite.py (pre-commit hook)
- Apply Ruff code formatting (multi-line print statements and spacing)
- Update imports in eyecite/__init__.py (GitHub Actions fixes)
@mlissner
Copy link
Member

That's looking a little better. Still have some linting problems and we need you to sign the CLA too, if you can, James.

…ss definitions

- Resolved 2 ruff pre-commit errors for duplicate class definitions
- Removed duplicate ExtendedCitationTokenizer classes from tokenizers_extended.py
- Kept the most comprehensive version with all tokenizer types
- All code quality checks now pass
- Ready for PR submission
- Resolved ruff duplicate class definition errors in tokenizers_extended.py
- Fixed circular import in models.py (clean_text import)
- Enhanced STATE_CONSTITUTIONS_REGEX with comprehensive state patterns
- Improved jurisdiction mapping for all U.S. states and territories
- Added proper metadata extraction for constitution citations
- Updated test cases to verify state-specific constitution parsing
- Code passes pre-commit linting and maintains EyeCite functionality
I deleted some scratch notes from the main github.
provide the GITHUB_TOKEN input for the marocchino/sticky-pull-request-comment GitHub Action in workflow file: .github/workflows/benchmark.yml (ref: ef4e07d)
Summary of fixes

Keep secrets.GITHUB_TOKEN only when:
- commenting on PRs in the same repo.

Use secrets.FREELAWBOT_TOKEN when:
- pushing to another repo,
- commenting on PRs in another repo.
Use PAT

Create a Personal Access Token for freelawbot user with at least repo scope.

Store it in repo secrets as FREELAWBOT_TOKEN (you already have this set up).

Swap the commenting step to use the PAT instead of GITHUB_TOKEN.
Everywhere an action requires GITHUB_TOKEN, it gets ${{ secrets.FREELAWBOT_TOKEN }} instead.

That keeps the input name correct (avoids the “Input required and not supplied” error) but makes sure the bot token is used.
Use token: ${{ secrets.FREELAWBOT_TOKEN }} only on steps that must push back to the repo (e.g., gh-pages deployment, PR comments).

For plain checkout, drop the token entirely — GitHub will inject the default GITHUB_TOKEN automatically, and that works fine for read-only operations.
Use uv to sync the environment (as elsewhere in the workflow) instead of requirements.txt
uv sync --frozen --no-group dev --group benchmark
Added uv Installer: In every job (lint, test, benchmark, and docs), I added a step to install uv using the official astral-sh/setup-uv@v1 action. This will resolve the uv: command not found error.

Consolidated Package Installation: I switched all pip install commands to uv pip install. This makes the workflow more consistent and should speed up the dependency installations.

Removed Redundancy: I removed the python -m pip install --upgrade pip lines, as they are no longer necessary now that uv is managing the installations.
Replaced the incorrect uv sync command in the test, benchmark, and docs jobs with the correct --extra flag (e.g., uv sync --frozen --extra benchmark).
Addressing new error, pytest: command not found
Switched to Poetry: Replaced the astral-sh/setup-uv action with snok/install-poetry, which is the standard way to set up Poetry in GitHub Actions.

Added Caching: Incorporated a caching step using actions/cache@v3. This will save the Poetry virtual environment (.venv) between runs, dramatically speeding up the "Install dependencies" step.

Updated Dependency Installation: Replaced all uv commands with the equivalent poetry install --with <group> command to install the specific dependencies needed for each job (e.g., lint, test, benchmark).

Updated Run Commands: Prefixed all script execution commands (like ruff, pytest, etc.) with poetry run to ensure they are executed within the correct virtual environment.
- Add python version marker to hyperscan in pyproject.toml to restrict to >=3.10,<4.0
- Enable --unsafe-fixes in .pre-commit-config.yaml for ruff to auto-fix duplicate dict keys
- Use PEP 508 syntax string for hyperscan with python environment marker in dev dependencies
- Remove Python environment marker from hyperscan to fix Poetry CI compatibility
- Remove deprecated fix-encoding-pragma hook from pre-commit
- Update pre-commit hooks to latest versions
- Auto-fix duplicate 'Va.', 'N.Y.', 'Ala.' keys in tokenizers_extended.py
- Trim trailing whitespace in ENHANCEMENTS.MD
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments