Performance: Phase 2 - Data Processing & Metadata Optimizations #142
**Open** · dshkol wants to merge 9 commits into `master` from `performance/phase2-optimizations`
Commits (9):
- `be898ff` perf: Optimize database operations for significant performance gains (dshkol)
- `9409d9c` perf: Add metadata caching and adaptive chunk sizing optimizations (dshkol)
- `eeb8759` docs: Add comprehensive performance benchmarking and validation (dshkol)
- `6744292` docs: Add comprehensive code review of performance optimizations (dshkol)
- `6793188` resolve some test issues (mountainMath)
- `b694681` update rbuildignore (mountainMath)
- `0d2b20b` docs: Add persistent agent knowledge base (dshkol)
- `7e9150e` perf: Phase 2 performance optimizations (coordinate, date, factor, hi… (dshkol)
- `bc09129` docs: Update NEWS and agent knowledge base with Phase 2 optimizations (dshkol)
New file added (+315 lines):
# Agent Knowledge Base for cansim Package

This document captures learnings and conventions for AI agents working on the cansim R package.

Last updated: 2025-11-15

---

## Project Overview

**Package**: cansim - R package for accessing Statistics Canada data tables
**Language**: R
**Main Focus**: Database operations (SQLite, Parquet, Feather)
**Testing**: testthat 3.0+
**Maintainers**: Jens von Bergmann, Dmitry Shkolnik

---

## Key Technical Learnings

### Database Schema Conventions

**SQLite Table Naming**:
- ❌ Tables are NOT named "data"
- ✅ Tables are named `cansim_{table_number}` (e.g., `cansim_23_10_0061`)
- Always detect table names dynamically:
```r
tbl <- DBI::dbListTables(con$src$con)
tbl <- tbl[grepl("^cansim", tbl)]
```

**Field Names**:
- Can vary by language (English/French)
- Use actual schema inspection, not assumptions (see the sketch below)
- Geography fields have special handling with normalization

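A minimal inspection sketch, assuming `con` is the dbplyr connection used elsewhere in this document (hence `con$src$con`); the French geography spelling matched here is an assumption, not a verified constant:

```r
library(DBI)

# Detect the data table and its actual fields; never assume names
tbls <- dbListTables(con$src$con)
data_tbl <- tbls[grepl("^cansim", tbls)]
fields <- dbListFields(con$src$con, data_tbl)

# Match the geography column in either language; the accented French
# form is illustrative only.
geo_field <- fields[grepl("^G[EÉ]O", fields)]
```
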
### Testing Conventions

**testthat Version**: 3.0+
- ✅ Use `label` parameter for custom messages
- ❌ Don't use `info` parameter (deprecated in 3.0+)

```r
# Correct
expect_equal(x, y, label = "Description")

# Wrong (old syntax)
expect_equal(x, y, info = "Description")
```

**Data Comparison Best Practices**:
- Always compare against the reference implementation (`get_cansim()` in-memory)
- Sort by ALL dimension columns, not just REF_DATE and DGUID (see the sketch below)
- Use `get_cansim_column_list()` to get the complete dimension list
- Use the existing `count_differences()` function for comprehensive cell-by-cell comparison
- Account for trailing spaces in SCALAR_FACTOR

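A minimal comparison sketch, assuming `candidate` holds the same table collected from a cached format and that `count_differences()` compares two data frames as in the existing tests (verify its signature before relying on this):

```r
library(dplyr)

reference <- get_cansim("23-10-0061")

# Sort both sides by every dimension column before comparing; the
# dimension set is derived from the reference's own columns here,
# dropping only the value columns of the normalized output.
dims <- setdiff(names(reference), c("VALUE", "val_norm"))
reference <- arrange(reference, across(all_of(dims)))
candidate <- arrange(candidate, across(all_of(dims)))

count_differences(reference, candidate)  # expect 0
```
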
**Test Structure**:
- Skip network-dependent tests on CRAN with `skip_on_cran()`
- Skip offline tests with `skip_if_offline()`
- Use clear, descriptive test names
- Include `label` parameters for better failure messages

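Put together, a network-dependent test skeleton might look like this (table number illustrative):

```r
test_that("downloaded table has rows", {
  skip_on_cran()
  skip_if_offline()

  reference <- get_cansim("23-10-0061")
  expect_true(nrow(reference) > 0,
              label = "reference download returned rows")
})
```
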
### Performance Optimization Patterns

**Transaction Usage**:
- SQLite operations benefit from batched transactions
- Wrap multiple index creations in a single transaction (30-50% faster)
- Wrap all CSV chunk writes in a single transaction (10-20% faster)
- Always include proper rollback on errors

**Index Creation**:
- Always run `ANALYZE` after creating indexes
- This updates query planner statistics (the `sqlite_stat1` table)
- Results in 5-15% faster queries (see the sketch below)

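Both patterns combine into the following minimal sketch; `con` is assumed to be a DBI connection and `index_statements` a character vector of `CREATE INDEX` statements built elsewhere:

```r
library(DBI)

dbBegin(con)
tryCatch({
  # Batch all index creations into one transaction (30-50% faster)
  for (sql in index_statements) dbExecute(con, sql)
  dbCommit(con)
}, error = function(e) {
  dbRollback(con)  # leave the database unchanged on failure
  stop("Index creation failed: ", conditionMessage(e), call. = FALSE)
})

# Refresh query-planner statistics (populates sqlite_stat1)
dbExecute(con, "ANALYZE")
```
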
**Chunk Sizing**:
- Consider both symbol columns AND total column count
- Wide tables (>50 columns) need smaller chunks
- Maintain a minimum chunk size (10,000 rows) for efficiency

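A hypothetical helper sketching the heuristic above; the package's actual weighting may differ:

```r
# Hypothetical: shrink chunks for wide tables, never below the floor
adaptive_chunk_size <- function(n_columns, n_symbol_columns,
                                base_size = 100000, min_size = 10000) {
  size <- base_size
  if (n_columns > 50) size <- size %/% 2      # wide table: smaller chunks
  size <- size %/% max(1, n_symbol_columns)   # symbol columns add width
  max(size, min_size)                         # enforce the 10,000-row floor
}

adaptive_chunk_size(n_columns = 80, n_symbol_columns = 3)  # 16666
```
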
### Package Build Configuration

**Always exclude via .Rbuildignore**:
- Development artifacts: `^CODE_REVIEW.md`, `^ANALYSIS.md`, etc.
- Benchmark directories: `^benchmarks$`, `^benchmarks/*`
- Workflow-specific files
- Large data files for testing

**Include in package**:
- Core R code
- Tests
- Documentation (vignettes, man pages)
- Essential data files
- NEWS.md, README.md

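Entries in `.Rbuildignore` are regular expressions; `usethis::use_build_ignore()` writes correctly escaped patterns for you (file names below are examples):

```r
# escape = TRUE (the default) anchors and escapes each path
usethis::use_build_ignore(c("CODE_REVIEW.md", "ANALYSIS.md", "benchmarks"))
```
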
---

## Code Style & Conventions

### R Coding Style
- Function names: `snake_case`
- Use tidyverse patterns (dplyr, tidyr)
- Consistent indentation (2 spaces)
- roxygen2 documentation for all exported functions
- `@keywords internal` for non-exported functions

### Error Handling
- Use `tryCatch` for operations that may fail
- Provide clear, actionable error messages
- Include context in errors (which table, which operation)
- Use `warning()` for non-critical issues, `stop()` for critical ones

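The pattern in miniature: catch, add context, re-signal. `con`, `sql`, and `table_number` are assumed to exist in the calling function:

```r
tryCatch(
  DBI::dbExecute(con, sql),
  error = function(e) {
    # Re-signal with context: which table and which operation failed
    stop("Failed to create index for table ", table_number, ": ",
         conditionMessage(e), call. = FALSE)
  }
)
```
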
### Messages to Users
- Use `message()` for progress updates
- Include progress indicators for long operations: `[1/5] Processing...`
- Make messages conditional on verbosity settings where applicable

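A sketch combining both conventions; `quiet`, `chunks`, and `process_chunk()` are illustrative names:

```r
for (i in seq_along(chunks)) {
  if (!quiet) {
    message(sprintf("[%d/%d] Processing chunk...", i, length(chunks)))
  }
  process_chunk(chunks[[i]])  # hypothetical worker function
}
```
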
---

## Common Pitfalls to Avoid

### ❌ Don't:
1. **Hardcode table or field names** - always detect dynamically
2. **Assume data structure** - verify with actual schema inspection
3. **Use outdated testthat syntax** - check the package DESCRIPTION for the version
4. **Compare formats only to each other** - always include the reference implementation
5. **Skip running actual tests** - syntax validation alone is insufficient
6. **Create workflow-specific artifacts** - they clutter the repo
7. **Forget to update .Rbuildignore** when adding development files

### ✅ Do:
1. **Run `devtools::test()` or `R CMD check`** before submitting changes
2. **Check DESCRIPTION** for dependency versions
3. **Study existing test patterns** in `tests/testthat/`
4. **Verify assumptions** against actual codebase behavior
5. **Use existing helper functions** (like `count_differences()`)
6. **Update NEWS.md** for user-visible changes
7. **Add comprehensive roxygen2 documentation**

---

## Testing Checklist

Before submitting a PR:
- [ ] All R files load without syntax errors
- [ ] `devtools::test()` passes locally
- [ ] New functions have roxygen2 documentation
- [ ] Tests use `label`, not `info`, for testthat 3.0+
- [ ] Network tests have `skip_on_cran()`
- [ ] Data consistency validated across formats
- [ ] NEWS.md updated for user-visible changes
- [ ] .Rbuildignore updated for new development files
- [ ] No hardcoded assumptions about schema

---

## Performance Optimization Guidelines

### When to Optimize
- Database operations (high ROI)
- Repeated operations in loops
- I/O operations (CSV reading, network calls)
- Memory usage for wide/large tables

### How to Optimize (Conservative Approach)
1. Use transactions for batched operations
2. Run ANALYZE after index creation
3. Use adaptive chunk sizing based on table characteristics
4. Cache metadata where safe
5. Use standard database best practices

### What NOT to Optimize (Yet)
- Don't break backward compatibility
- Don't use risky techniques without thorough testing
- Don't optimize without benchmarking
- Don't add complex dependencies

### Benchmarking
- Use the `microbenchmark` package (in Suggests)
- Create reproducible benchmarks in the `benchmarks/` directory
- Test with realistic data sizes
- Document expected improvements in NEWS.md

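For example, a paired benchmark of an old and a new code path might look like this (function names illustrative):

```r
library(microbenchmark)

microbenchmark(
  old = normalize_coordinates_loop(coords),        # illustrative name
  new = normalize_coordinates_vectorized(coords),  # illustrative name
  times = 10
)
```
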
---

## Communication with Maintainers

### Commit Message Style
- Prefix: `perf:` for performance, `docs:` for documentation, `fix:` for bugs, `test:` for tests
- First line: concise summary (50 chars)
- Body: detailed explanation of what and why
- Include co-authoring attribution for AI assistance

### PR Best Practices
- Comprehensive description with performance tables
- Link to relevant documentation
- Include validation results
- Explain trade-offs and design decisions
- Keep PRs focused (one concern per PR)

### What Maintainers Value
- No breaking changes
- Comprehensive testing
- Clear documentation
- Conservative, safe optimizations
- Evidence of performance improvements

---

## Useful Commands

### Testing
```bash
# Run all tests
Rscript -e "devtools::test()"

# Run specific test file
Rscript -e "testthat::test_file('tests/testthat/test-performance_optimizations.R')"

# Check package
R CMD check .
```

### Benchmarking
```bash
# Quick validation
Rscript benchmarks/quick_validation.R

# Full benchmark suite
Rscript benchmarks/database_operations_benchmark.R
```

### Loading Functions for Testing
```r
library(dplyr)
library(DBI)
library(RSQLite)
source("R/cansim_helpers.R")
source("R/cansim_sql.R")
source("R/cansim_parquet.R")
```

---

## Project-Specific Context

### Data Sources
- Statistics Canada (StatCan) public data
- Tables identified by NDM numbers (e.g., "23-10-0061")
- Data can be large (gigabytes for census tables)
- Network downloads can be slow/unreliable

### User Expectations
- Fast access to cached data
- Minimal memory usage
- Clear progress indicators for long operations
- No breaking changes to existing code
- Bilingual support (English/French)

### Common Use Cases
1. Downloading and caching tables locally
2. Filtering data at the database level before loading into memory (see the sketch below)
3. Working with very large census tables
4. Repeated access to the same tables within an R session
5. Comparing data across different time periods

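For use case 2, a sketch assuming the package's SQLite interface (`get_cansim_sqlite()` and `collect_and_normalize()`; check the current connection API, which has evolved across versions):

```r
library(cansim)
library(dplyr)

# The filter is translated to SQL and runs inside SQLite, so only the
# matching rows are ever loaded into R.
get_cansim_sqlite("23-10-0061") %>%
  filter(GEO == "Canada") %>%
  collect_and_normalize()
```
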
---

## Notes for Future Work

### Potential Future Optimizations
- Metadata hierarchy caching (pre-computed hierarchies)
- Parallel Arrow operations (multi-threaded parquet reads)
- Connection pooling (reuse connections within a session)
- Vectorized string operations (data.table for factor conversion)

### Technical Debt to Watch
- Session-level connection cache infrastructure exists but isn't actively used yet
- Metadata file caching is write-only (could be read on reconnect)
- Progress timing could be added to index creation messages

### Known Limitations
- Network dependency for initial downloads
- Large memory usage for very wide tables (mitigated by adaptive chunking)
- Sequential batch processing (could be parallelized in future)

---

## Changelog

### 2025-11-15: Phase 1 - Database Performance (v0.4.5 PR #141)
- Learned: SQLite tables named `cansim_{number}`, not "data"
- Learned: testthat 3.0+ uses `label`, not `info`
- Learned: Always compare against the reference implementation
- Implemented: Batched index creation (30-50% faster)
- Implemented: Transaction-wrapped CSV conversion (10-20% faster)
- Implemented: ANALYZE command for query optimization
- Implemented: Adaptive chunk sizing for wide tables
- Created: Comprehensive benchmark infrastructure

### 2025-11-15: Phase 2 - Data Processing & Metadata (v0.4.5)
- Learned: `vapply` with pre-allocation is faster than `lapply %>% unlist`
- Learned: Pre-split coordinates once, reuse for all fields in a loop
- Learned: Session-level caching is excellent for repeated operations
- Learned: Recursive algorithms with memoization beat iterative loops for tree structures
- Learned: Base R `strsplit` is faster than `stringr::str_split` for simple cases
- Implemented: Vectorized coordinate normalization (30-40% faster)
- Implemented: Date format caching (70-90% faster for cached tables)
- Implemented: Pre-split coordinates for factor conversion (25-40% faster)
- Implemented: Recursive hierarchy building with memoization (30-50% faster)
- Pattern: Always analyze loop iterations; if the same operation runs N times, hoist it out (see the sketch below)

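The vapply/pre-split learning in miniature:

```r
coords <- c("1.1.1", "1.2.1", "2.1.3")

# Pre-split once outside any loop, then reuse the parts
parts <- strsplit(coords, ".", fixed = TRUE)

# vapply with a declared result type avoids the lapply() + unlist() round trip
first_level <- vapply(parts, `[[`, character(1), 1)
```
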
---

**Last Updated**: 2025-11-15 by Claude (Phase 2 Performance Optimizations)
@mountainMath this is an example of a fine-tuning instructions doc. These agents (whether Claude here, or any other agentic CLI tool) can be directed to adopt, maintain, and reference docs like this in context to better align how they work in a codebase with your expectations and requirements.