feat(smoking): Add smoking variable harmonization (CEP-002) #163
DougManuel wants to merge 9 commits into v3
Foundation for modular derived variable calculations:
- clean_variables(): Step 1 & 3 preprocessing/validation
- missing-data-functions.R: any_missing(), get_priority_missing()
- missing-pattern-cache.R: Pattern detection for PUMF/Master codes
- parse-range-notation.R: Range parsing for validation bounds
- worksheet-getters.R: get_variable_details() metadata access
- worksheet-loaders.R: load_worksheet_metadata()
- file-sourcing.R: source_r_robust() for dependency loading
- variable-discovery.R: Variable lookup utilities

This infrastructure supports the new 3-step pattern:
1. clean_variables() - preprocess inputs
2. Domain logic - status-based calculations
3. clean_variables() - validate output bounds
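As a rough illustration of the 3-step pattern described in this commit, the sketch below wraps a toy derived variable in a clean, compute, clean sequence. `clean_sketch()` and `calculate_years_smoked_sketch()` are hypothetical stand-ins written for this example; the actual signature of `clean_variables()` is not shown in this conversation, and the bounds are illustrative rather than taken from variable_details.csv.

```r
# Hypothetical stand-in for clean_variables(): coerce to numeric and set
# out-of-range values to NA. The real function's signature is assumed unknown.
clean_sketch <- function(x, min, max) {
  x <- as.numeric(x)
  x[!is.na(x) & (x < min | x > max)] <- NA_real_
  x
}

# Hypothetical derived variable: years smoked = current age - age started.
calculate_years_smoked_sketch <- function(age, age_start) {
  # Step 1: preprocess inputs (illustrative bounds, not the worksheet's)
  age <- clean_sketch(age, min = 12, max = 102)
  age_start <- clean_sketch(age_start, min = 5, max = 95)

  # Step 2: domain logic
  years <- age - age_start

  # Step 3: validate output bounds
  clean_sketch(years, min = 0, max = 90)
}

result <- calculate_years_smoked_sketch(
  age = c(40, 30, 50),
  age_start = c(18, 35, 996)  # 996 mimics a survey missing code
)
print(result)
```

The second respondent is dropped at step 3 (negative duration) and the third at step 1 (missing code outside the input bounds), which is the reason for running the same cleaning step on both sides of the domain logic.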
Primary recommended variables using 3-step architecture:
- smoking-status.R: calculate_SMKDSTY_A(), calculate_SMKDSTY_cat6()
- smoke-start.R: calculate_age_start_smoking() - unified initiation age
- smoking-cessation.R: calculate_time_quit_smoking() - years since quit
- smoke-intensity.R: calculate_cigs_per_day() - routes SMK_204/SMK_208
- smoke-pack-years.R: calculate_pack_years() - cumulative exposure
- smoke-stop.R: Supporting cessation logic
- smoking-validation-constants.R: PACK_YEARS_CONSTANTS

Key design decisions:
- Single calculate_pack_years() works for both PUMF and Master
- Unified feeders (age_start_smoking, cigs_per_day, time_quit_smoking) handle PUMF vs Master routing internally
- PUMF has ~15-20% relative error due to midpoint estimation
- Era-agnostic: handles 2001-2023 variable naming variations

See ceps/cep-002-smoking/ for full specification.
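To make the midpoint-estimation point in the design decisions concrete, here is a minimal sketch using the textbook pack-years definition (packs per day times years smoked, with 20 cigarettes per pack). It is not the PR's `calculate_pack_years()`, which also handles status routing, missing codes, and bounds. `pack_years_sketch()`, `midpoint()`, and the category bounds in `age_start_bounds` are all made up for illustration.

```r
# Simplified pack-years: (cigarettes per day / 20) * years smoked.
# Illustrative only; not the calculate_pack_years() added in this PR.
pack_years_sketch <- function(cigs_per_day, years_smoked) {
  (cigs_per_day / 20) * years_smoked
}

# PUMF-style input: age started is released as a category, so a midpoint
# stands in for the exact age. These category bounds are hypothetical.
age_start_bounds <- data.frame(
  category = 1:3,
  lower = c(5, 12, 15),
  upper = c(11, 14, 19)
)

midpoint <- function(category) {
  idx <- match(category, age_start_bounds$category)
  (age_start_bounds$lower[idx] + age_start_bounds$upper[idx]) / 2
}

age <- 45
exact_age_start <- 19          # what a continuous (Master-style) field could hold
pumf_age_start <- midpoint(3)  # midpoint of the hypothetical 15-19 category

exact <- pack_years_sketch(20, age - exact_age_start)
approx <- pack_years_sketch(20, age - pumf_age_start)
relative_error <- (approx - exact) / exact
print(c(exact = exact, approx = approx, relative_error = relative_error))
```

The size of the error in this toy case depends entirely on the invented category width; it is not meant to reproduce the ~15-20% figure quoted in the commit message.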
Worksheets for smoking variable harmonization:
- smoking_variables.csv: Variable definitions for smoking domain
- smoking_variable_details.csv: Recoding rules and mappings

Covers 5 subgroups:
- 01-status: SMKDSTY_cat6 (6-category smoking status)
- 02-initiation: age_start_smoking (age started daily)
- 03-cessation: time_quit_smoking (years since quit)
- 04-intensity: cigs_per_day (cigarettes per day)
- 05-pack-years: pack_years_der (cumulative exposure)

Supports all PUMF cycles 2001-2022.
Tests for primary recommended smoking variables:
- test-age_start_smoking.R: Initiation age routing and bounds
- test-cigs_per_day.R: Intensity routing by smoking status
- test-pack_years.R: Cumulative exposure calculation
- test-time_quit_smoking.R: Cessation timing validation

Tests verify:
- Correct routing based on SMKDSTY_A status
- Valid output ranges per variable_details.csv bounds
- Missing value handling (tagged_na vs numeric)
- Universe validation (correct NA for out-of-scope respondents)
Quarto documentation for smoking variable harmonization:

Main documents:
- cep-002-smoking.qmd: Methodology and rationale
- 00-variable-summary.qmd: Variable overview and recommendations
- derived-functions.qmd: DV function specifications

Subgroup specifications (QMD + worksheet CSVs):
- 01-status: SMKDSTY_cat6 (6-category smoking status)
- 02-initiation: age_start_smoking
- 03-cessation: time_quit_smoking
- 04-intensity: cigs_per_day
- 05-pack-years: pack_years_der

Rendered site: https://dmanuel.quarto.pub/cep-002-smoking-variables
Updates variables.csv and variable_details.csv with smoking harmonization:
- 34 existing smoking variables updated with v3 definitions
- 19 new smoking variables added
- Extends coverage to 2022-2023 Master files
- Adds unified feeder variables (age_start_smoking, cigs_per_day, etc.)

Removes separate smoking_*.csv files (now merged into main worksheets).

Variables: 360 → 379 (+19 new)
Variable details: 3468 → 3678 (+210 net, replacing 500 with 710 improved rows)
Pull request overview
This PR adds comprehensive smoking variable harmonization for CCHS cycles 2001-2023, implementing CEP-002. It introduces 19 new variables and updates 34 existing ones, extending coverage to Master files and establishing a new 3-step derived variable architecture.
Changes:
- New smoking variables across 5 subgroups (status, initiation, cessation, intensity, pack-years)
- Extended cycle coverage to PUMF 2001-2023 and Master 2001-2023
- New unified derived variable functions with standardized architecture
- Comprehensive test suite for derived variables
- Full CEP documentation in Quarto format
Reviewed changes
Copilot reviewed 38 out of 42 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/testthat/test-time_quit_smoking.R | Unit tests for cessation timing calculation (337 lines) |
| tests/testthat/test-pack_years.R | Unit tests for pack-years derivation (386 lines) |
| tests/testthat/test-cigs_per_day.R | Unit tests for cigarettes per day routing (632 lines) |
| tests/testthat/test-age_start_smoking.R | Unit tests for initiation age calculation (353 lines) |
| ceps/cep-002-smoking/derived-functions.qmd | Function reference documentation (448 lines) |
| ceps/cep-002-smoking/cep-002-smoking.qmd | Main methodology and rationale (598 lines) |
| ceps/cep-002-smoking/05-pack-years.qmd | Pack-years variable documentation (276 lines) |
| ceps/cep-002-smoking/04-intensity.qmd | Intensity variable documentation (497 lines) |
| ceps/cep-002-smoking/03-cessation.qmd | Cessation variable documentation (555 lines) |
| ceps/cep-002-smoking/02-initiation.qmd | Initiation variable documentation (808 lines) |
| ceps/cep-002-smoking/00-variable-summary.qmd | Variable summary table (266 lines) |
| CSV worksheets (multiple) | Variable definitions and recoding rules |
| R/smoking-validation-constants.R | Validation constants for smoking functions |
inst/extdata/variables.csv
Outdated
| "REP_5A","Repetitive strain injury - walking","Repetitive strain injury - type of activity - walking","Categorical","cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_s","cchs2015_2016_p::INJ_020A, cchs2017_2018_p::INJ_020A, [REP_5A]","Repetitive strain injury","Health status","N/A",NA,"","2.2.0","2025-06-30","Variable metadata completed","",NA,"active",NA | ||
| "REP_5B","Repetitive strain injury - sport/physical activity","Repetitive strain injury - type of activity - sports or physical exercise","Categorical","cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_s","cchs2001_p::REPA_4A, cchs2003_p::REPC_4A, cchs2005_p::REPE_4A, cchs2007_2008_p::REP_4A, cchs2015_2016_p::INJ_020B, cchs2017_2018_p::INJ_020B, [REP_5B]","Repetitive strain injury","Health status","N/A",NA,"","2.2.0","2025-06-30","Variable metadata completed","",NA,"active",NA | ||
| "REP_5C","Repetitive strain injury - leisure/hobby","Repetitive strain injury - type of activity - leisure or hobby","Categorical","cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_s","cchs2001_p::REPA_4B, cchs2003_p::REPC_4B, cchs2005_p::REPE_4B, cchs2007_2008_p::REP_4B, cchs2015_2016_p::INJ_020C, cchs2017_2018_p::INJ_020C, [REP_5C]","Repetitive strain injury","Health status","N/A",NA,"","2.2.0","2025-06-30","Variable metadata completed","",NA,"active",NA | ||
| "REP_5D","Repetitive strain injury - |
SMK_09C (Years since stopped smoking daily - former daily) available as SMK_090 for 2015-2016 and 2017-2018
Thanks for catching this.
Issue confirmed: The 2015 CCHS redesign renamed the cessation timing variables:
SMK_09C → SMK_090 (years since stopped smoking daily)
Also fixed:
SMK_06C → SMK_070 (years since stopped smoking - former occasional)
The worksheets were using [SMK_09C] and [SMK_06C] bracket fallback notation, which only works when the source variable name matches. For 2015+ cycles, this would fail silently because the variables are named SMK_090 and SMK_070.
Fix applied: Added explicit mappings for 2015-2021 cycles to both variables in variables.csv:
SMK_09C: cchs2015_2016_m::SMK_090, cchs2017_2018_m::SMK_090, cchs2019_2020_m::SMK_090, cchs2021_m::SMK_090
SMK_06C: cchs2015_2016_m::SMK_070, cchs2017_2018_m::SMK_070, cchs2019_2020_m::SMK_070, cchs2021_m::SMK_070
Validation: The fixes have been verified against the source worksheet structure. The [SMK_09C] and [SMK_06C] fallback now only applies to 2007-2014 cycles where those names are correct.
Prevention: Updated the skill documentation with the 2015 cessation variable rename patterns and added a mandatory validation step to catch this class of error in future worksheet authoring.
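For readers unfamiliar with the bracket fallback, the sketch below shows the failure mode described above. `resolve_variable_start()` is a hypothetical helper written for this comment, not the package's actual parser; it assumes explicit `cycle::name` entries take priority and that `[name]` means "use this column name as-is".

```r
# Hypothetical helper (not part of cchsflow): pick the source column for one
# cycle from a variableStart string such as
# "cchs2015_2016_m::SMK_090, cchs2017_2018_m::SMK_090, [SMK_09C]".
resolve_variable_start <- function(variable_start, cycle) {
  entries <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])

  # Explicit "cycle::name" mappings are checked first.
  explicit <- entries[grepl("::", entries, fixed = TRUE)]
  for (entry in explicit) {
    parts <- strsplit(entry, "::", fixed = TRUE)[[1]]
    if (identical(parts[1], cycle)) {
      return(gsub("^\\[|\\]$", "", parts[2]))
    }
  }

  # Bracket fallback "[name]": assumes the source column carries this name.
  fallback <- entries[grepl("^\\[.*\\]$", entries)]
  if (length(fallback) == 1) {
    return(gsub("^\\[|\\]$", "", fallback))
  }
  NA_character_
}

before_fix <- "[SMK_09C]"
after_fix <- paste(
  "cchs2015_2016_m::SMK_090, cchs2017_2018_m::SMK_090,",
  "cchs2019_2020_m::SMK_090, cchs2021_m::SMK_090, [SMK_09C]"
)

# Before the fix, a 2015-2016 lookup resolves to a column that was renamed.
print(resolve_variable_start(before_fix, "cchs2015_2016_m"))
# After the fix, the explicit mapping wins; older cycles still use the fallback.
print(resolve_variable_start(after_fix, "cchs2015_2016_m"))
print(resolve_variable_start(after_fix, "cchs2013_2014_m"))
```

Under these assumptions the pre-fix lookup returns `SMK_09C` for a 2015-2016 file, a name that does not exist there, which is why the recode produced no error and no data.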
The 2015 CCHS redesign renamed cessation timing variables:
- SMK_09C -> SMK_090 (years since stopped daily)
- SMK_06C -> SMK_070 (years since stopped occasional)

The [SMK_09C] and [SMK_06C] bracket fallback doesn't work for 2015+ cycles where the variables have different names. Added explicit mappings for cchs2015_2016_m through cchs2021_m. Fixes issue identified in PR #163 review.
Rewrote CSV files with proper quoting using readr::write_csv() to fix "excessive quoting" errors in the CSV formatting check.
| "SMK_10A_cont","Yrs quit (reducer)","Years since quit completely (former daily who continued occasional, continuous)","Continuous","cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2011_2012_p, cchs2013_2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2003_m, cchs2005_m, cchs2007_2008_m, cchs2009_2010_m, cchs2011_2012_m, cchs2013_2014_m, cchs2015_2016_m, cchs2017_2018_m, cchs2019_2020_m, cchs2021_m, cchs2023_m","cchs2003_p::SMKC_10A, cchs2003_m::SMKC_10A, cchs2005_p::SMKE_10A, cchs2005_m::SMKE_10A, cchs2015_2016_p::SMK_100, cchs2015_2016_m::SMK_100, cchs2017_2018_p::SMK_100, cchs2017_2018_m::SMK_100, cchs2019_2020_m::SMK_100, cchs2021_m::SMK_100, cchs2023_m::SPU_35, [SMK_10A]","smoking","Health behaviour","years","Universe: former daily who continued occasional. Converts categorical SMK_10A/SPU_35 to pseudo-continuous using midpoint. Not available 2001, 2022 (no categorical - uses SPU_35A/B month/year). Use SMKDVSTP for 2022.","Pseudo-continuous years since quit completely for former daily who reduced to occasional. Derived from categorical using midpoint conversion.","3.0.0-alpha","2026-01-11","Removed cchs2022_m - 2022 uses month/year (SPU_35A/B) not categorical. Added cchs2023_m::SPU_35.",,,"active" | ||
| "SMK_10_gate","Quit gate","Quit completely when stopped daily (gate variable)","Categorical","cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2011_2012_p, cchs2013_2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2003_m, cchs2005_m, cchs2007_2008_m, cchs2009_2010_m, cchs2011_2012_m, cchs2013_2014_m, cchs2015_2016_m, cchs2017_2018_m, cchs2019_2020_m, cchs2021_m, cchs2022_m, cchs2023_m","cchs2003_p::[SMKC_10], cchs2005_p::[SMKE_10], cchs2007_2008_p::[SMK_10], cchs2009_2010_p::[SMK_10], cchs2011_2012_p::[SMK_10], cchs2013_2014_p::[SMK_10], cchs2015_2016_p::[SMK_095], cchs2017_2018_p::[SMK_095], cchs2003_m::[SMKC_10], cchs2005_m::[SMKE_10], cchs2007_2008_m::[SMK_10], cchs2009_2010_m::[SMK_10], cchs2011_2012_m::[SMK_10], cchs2013_2014_m::[SMK_10], cchs2015_2016_m::[SMK_095], cchs2017_2018_m::[SMK_095], cchs2019_2020_m::[SMK_095], cchs2021_m::[SMK_095], cchs2022_m::[SPU_30], cchs2023_m::[SPU_30]","smoking","Health behaviour","N/A","Universe: former daily smokers. Gate for quit pathway: 1=quit when stopped daily, 2=continued occasional. Not available 2001.","Determines if respondent quit completely when they stopped daily smoking or continued occasional smoking.","3.0.0-alpha","2026-01-04","Labels updated 2026-01-04. Universe context added.",,,"active" | ||
| "quit_pathway","Quit pathway","Smoking cessation pathway (direct, gradual, or former occasional)","Categorical","cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2011_2012_p, cchs2013_2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2003_m, cchs2005_m, cchs2007_2008_m, cchs2009_2010_m, cchs2011_2012_m, cchs2013_2014_m, cchs2015_2016_m, cchs2017_2018_m, cchs2019_2020_m, cchs2021_m, cchs2022_m, cchs2023_m","DerivedVar::[SMKDSTY_cat5, SMK_10_gate]","smoking","Health behaviour","N/A","Derived from SMKDSTY_cat5 and SMK_10_gate. Not available 2001 (no SMK_10_gate). {recommended:secondary}","Categorical indicator: 1=direct quit (quit when stopped daily), 2=gradual quit (continued occasional then quit), 3=former occasional (never daily)","3.0.0-alpha","2026-01-04","Labels updated 2026-01-04",,,"active" | ||
| "time_quit_smoking","Yrs quit smoking*","Years since quit smoking","Continuous","cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2011_2012_p, cchs2013_2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2019_2020_p, cchs2001_m, cchs2003_m, cchs2005_m, cchs2007_2008_m, cchs2009_2010_m, cchs2011_2012_m, cchs2013_2014_m, cchs2015_2016_m, cchs2017_2018_m, cchs2019_2020_m, cchs2021_m, cchs2022_m, cchs2023_m","DerivedVar::[SMK_09A_cont, SMK_06A_cont, SMKDVSTP]","smoking","Health behaviour","years","{recommended:primary} {sub_subject:cessation} Universe: all former smokers. Routes: PUMF uses SMK_09A_cont/SMK_06A_cont (midpoint-converted); Master 2003+ uses SMKDVSTP (StatCan derived continuous).","Unified continuous years since quit smoking. PUMF: midpoint from categorical; Master: SMKDVSTP pass-through.","3.0.0-alpha","2026-01-11","Fixed: Changed SPU_25 to SMKDVSTP for Master 2022/2023 (SPU_25 is categorical, SMKDVSTP is continuous).",,,"active" |
time_quit_smoking only exists for PUMF data at the moment
The R/table-generators.R file was accidentally dropped during an auto-stash merge. Recovered from stash index c5de257. Required by all CEP-002 smoking QMD files for rendering summary tables.
yulric left a comment
I only reviewed the ceps/cep-002-smoking/cep-002-smoking.qmd file since it seemed like a good starting point to understand the what/why/how of this PR. However there are issues that are making it hard to answer those questions:
- Too many implementation details like all the tables that summarize the different variables.
- Mistakes regarding things that are not implemented, implemented in a different way, etc.
- Not really ordered in a way that explains each issue with smoking, why it's an issue, and the design decisions to fix it.
I've put comments in the file but let me know if you need help and I can fix up the file.
| Smoking is the leading preventable cause of death in Canada. Researchers need:
|
| - **Smoking status** for exposure classification
| - **Pack-years** for dose-response analysis
Change pack-years to intensity?
| - **Cessation timing** for intervention studies
| - **Cross-cycle consistency** for trend analysis
|
| Current cchsflow has only 4 smoking intensity variables. This CEP adds 47 new variables including the critical `pack_years_der` derived variable.
Is there a reason you're focusing on smoking intensity in this sentence as opposed to the total number of smoking variables? Also, I count 3 smoking intensity variables in the old variables sheet: pack years, number of cigarettes smoked daily, and number of months smoked one or more cigarettes in the past year.
There's a discrepancy between this file and the 00-variable-summary.qmd file in the number of new variables added. This file says 47 but the 00-variable-summary.qmd file says 19.
pack_years_der is already in the original cchsflow, so it would be more appropriate to say "including improvements to the pack_years_der derived variable".
| │ │
| ├──► 02-initiation │
| │ │
| ├──► 03-cessation ─────────┤
This line from cessation to pack-years does not seem necessary since there's already one at the bottom.
| | 03-cessation | 19 | 16 | 3 | Quit timing and duration |
| | 04-intensity | 4 | 4 | 0 | Cigarettes per day |
| | 05-pack-years | 2 | 2 | 0 | Derived cumulative exposure |
| | **Total** | **51** | **43** | **8** | |
The total count differs from what's in the 00-variable-summary.qmd file.
There's content in this file that I think is more appropriate for the 00-variable-summary.qmd file, for example the variable count by subgroup table, the cycle coverage summary table, and the era-specific variable mappings table. This file should focus more on the problems with smoking variables and how they will be addressed.
|
| ## Validation approach
|
| ### L0-L5 validation levels
This seems like it's outside the scope of smoking and is more a general validation framework?
|
| ### Integration testing with PUMF data
|
| L5 validation tests that recoding works on actual CCHS PUMF data:
Does this check expected vs. actual values? Or does it just run through the recoding and assume everything is good if no errors appear?
| print_integration_result(result)
| ```
|
| **What PUMF integration catches that DDI validation misses:**
I'm not sure how it does this; more information about the process would help.
|
| PUMF data location: `~/github/cchsflow-data/data/archive/cchs_odessi_archive/`
|
| ## Implementation plan
This doesn't seem right, and it should not be in this document anyway, since this is meant to be a design document.
|
| - Existing smoking variables (`SMK_204`, `SMK_208`, etc.) continue to work unchanged
| - New `_cont` variables are additions, not replacements
| - `pack_years_der` is a new derived variable
Not true since it exists already
Summary
Adds smoking variable harmonization for CCHS cycles 2001-2023, extending coverage to Master files and introducing unified derived variable functions using a new 3-step architecture.
Variables added (19 new)
- cigs_per_day, SMK_202, SMK_203, SMK_207
- SMK_09A, SMK_10_gate, SMK_10A_A, SMK_10A_B, SMK_10A_cont, quit_pathway
- SMKDSTY, SMKDGSTP, SMKDGSTP_cont, SMKDVSTP
- SMK_01C, SMK_040, SMK_06C, SMK_09C, SMKG09C_cont
Variables updated (34 existing)
- variableStart mappings for era-specific source variable names
- pack_years_cat, pack_years_der, time_quit_smoking, SMKDSTY_A, SMKDSTY_B, SMKDSTY_cat3, SMKDSTY_cat5, SMKG040, SMKG040_cont, SMKG203_A/B/cont, SMKG207_A/B/cont, SMK_005, SMK_01A, SMK_030, SMK_05B, SMK_05C, SMK_05D, SMK_06A_A/B/cont, SMK_09A_A/B/cont, SMK_204, SMK_208, SMKG01C_A/B/cont, SMKG06C, SMKG09C
Cycle coverage
3-step derived variable architecture
New modular pattern for derived variables:
- clean_variables() preprocesses and validates inputs
Supporting infrastructure added:
- R/clean-variables.R, R/missing-data-functions.R, R/missing-pattern-cache.R
- R/worksheet-getters.R, R/worksheet-loaders.R, R/variable-discovery.R
Tests added
- test-age_start_smoking.R
- test-cigs_per_day.R
- test-pack_years.R
- test-time_quit_smoking.R
Documentation
- Full specification: CEP-002 Smoking Variables
- Source QMD files included in ceps/cep-002-smoking/
Test plan
- R CMD check
- testthat::test_file("tests/testthat/test-pack_years.R")