Skip to content

Conversation

@jtuomist
Copy link
Contributor

@jtuomist jtuomist commented Nov 29, 2025

Description

I make the PR to not forget about this. But the code is not ready yet.

Screenshots/Videos (if applicable)

Add screenshots or videos demonstrating the changes if applicable.

Related issue

E.g. Link to asana, sentry, slack thread etc.

Requirements, dependencies and related PRs

Describe or link possible requirements, dependencies and related PRs here

Additional Notes

Any additional information that reviewers should know about this PR.


✅ Pre-Merge Checklist

Type of Change

  • Set the PR's label to match the nature of this change

Testing

  • Built Unit tests (unit tests added/updated)
  • Built E2E tests (if applicable. E2E tests added/updated)
  • Authorization is tested (permissions and access controls verified)
  • Manually tested locally (functionality verified)
    Manual testing instructions
    If feature requires manual testing by reviewer, you can provide instructions here.

Internationalization & Accessibility

  • New strings are translatable (all user-facing text uses i18n)
  • Accessibility standards met (WCAG compliance, screen reader support)

Dependencies

  • Dependencies are merged (if applicable. If the change depends on other PRs e.g. kausal_common)

Screenshots/Videos (if applicable)

Add screenshots or videos demonstrating the changes if applicable.

Additional Notes

Any additional information that reviewers should know about this PR.


Note

Adds per-column forward fill, replaces skip_rows with skip_row_ranges, and supports YAML templates and column range specs in config to enhance dataset transformation.

  • Core transformation:
    • Forward fill: Add ColumnSpec.forward_fill and _apply_forward_fill() to fill nulls per marked columns before processing.
    • Row skipping: Replace skip_rows with skip_row_ranges; add DatasetSchema.should_skip_row() and read all rows before skipping during iteration.
  • Config loading (load_config):
    • Support column_spec_templates and column range keys (e.g., "4-6"), merging templates with overrides.
    • Parse forward_fill flags into ColumnSpec.
    • Map legacy skip_rows to skip_row_ranges for backward compatibility.
    • Use Path.open for reading YAML.
  • Output/CLI:
    • get_summary() now includes skip_row_ranges.
    • Minor CLI return annotation (# noqa: TRY300).

Written by Cursor Bugbot for commit ac47d9d. This will update automatically on new commits. Configure here.

Comment on lines +61 to +64
for start, end in self.skip_row_ranges:
if start <= row_idx <= end:
return True
return False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ [ruff] <SIM110> reported by reviewdog 🐶
Use return any(start <= row_idx <= end for start, end in self.skip_row_ranges) instead of for loop

Suggested change
for start, end in self.skip_row_ranges:
if start <= row_idx <= end:
return True
return False
return any(start <= row_idx <= end for start, end in self.skip_row_ranges)

@kausal-code-coverage
Copy link

kausal-code-coverage bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 0% with 51 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
notebooks/convert_complex_data.py 0.00% 51 Missing ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   41.27%   41.22%   -0.05%     
==========================================
  Files         286      286              
  Lines       32928    32967      +39     
  Branches     4684     4699      +15     
==========================================
  Hits        13590    13590              
- Misses      18281    18320      +39     
  Partials     1057     1057              
Flag Coverage Δ
e2e-tests 11.85% <0.00%> (-0.02%) ⬇️
unittests 34.77% <0.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
notebooks/convert_complex_data.py 0.00% <0.00%> (ø)

Impacted file tree graph

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is being reviewed by Cursor Bugbot

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

)

# Apply forward fill before filtering rows
df = self._apply_forward_fill(df)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Forward fill applies before row filtering causes data leakage

The _apply_forward_fill is applied to all rows before skip_row_ranges filtering occurs. This causes values from rows that should be skipped (e.g., header/explanation rows) to leak into subsequent data rows via forward fill. For example, if row 0 has "Category" text and should be skipped, and row 1 has a null that gets forward-filled, the data row will incorrectly receive the header value "Category". The forward fill operation needs to happen after filtering out skipped rows, not before.

Additional Locations (1)

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants