@jeancochrane jeancochrane commented Jan 2, 2026

This PR tweaks a few data-raw scripts to add 2024 data to the pin, cpi, and eq_factor tables. I have already used this code to load the corresponding files into the testing bucket on S3.

The most complicated of these changes relates to the pin table, whose data source needs to change for 2024 following the Clerk's migration from the AS400 to iasWorld as their source-of-truth database. Rather than pull AV and exemption data from a SQL Server mirror of the AS400, as we used to do, we now pull these data from a flat file stored in S3. In future years, we may pull these data from iasWorld directly, so I did some QC work to check the flat file against iasWorld. The two mostly match, though a few thousand rows still have discrepancies that I couldn't track down. (See EI issue 395, which will investigate these discrepancies in more detail.)

Connects #59.

@jeancochrane jeancochrane changed the base branch from master to 2024-data-update January 2, 2026 20:51
@jeancochrane jeancochrane changed the base branch from 2024-data-update to jeancochrane/fix-pre-commit January 12, 2026 16:35
Base automatically changed from jeancochrane/fix-pre-commit to 2024-data-update January 14, 2026 16:17
Comment on lines +30 to +36
  # Remove footer lines that do not contain any data
  filter(
    !str_detect(
      vals,
      regex("printed by the authority|ptax-115", ignore_case = TRUE)
    )
  ) %>%

This footer appears to be new as of 2025. See here to examine it: https://tax.illinois.gov/content/dam/soi/en/web/tax/localgovernments/property/documents/cpihistory.pdf
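
For anyone who wants to try the footer filter in isolation, here is a minimal sketch. It uses base R's `grepl()` rather than the `stringr` call in the script so that it runs without extra packages, and the `cpi_lines` data frame and its contents are made-up placeholders, not real rows from the PDF.

```r
# Toy stand-in for the text lines extracted from the CPI history PDF.
# All values here are placeholders for illustration only.
cpi_lines <- data.frame(
  vals = c(
    "2023 100.0 1.0",
    "2024 101.0 1.0",
    "Printed by the authority (placeholder footer text)",
    "PTAX-115 (placeholder form number line)"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the script; matching is case-insensitive
footer_pattern <- "printed by the authority|ptax-115"
is_footer <- grepl(footer_pattern, cpi_lines$vals, ignore.case = TRUE)

# Keep only the lines that are not footer text
cpi_data <- cpi_lines[!is_footer, , drop = FALSE]
print(cpi_data)
```

Because the match is case-insensitive, the filter should keep working if the footer's capitalization changes in a later edition of the PDF.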

Comment on lines +25 to +28
# Start and end years of data to query, inclusive.
# Set these to the same value if you want to update only one year of data
start_year <- 2006
end_year <- 2024

Adding these in so that we have an easy way of skipping prior years of data whenever we do an update. This is particularly useful right now because I don't have access to the AS400 mirror, which is required in order to reproduce pre-2024 data. At some point we should get that set up, but I don't want it to block us right now.
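
A rough sketch of how the two variables gate the data sources, assuming 2024 is the first year served by the new source as described in this PR. The helper name `split_years()` and its `first_new_year` argument are hypothetical and do not appear in the script.

```r
# Hypothetical helper: split an inclusive year range into the years that
# come from the legacy AS400 mirror and the years that come from the new
# source. The 2024 cutoff is taken from the PR description.
split_years <- function(start_year, end_year, first_new_year = 2024) {
  stopifnot(start_year <= end_year)
  years <- seq(start_year, end_year)
  list(
    legacy = years[years < first_new_year],
    new = years[years >= first_new_year]
  )
}

# Full refresh: the legacy source is needed for 2006 through 2023
full <- split_years(2006, 2024)

# Single-year update: no legacy years, so the AS400 mirror is not queried
only_2024 <- split_years(2024, 2024)

if (length(only_2024$legacy) == 0) {
  message("Skipping legacy query; no years before 2024 requested")
}
```

Setting `start_year` and `end_year` to the same value therefore limits the run to one source whenever that year falls entirely on one side of the cutoff.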

# 2023. These values come from the legacy CCAO database, which mirrors the
# county mainframe.
# Only query this data if we are pulling data for years up to 2023
if (start_year <= 2023) {

There are a few conditional branches in this file that split depending on whether we're ingesting data from 2023 and earlier or from 2024 onward. I considered creating a new file dedicated exclusively to manipulating data from 2024 onward, since this file could get messy quickly if the data model changes substantially again in the future (forcing us to introduce further conditional branches based on year). For now, however, modifying this file feels like the simpler path, and I expect it's easier to review anyway.

Comment on lines +98 to +99
# This exemption is new in 2024 and does not exist in the legacy data
exe_vet_dis_100 = 0L

This is the one change I made to the legacy query (the one that covers years up to 2023).

Comment on lines +156 to +224
pin_exe_vetdis_athena <- dbGetQuery(
  ccaoathena,
  glue_sql("
    WITH long AS (
      SELECT
        det.parid AS pin,
        det.taxyr AS year,
        CASE
          WHEN
            det.excode IN ('DV1', 'C-DV1', 'DV0', 'C-DV0', 'DV-1')
            THEN 'exe_vet_dis_lt50'
          WHEN det.excode IN ('DV2', 'C-DV2', 'DV-2') THEN 'exe_vet_dis_50_69'
          WHEN det.excode IN ('DV3', 'DV3-M', 'DV-3') THEN 'exe_vet_dis_ge70'
          WHEN det.excode IN ('DV4', 'DV4-M', 'DV-4') THEN 'exe_vet_dis_100'
        END AS exe_name,
        COALESCE(cast(det.apother AS INT), 0) AS exe_amount
      FROM iasworld.exdet AS det
      INNER JOIN iasworld.exadmn AS admn
        ON det.parid = admn.parid
        AND det.caseno = admn.caseno
        AND det.taxyr = admn.taxyr
        AND det.excode = admn.excode
        AND admn.cur = 'Y'
        AND admn.deactivat IS NULL
        AND admn.exstat = 'A'
        AND (admn.user126 IS NULL OR admn.user126 = 'N')
      INNER JOIN iasworld.excode AS code
        ON det.excode = code.excode
        AND det.taxyr = code.taxyr
        AND code.cur = 'Y'
        AND code.deactivat IS NULL
      WHERE det.cur = 'Y'
        AND det.deactivat IS NULL
        AND det.excode IN (
          'DV1', 'C-DV1', 'DV0', 'C-DV0', 'DV-1',
          'DV2', 'C-DV2', 'DV-2',
          'DV3', 'DV3-M', 'DV-3',
          'DV4', 'DV4-M', 'DV-4'
        )
        AND det.taxyr >= '2024'
        AND det.taxyr <= '{end_year}'
    )
    SELECT
      pin,
      year,
      CAST(
        SUM(
          CASE WHEN exe_name = 'exe_vet_dis_lt50' THEN exe_amount ELSE 0 END
        )
        AS INT) AS exe_vet_dis_lt50,
      CAST(
        SUM(
          CASE WHEN exe_name = 'exe_vet_dis_50_69' THEN exe_amount ELSE 0 END
        )
        AS INT) AS exe_vet_dis_50_69,
      CAST(
        SUM(
          CASE WHEN exe_name = 'exe_vet_dis_ge70' THEN exe_amount ELSE 0 END
        )
        AS INT) AS exe_vet_dis_ge70,
      CAST(
        SUM(
          CASE WHEN exe_name = 'exe_vet_dis_100' THEN exe_amount ELSE 0 END
        )
        AS INT) AS exe_vet_dis_100
    FROM long
    GROUP BY pin, year
  ", .con = ccaoathena)
) %>%

Much of this query logic duplicates the logic in default.vw_pin_exe. However, that view does not provide a method for filtering by CofE (ccao-data/data-architecture#962), which is important in this context. As such, I've decided to duplicate the logic for now, and we can clean this up later once we incorporate CofE flags into the data lake.

Comment on lines +419 to +421
# and Cook Central. We have to load each parquet file individually instead of
# loading them all as a Dataset because some issue with the file metadata causes
# an esoteric error when geoarrow tries to collect the files as a Dataset

Billy and I aren't sure why this started breaking recently. However, it prevents us from using geoarrow_collect_sf(), so I refactored to load each Parquet file individually instead -- an ugly solution, but it works.
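
The general shape of the workaround, sketched in base R under some loud assumptions: `read_each_and_bind()` is a hypothetical helper, and the demo reads small CSV files with `read.csv()` purely so that the example runs without extra packages. In the actual script the reader would be whatever function reads a single Parquet file, and the pieces would be spatial data frames rather than plain ones.

```r
# Hypothetical helper: read every file on its own with `reader`, then
# row-bind the pieces, instead of opening the directory as one dataset.
read_each_and_bind <- function(paths, reader) {
  pieces <- lapply(paths, reader)
  do.call(rbind, pieces)
}

# Demo with two small CSV files standing in for the Parquet files
tmp_dir <- tempfile("parts_")
dir.create(tmp_dir)
write.csv(
  data.frame(pin = c("a", "b"), year = 2024L),
  file.path(tmp_dir, "part-1.csv"),
  row.names = FALSE
)
write.csv(
  data.frame(pin = "c", year = 2024L),
  file.path(tmp_dir, "part-2.csv"),
  row.names = FALSE
)

paths <- sort(list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE))
combined <- read_each_and_bind(paths, read.csv)
print(combined)
```

The trade-off is that each file's schema is resolved independently, so this approach assumes every file has the same columns in the same order.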
