
Conversation

@gleonard-m
Contributor

@gleonard-m gleonard-m commented Jan 22, 2026

AE-956

This docker-etl job reads configuration from an ads-owned bucket, collects DAP results, and inserts them into a BigQuery table. If the table does not exist, the job creates it.
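
For readers unfamiliar with the pattern, a minimal sketch of that read-config / create-if-missing / insert flow with the google-cloud clients might look like the following. This is illustrative only, not this PR's code; the bucket, blob, schema, and table names are placeholders.

```python
# Illustrative sketch only (not this PR's code): read JSON config from an
# ads-owned GCS bucket, then append collected results to a BigQuery table,
# creating the table first if it does not exist. All names are placeholders.
import json

from google.api_core.exceptions import NotFound
from google.cloud import bigquery, storage


def load_config(gcp_project: str, config_bucket: str, config_blob: str) -> dict:
    gcs = storage.Client(project=gcp_project)
    return json.loads(gcs.bucket(config_bucket).blob(config_blob).download_as_text())


def write_results(gcp_project: str, table_id: str, rows: list[dict]) -> None:
    bq = bigquery.Client(project=gcp_project)
    try:
        bq.get_table(table_id)  # raises NotFound when the table is missing
    except NotFound:
        schema = [bigquery.SchemaField("advertiser", "STRING")]  # placeholder schema
        bq.create_table(bigquery.Table(table_id, schema=schema))
    errors = bq.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```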

Checklist for reviewer:

  • Commits should reference a bug or github issue, if relevant (if a bug is
    referenced, the pull request should include the bug number in the title)
  • Scan the PR and verify that no changes (particularly to
    .circleci/config.yml) will cause environment variables (particularly
    credentials) to be exposed in test logs
  • Ensure the container image will be using permissions granted to
    telemetry-airflow
    responsibly.

Note for deployments: In order to push images built by this PR, the user who merges the PR
must be in the telemetry GitHub team.
This is because deploys depend on the
data-eng-airflow-gcr CircleCI context.
See DENG-8850 for additional discussion.

@gleonard-m gleonard-m requested a review from a team as a code owner January 22, 2026 18:58
Contributor

@mashalifshin mashalifshin left a comment


This code looks great! I like how you chose to name the files. Tomorrow I will grab the example config from your MARS PR, put some data in DAP, and give it some manual testing.

My main question after reading the code is how we could DRY up some of the shared code between ads-attribution-dap-collector and ads-incrementality-dap-collector. Perhaps for a follow-up PR.

LOG_FILE_NAME = f"{datetime.now()}-ads-newtab-attribution-dap-collector.log"


def write_job_logs_to_bucket(gcp_project: str, config_bucket: str):
Contributor


Maybe this function could go into the persist.py file so that main.py only contains the main function?

Contributor Author


persist.py was intended to isolate the BQ code, which is why write_job_logs_to_bucket wasn't included there.

Contributor


Ahh okay, that makes sense, I was thinking "persist" could mean more generally "persist logs to a file"...I wonder if there is some other place this could go, for easier readability of the main file.

@gleonard-m
Contributor Author

This code looks great! I like how you chose to name the files. Tomorrow I will grab the example config from your MARS PR, put some data in DAP, and give it some manual testing.

My main question after reading the code is how we could DRY up some of the shared code between ads-attribution-dap-collector and ads-incrementality-dap-collector. Perhaps for a follow-up PR.

The codebases for each job are separate (each job has its own GHA and Docker image), making DRY probably more work than it is worth at this point, but something we could look at in the future.

Contributor

@scholtzan scholtzan left a comment


This looks good from a DE perspective. The job seems to just write to ads_dap_derived.newtab_attribution_v1.

)
self.assertEqual(batch_start, date(2026, 1, 15))

def test_current_batch_end_(self):
Contributor


trailing _ on test name?

Contributor Author

@gleonard-m gleonard-m Jan 26, 2026


Yep, updated.

Note: ads keys are dynamic (source:id), so they remain a dict[str, AdModel].
"""

model_config = ConfigDict(extra="forbid")
Contributor


I interpret this to mean that if the JSON config has any extra fields, that would be an error. What would the process be for adjusting the shape of the config file / any reason to use forbid over the default ignore?

Contributor Author

@gleonard-m gleonard-m Jan 26, 2026


No reason; forbid is too tight here and would mean this job has to be updated whenever other consumers of the file require changes to the config. Updated.
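
For context, a minimal pydantic v2 sketch of the difference being discussed; the model and field names here are hypothetical, not the PR's actual schema:

```python
from pydantic import BaseModel, ConfigDict


class AdvertiserConfig(BaseModel):
    # extra="ignore" (the pydantic default) silently drops unknown fields, so
    # other consumers can add fields to the shared config without breaking this job.
    model_config = ConfigDict(extra="ignore")

    name: str
    collector_duration: int


# With extra="forbid" this call would raise a ValidationError because of "future_field".
cfg = AdvertiserConfig.model_validate(
    {"name": "nature", "collector_duration": 604800, "future_field": "ok"}
)
```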

Comment on lines 71 to 76
try:
ad_id = int(ad_id_str)
except ValueError:
raise ValueError(
f"Skipping invalid ad key '{ad_key}': ad_id '{ad_id_str}' is not an integer"
)
Contributor


The assumption that ad_id is an integer is tough because ad_id is an externally defined id space.

Contributor Author


Updated the code and table to use a string instead.
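
For illustration, a small sketch of the string-based handling (a hypothetical helper, not the updated code itself), using the source:id key format noted in the schema docstring above:

```python
# Hypothetical helper, not the PR's updated code: split a "source:id" ad key
# and keep ad_id as an opaque string, since the id space is externally defined.
def split_ad_key(ad_key: str) -> tuple[str, str]:
    source, sep, ad_id = ad_key.partition(":")
    if not sep or not source or not ad_id:
        raise ValueError(f"Invalid ad key '{ad_key}': expected 'source:id'")
    return source, ad_id
```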


def current_batch_end(batch_start: date, duration: int) -> date:
# since the start and end dates are inclusive, we need to subtract one day
return batch_start + timedelta(seconds=duration, days=-1)
Contributor


Does the days=-1 make an assumption that the collector duration is a multiple of seconds in a day? I didn't notice any check that would enforce that, so I wondered if it was lurking here.

Contributor Author


The Advertiser definition in schema.py requires the value to be
collector_duration: int = Field(gt=86399)  # 1 day - 1 sec, in seconds
to guard against the duration being specified in hours or days.
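
Putting the constraint and the date math together, a small worked sketch (the Advertiser model here is trimmed down to the one field under discussion):

```python
from datetime import date, timedelta

from pydantic import BaseModel, Field


class Advertiser(BaseModel):
    collector_duration: int = Field(gt=86399)  # at least one full day, in seconds


def current_batch_end(batch_start: date, duration: int) -> date:
    # the start and end dates are inclusive, so subtract one day
    return batch_start + timedelta(seconds=duration, days=-1)


# A 7-day (604800 s) batch starting 2026-01-15 ends on 2026-01-21, inclusive.
print(current_batch_end(date(2026, 1, 15), 604800))  # 2026-01-21
```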

Contributor

@mashalifshin mashalifshin left a comment


Glenda, I'm working through testing locally/in my dev sandbox.
Verified

  • Fails gracefully with good error when config is missing
  • Fails gracefully with good error when config is invalid

Now I'm trying to test the happy path, but I can't get it to collect results when I'd expect.
For a batch start of 1-15-2026, and a batch duration of 604800 (7 days), shouldn't it collect on process_date 1-22-2026? I've tried a couple days around that (1-23, 1-24, 1-25), and it always says

No results available for advertiser: nature with start_date: 2026-01-15 and process_date: 2026-01-22

I double-checked that the task ID I created for this experiment is in the partners list for the nature advertiser's partner id ... am I missing something, or could it be a bug?


@mashalifshin
Contributor

The codebases for each job are separate (each job has its own GHA and Docker image), making DRY probably more work than it is worth at this point, but something we could look at in the future.

Yeah, when we have time maybe we can extract a shared Python lib for the GCP stuff, and another for the DAP stuff. And also clean up some of the older jobs that (I think) this one will take precedence over.

@mashalifshin
Contributor

@gleonard-m one more thought, would you mind adding an example-config.json file somewhere in the root here, or in the README? Just a convenience for the local dev and for testing (had to grab this from the MARS PR).

@gleonard-m
Contributor Author

@gleonard-m one more thought, would you mind adding an example-config.json file somewhere in the root here, or in the README? Just a convenience for the local dev and for testing (had to grab this from the MARS PR).

Added an example to the README.md.

Contributor

@mashalifshin mashalifshin left a comment


@gleonard-m thanks for your help troubleshooting why I couldn't collect; I was indeed missing something: the date to collect for a 7-day batch that starts 1-15-2026 is actually 1-21-2026. I forgot about the adjustment Airflow makes to pass the previous date when it runs the job.

So it works like a charm! Verified

  • Advertisers with start dates in the future get gracefully skipped
  • Can successfully collect with the right process date, config file, and DAP setup
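
As a concrete worked example of the adjustment described above (a sketch only, not the actual telemetry-airflow DAG logic; the dates and duration are the ones from this thread):

```python
# Illustrative sketch of the Airflow adjustment mentioned above: a daily run
# that executes on 2026-01-22 is handed the previous day's date, so the job
# receives process_date 2026-01-21, which matches the inclusive batch end for
# a 7-day (604800 s) batch starting 2026-01-15.
from datetime import date, timedelta

run_date = date(2026, 1, 22)                 # day the DAG run actually executes
process_date = run_date - timedelta(days=1)  # date Airflow passes to the job
batch_end = date(2026, 1, 15) + timedelta(seconds=604_800, days=-1)
assert process_date == batch_end == date(2026, 1, 21)
```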

@gleonard-m
Contributor Author

@gleonard-m one more thought, would you mind adding an example-config.json file somewhere in the root here, or in the README? Just a convenience for the local dev and for testing (had to grab this from the MARS PR).

Yep, added an example to the README.md

@gleonard-m gleonard-m merged commit 8a2fb67 into main Jan 27, 2026
3 checks passed
@gleonard-m gleonard-m deleted the AE-956-etl-read-results-from-attribution-tasks-into-bq branch January 27, 2026 17:47