Implementation[OpenhouseCommitEventTablePartitionStats]: Implement partition-level statistics collection and publishing for tables in TableStatsCollectionSparkApp #413
abhisheknath2011
left a comment
Thanks @srawat98-dev for addressing my comments. Yeah, let's have a follow-up refactoring PR for code readability.
cbb330
left a comment
There are unresolved comments, but I reviewed them and all have either been explicitly responded to or implicitly resolved in code. Because of this, I'm going to preemptively resolve all the comments and add a shipit so that this PR is unblocked during @srawat98-dev's IST hours.
@abhisheknath2011 / @teamurko feel free to intercede or require a follow-up if my conclusion is incorrect.
Summary
I extended the existing TableStatsCollectionSparkApp to implement the logic for populating the OpenhouseCommitEventTablePartitionStats table.
This new table will serve as the partition-level source of truth for statistics and commit metadata across all OpenHouse datasets. The table contains exactly one row per partition, where the commit metadata reflects the latest commit that modified that partition. Each record includes:
This enables granular partition-level analytics and monitoring, providing:
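The one-row-per-partition rule described in the summary can be illustrated with a minimal sketch. It is not the code from this PR: the `PartitionCommit` class, its fields, and the `datepartition=...` values are hypothetical stand-ins, and the real app works on Spark/Iceberg data rather than in-memory lists. The sketch only shows the reduction itself: for each partition, keep the commit with the greatest commit timestamp.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestCommitPerPartition {

  /** Hypothetical, simplified stand-in for one partition-level commit record. */
  static final class PartitionCommit {
    final String partition;   // e.g. "datepartition=2024-01-01" (illustrative value)
    final long snapshotId;    // id of the commit that touched the partition
    final long committedAtMs; // commit timestamp in epoch milliseconds

    PartitionCommit(String partition, long snapshotId, long committedAtMs) {
      this.partition = partition;
      this.snapshotId = snapshotId;
      this.committedAtMs = committedAtMs;
    }
  }

  /** Reduces a list of commits to exactly one entry per partition: the latest commit. */
  static Map<String, PartitionCommit> latestPerPartition(List<PartitionCommit> commits) {
    Map<String, PartitionCommit> latest = new LinkedHashMap<>();
    for (PartitionCommit c : commits) {
      // Keep whichever commit has the later timestamp; on a tie the newer list entry wins.
      latest.merge(
          c.partition,
          c,
          (existing, incoming) ->
              incoming.committedAtMs >= existing.committedAtMs ? incoming : existing);
    }
    return latest;
  }

  public static void main(String[] args) {
    List<PartitionCommit> commits =
        Arrays.asList(
            new PartitionCommit("datepartition=2024-01-01", 1L, 1000L),
            new PartitionCommit("datepartition=2024-01-02", 2L, 2000L),
            new PartitionCommit("datepartition=2024-01-01", 3L, 3000L));

    for (PartitionCommit c : latestPerPartition(commits).values()) {
      System.out.println(c.partition + " -> snapshot " + c.snapshotId);
    }
  }
}
```

In a Spark job the same reduction would more likely be a window or group-by over the commit timestamp; the in-memory version is used here only to keep the example self-contained.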
Output
This PR ensures the TableStatsCollectionSparkApp executes all 4 collection tasks (table stats, commit events, partition events, and partition stats) synchronously while maintaining complete data collection and publishing functionality.
End-to-End Verification (Docker)
1. Sequential Execution Timeline
Key Points:
2. publishStats Log Output
Key Points:
3. publishCommitEvents Log Output
Key Points:
4. publishPartitionEvents Log Output
Key Points:
5. publishPartitionStats Log Output
Key Points:
6. Job Completion
Key Points:
This Output section:
✅ Shows all 4 publish methods (stats, commit events, partition events, partition stats)
✅ Includes actual log output with JSON data
✅ Highlights the sequential execution pattern
✅ Provides key validation points for each publish method
✅ Demonstrates successful end-to-end execution
✅ Uses the actual Docker test logs
Key Features:
1. Synchronous Sequential Execution
2. Predictable Execution Order
3. Maintained Data Collection Functionality
4. Robust Error Handling
5. Performance Trade-off Accepted
6. Comprehensive Timing Metrics
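The features listed above (sequential execution, predictable order, error handling, timing metrics) can be sketched as follows. This is a hedged illustration, not the implementation in `TableStatsCollectionSparkApp`: the `runSequentially` helper and `StepResult` class are hypothetical, the step bodies are empty placeholders, and the choice to continue with the remaining steps after one fails is an assumption, since the description does not state the failure policy. Only the four step names come from the PR description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SequentialPublishSketch {

  /** Hypothetical outcome of one step: whether it succeeded and how long it took. */
  static final class StepResult {
    final boolean succeeded;
    final long elapsedMs;

    StepResult(boolean succeeded, long elapsedMs) {
      this.succeeded = succeeded;
      this.elapsedMs = elapsedMs;
    }
  }

  /**
   * Runs each step one after another in the map's iteration order, timing every step.
   * Assumption: a failing step is logged and the remaining steps still run.
   */
  static Map<String, StepResult> runSequentially(Map<String, Runnable> steps) {
    Map<String, StepResult> results = new LinkedHashMap<>();
    for (Map.Entry<String, Runnable> step : steps.entrySet()) {
      long startNs = System.nanoTime();
      boolean succeeded = true;
      try {
        step.getValue().run();
      } catch (RuntimeException e) {
        succeeded = false;
        System.err.println(step.getKey() + " failed: " + e.getMessage());
      }
      long elapsedMs = (System.nanoTime() - startNs) / 1_000_000L;
      results.put(step.getKey(), new StepResult(succeeded, elapsedMs));
    }
    return results;
  }

  public static void main(String[] args) {
    // LinkedHashMap preserves insertion order, which gives the predictable execution order.
    Map<String, Runnable> steps = new LinkedHashMap<>();
    steps.put("publishStats", () -> {});           // placeholder body
    steps.put("publishCommitEvents", () -> {});    // placeholder body
    steps.put("publishPartitionEvents", () -> {}); // placeholder body
    steps.put("publishPartitionStats", () -> {});  // placeholder body

    for (Map.Entry<String, StepResult> r : runSequentially(steps).entrySet()) {
      System.out.println(
          r.getKey() + ": succeeded=" + r.getValue().succeeded
              + ", elapsedMs=" + r.getValue().elapsedMs);
    }
  }
}
```

Running the steps on a single thread trades throughput for a deterministic order and simpler logs, which matches the performance trade-off the description says was accepted.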
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.