diff --git a/datascan/datascan-blog-examples/blog_1_data_profile_configuration_details.md b/datascan/datascan-blog-examples/blog_1_data_profile_configuration_details.md
new file mode 100644
index 00000000..d35634ad
--- /dev/null
+++ b/datascan/datascan-blog-examples/blog_1_data_profile_configuration_details.md
@@ -0,0 +1,68 @@
+##### [How to Use Dataplex Data Profile to Unleash the Power of Your Enterprise Data](link)
+
+Detailed instructions for setting up the data profile scan for the example scenario in the blog.
+
+###### _Display Name_
+Provide a descriptive name for your profile scan. Keep it as unique as possible, since it is used to auto-generate a unique ID for your profile scan. Let’s choose `inventory items scan`.
+
+![Define Scan](./images/profile_scan_display_name.png "Define Scan")
+
+###### _Table to scan_
+Here you can directly specify the path to your BigQuery table, since your data is structured and already organized into a BigQuery table.
+
+![Table to scan](./images/profile_scan_data_source.png "Table to scan")
+
+###### _Scope_
+Here you can specify the scope of your scan. It can be either `Entire data`, where the profile scan runs on the whole table every time, or `Incremental`, where each scan starts from the end of the previous scan.
+
+Setting the scope to `Entire data` is useful if you a) don’t intend to receive more data into the source table or b) only want to run a one-off scan to get an initial summary of the data.
+
+For `Incremental` scans, since the scan needs to keep track of the last scanned row, you need to specify an unnested column of `Timestamp` or `Date` type from your source table. This should be a `Required` column whose values increase **monotonically** over time.
+
+Since, in this case, you expect the source table to be updated with ~200 rows every day and you are interested in tracking the insights from the table on a recurring basis, it makes sense to set the scope to `Incremental` and select the `ingestion_timestamp` column as the `Timestamp` column.
+
+![Scope](./images/profile_scan_scope.png "Scope")
+
+###### _Filter rows_
+You can specify a filter in the form of a SQL expression to filter the rows based on your condition. This SQL expression should be a valid expression for a WHERE clause in BigQuery standard SQL syntax. The filter is applied every time this scan runs on the source table.
+
+Let’s say the product team’s requirement is to only consider data for `distribution center id` greater than 1. The row filter condition can then be `product_distribution_center_id > 1`.
+
+You could also use row filters to filter out older data in your tables. This is particularly useful if you have large tables with legacy data that is not interesting from a monitoring perspective. Recall that the product team also wanted to ignore all the inventory data created before 2019. So an additional row filter condition can be `created_at > TIMESTAMP('2018-12-31')`.
+
+The final row filter condition will be `product_distribution_center_id > 1 and created_at > TIMESTAMP('2018-12-31')`.
+
+![Filter rows](./images/profile_scan_row_filter.png "Filter rows")
+
+###### _Filter Columns_
+You can also restrict which columns this profile scan covers. This is particularly useful if you know in advance which columns are interesting to scan.
+
+For instance, in your case, you know that the `ingestion_timestamp` column is a required Timestamp column and will not provide any useful profiling information. You can filter out this column by specifying it in the excluded column list.
+Alternatively, you could specify the columns that you want to include in the profile scan in the included column list.
+
+Here, we will exclude the `ingestion_timestamp` column, since we already know its values and use it only as the incremental `Timestamp` column.
+
+![Filter columns](./images/profile_scan_column_filter.png "Filter columns")
+
+###### _Sampling size_
+Another way to limit the data to be scanned is to specify a sampling size. If specified, the profile scan result is based on the sampled data. Sampling is applied after the above two filters are applied.
+
+Sampling is particularly useful if you expect each scan to see a large amount of data. Specifying a smaller sampling size for such data reduces cost, while choosing a sampling size appropriate for the overall data size per scan yields more accurate profile insights.
+
+Since you only expect ~200 rows to be scanned every day, you can skip configuring the sampling size for this scan.
+
+![Sampling Size](./images/profile_scan_sampling_size.png "Sampling Size")
+
+###### _Schedule_
+You can either create an `On-demand` scan, which only runs when you explicitly run it, or specify a `Schedule` to run this scan regularly at a particular time.
+
+Creating a `Repeat` scheduled scan allows you to automatically trigger a scan around a specific event such as data ingestion time. Since you expect your data to be ingested daily at 8 AM PDT, you can schedule the profile scan to run every day at 5 PM PDT. This lets you gather insights from the data daily.
+
+![Schedule](./images/profile_scan_schedule.png "Schedule")
+
+###### _Export scan results to BigQuery table_
+
+In `Additional Settings`, you can specify the path to a BigQuery table to which the profile results of each scan are exported for further analysis.
+This is particularly useful for building more advanced dashboards using Looker Studio or building downstream detection or forecasting systems leveraging BQML models on profile scan results.
+
+For this example, let's assume you want to store results in the `datascan-test-project1.datascan_blog_examples.datascan_inventory_profile_results` table, but this table doesn't exist yet. You can specify the path to the dataset `datascan-test-project1.datascan_blog_examples` and give the table name as `datascan_inventory_profile_results`. If the table doesn't exist, it will be created.
+
+![Export scan results to BigQuery Table](./images/profile_scan_export_results.png "Export scan results to BigQuery Table")
diff --git a/datascan/datascan-blog-examples/data-setup/README.md b/datascan/datascan-blog-examples/data-setup/README.md
new file mode 100644
index 00000000..0e6ca54a
--- /dev/null
+++ b/datascan/datascan-blog-examples/data-setup/README.md
@@ -0,0 +1 @@
+This directory contains scripts and setup for datascan blog examples.
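As a recap of the walkthrough in `blog_1_data_profile_configuration_details.md`, the choices made there can be collected in one place. The following is a minimal Python sketch, not a Dataplex client call: `ProfileScanSettings` and `combine_row_filters` are hypothetical helpers, and the field names are illustrative rather than Dataplex API names. It shows how the two row conditions combine into a single WHERE-clause expression and which settings are left unset (sampling) for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def combine_row_filters(conditions: List[str]) -> str:
    """Join individual WHERE-clause conditions with AND, parenthesizing each one."""
    cleaned = [c.strip() for c in conditions if c.strip()]
    return " AND ".join(f"({c})" for c in cleaned)


@dataclass
class ProfileScanSettings:
    # Hypothetical container for the walkthrough's choices; not a Dataplex type.
    display_name: str
    incremental_column: Optional[str] = None  # None stands for "Entire data"
    row_conditions: List[str] = field(default_factory=list)
    excluded_columns: List[str] = field(default_factory=list)
    sampling_percent: Optional[float] = None  # None stands for "no sampling"
    schedule: Optional[str] = None            # None stands for "On-demand"
    export_table: Optional[str] = None

    @property
    def row_filter(self) -> str:
        """Single expression suitable for the scan's row filter box."""
        return combine_row_filters(self.row_conditions)


# Values taken from the walkthrough; the schedule is kept as free text.
settings = ProfileScanSettings(
    display_name="inventory items scan",
    incremental_column="ingestion_timestamp",
    row_conditions=[
        "product_distribution_center_id > 1",
        "created_at > TIMESTAMP('2018-12-31')",
    ],
    excluded_columns=["ingestion_timestamp"],
    schedule="every day at 5 PM PDT",
    export_table=(
        "datascan-test-project1.datascan_blog_examples."
        "datascan_inventory_profile_results"
    ),
)

print(settings.row_filter)
```

Parenthesizing each condition is a defensive choice in this sketch: it keeps the combined expression unambiguous if a later condition contains its own `OR`.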
diff --git a/datascan/datascan-blog-examples/images/profile_scan_column_filter.png b/datascan/datascan-blog-examples/images/profile_scan_column_filter.png
new file mode 100644
index 00000000..8c9a62b3
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_column_filter.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_data_source.png b/datascan/datascan-blog-examples/images/profile_scan_data_source.png
new file mode 100644
index 00000000..46dd52d7
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_data_source.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_display_name.png b/datascan/datascan-blog-examples/images/profile_scan_display_name.png
new file mode 100644
index 00000000..ec23f12c
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_display_name.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_export_results.png b/datascan/datascan-blog-examples/images/profile_scan_export_results.png
new file mode 100644
index 00000000..46dd52d7
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_export_results.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_row_filter.png b/datascan/datascan-blog-examples/images/profile_scan_row_filter.png
new file mode 100644
index 00000000..805361c9
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_row_filter.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_sampling.png b/datascan/datascan-blog-examples/images/profile_scan_sampling.png
new file mode 100644
index 00000000..b9987794
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_sampling.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_sampling_size.png b/datascan/datascan-blog-examples/images/profile_scan_sampling_size.png
new file mode 100644
index 00000000..2e0fcd6d
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_sampling_size.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_schedule.png b/datascan/datascan-blog-examples/images/profile_scan_schedule.png
new file mode 100644
index 00000000..662b3f48
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_schedule.png differ
diff --git a/datascan/datascan-blog-examples/images/profile_scan_scope.png b/datascan/datascan-blog-examples/images/profile_scan_scope.png
new file mode 100644
index 00000000..a8bbeb6f
Binary files /dev/null and b/datascan/datascan-blog-examples/images/profile_scan_scope.png differ