Date: 2024-04-05



This document describes the ML.VALIDATE_DATA_DRIFT function, which you can use to compute the data drift between two sets of serving data. The function computes and compares the statistics of the two data sets, and then identifies where there are anomalous differences between them. For example, you might compare the current serving data to historical serving data from a table snapshot, or to the features served at a particular point in time, which you can retrieve by using the ML.FEATURES_AT_TIME function. You can use the data output by this function for model monitoring.

Syntax

ML.VALIDATE_DATA_DRIFT(
  { TABLE `project_id.dataset.base_table` | (base_query_statement) },
  { TABLE `project_id.dataset.study_table` | (study_query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets]
    [, categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)

Arguments

ML.VALIDATE_DATA_DRIFT accepts the following arguments:

project_id: your project ID.

dataset: the BigQuery dataset that contains the table.

base_table: the name of the input table of serving data to use as the baseline for comparison.

base_query_statement: a query that generates the serving data to use as the baseline for comparison. For the supported SQL syntax of the base_query_statement clause, see GoogleSQL query syntax.

study_table: the name of the input table that contains the serving data to compare to the baseline.

study_query_statement: a query that generates the serving data to compare to the baseline. For the supported SQL syntax of the study_query_statement clause, see GoogleSQL query syntax.

num_histogram_buckets: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Applies only to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The value must be in the range [1, 1,000]. The default value is 10.

num_quantiles_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Applies only to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The value must be in the range [1, 1,000]. The default value is 10.

num_values_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Applies only to ARRAY columns. The value must be in the range [1, 1,000]. The default value is 10.

num_rank_histogram_buckets: an INT64 value that specifies the number of buckets to use for a rank histogram. Applies only to categorical and ARRAY<categorical> columns. The value must be in the range [1, 10,000]. The default value is 50.

categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.

categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are L_INFTY and JENSEN_SHANNON_DIVERGENCE.

numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical features. The value must be in the range [0, 1). The default value is 0.3.

numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.

thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].

Output

ML.VALIDATE_DATA_DRIFT returns one row for each column of input data. ML.VALIDATE_DATA_DRIFT output includes the following columns:

input: a STRING column that contains the input column name.

metric: a STRING column that contains the metric used to compare the statistics of the input column between the two data sets. The value is JENSEN_SHANNON_DIVERGENCE for numerical features, and either L_INFTY or JENSEN_SHANNON_DIVERGENCE for categorical features.

threshold: a FLOAT64 column that contains the threshold used to determine whether the statistical difference in the input column values between the two data sets is anomalous.

value: a FLOAT64 column that contains the statistical difference in the input column values between the two data sets.

is_anomaly: a BOOL column that indicates whether the value in the value column is higher than the value in the threshold column.

Example
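As a sketch of how the arguments and output columns fit together, the following query overrides the default threshold for two columns and returns only the columns flagged as anomalous, ordered by how far they drifted. The column names col_a and col_b are placeholders borrowed from the thresholds description, and the 0.25 numerical threshold is an arbitrary illustrative value; substitute names and values that match your own tables.

```sql
-- Sketch only: col_a and col_b are placeholder column names,
-- and 0.25 is an arbitrary illustrative threshold.
SELECT
  input,
  metric,
  threshold,
  value
FROM
  ML.VALIDATE_DATA_DRIFT(
    TABLE `myproject.mydataset.previous_serving_data`,
    TABLE `myproject.mydataset.serving`,
    STRUCT(
      0.25 AS numerical_default_threshold,
      [('col_a', 0.1), ('col_b', 0.8)] AS thresholds))
WHERE
  is_anomaly            -- keep only columns whose value exceeds the threshold
ORDER BY
  value DESC;           -- largest drift first
```

Columns listed in thresholds use their own threshold. All other columns fall back to the numerical or categorical default.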

The following example computes the data drift between a snapshot of the serving data table and the current serving data table, using a threshold of 0.2 for categorical features.

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `myproject.mydataset.previous_serving_data`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.2 AS categorical_default_threshold)
);

Limitations

ML.VALIDATE_DATA_DRIFT does not perform schema validation between the two input data sets, so it handles type mismatches as follows:

If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the mismatched feature isn't included in the final anomaly report.

If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.

However, when you run inference on the serving data, the ML.PREDICT function handles schema validation.
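Because the function itself does not validate schemas, one way to sidestep type mismatches is to pass query statements that cast the affected columns to a common type before the comparison. The following is a minimal sketch under that assumption. The columns user_age and country are hypothetical, and the table names are the ones used in the example in this document.

```sql
-- Sketch only: user_age and country are hypothetical columns.
-- Both inputs cast user_age to FLOAT64 so the two sides share one type.
SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  (
    SELECT
      CAST(user_age AS FLOAT64) AS user_age,
      country
    FROM `myproject.mydataset.previous_serving_data`
  ),
  (
    SELECT
      CAST(user_age AS FLOAT64) AS user_age,
      country
    FROM `myproject.mydataset.serving`
  ),
  STRUCT(0.2 AS categorical_default_threshold)
);
```

Selecting the columns explicitly also limits the comparison to the features you care about, rather than every column in the tables.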

