Multi-Table Metadata Extraction
Differentially Private Metadata Extraction
Overview
Metadata from multiple database tables can be extracted with or without Differential Privacy (DP). When DP is enabled, both value bounds and categorical information are extracted in a privacy-preserving way.
Differential Privacy Parameters
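- Epsilon (ε): Controls the privacy-accuracy tradeoff. A smaller ε gives stronger privacy at the cost of noisier, less accurate results.
- Delta (δ): Represents the probability of a privacy failure. A smaller δ means a lower chance of privacy leakage.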
For category detection and extraction to function correctly, δ must be set to a non-zero value. Setting δ to zero will disable these features.
The total DP budget is distributed across all columns in all tables and then further divided between categorical detection and bounds calculations. For databases with many columns, allocating a larger DP budget is recommended to maintain accuracy.
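To see how quickly the budget shrinks per column, here is a minimal sketch. It assumes an even split across columns and an even split between the two tasks; the even split, the splitting of δ alongside ε, and the helper name `split_budget` are illustrative assumptions, not documented behavior.

```python
def split_budget(epsilon, delta, columns_per_table):
    """Split a total (epsilon, delta) budget evenly across columns and tasks.

    Hypothetical helper: even splitting is an assumption made for illustration.
    """
    n_columns = sum(columns_per_table.values())
    if n_columns == 0:
        raise ValueError("no columns to allocate budget to")
    # Share available to a single column, counted over all tables.
    eps_col = epsilon / n_columns
    delta_col = delta / n_columns
    # Each column's share is then halved between the two tasks.
    return {
        "categorical_detection": (eps_col / 2, delta_col / 2),
        "bounds": (eps_col / 2, delta_col / 2),
    }


shares = split_budget(1.0, 1e-5, {"users": 3, "orders": 2})
```

Under these assumptions, a total ε of 1.0 over five columns leaves roughly ε = 0.1 for each task on each column, which is why wide schemas benefit from a larger total budget.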
Categorical Values and Bounds
- If a column is identified as categorical, its set of unique category values is extracted.
- Otherwise, bounds are extracted:
  - For numeric and datetime columns: minimum and maximum values.
  - For string columns: minimum and maximum string lengths.
- Only one of these (categories or bounds) is extracted per column, depending on its type.
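The rules above can be sketched without DP as a single function that returns either categories or bounds for a column. The function name `extract_column_metadata` and the `is_categorical` flag are hypothetical; how a column is actually judged categorical is not shown here.

```python
def extract_column_metadata(values, is_categorical):
    """Return categories or bounds for one column, never both (non-DP sketch)."""
    values = [v for v in values if v is not None]  # ignore missing values
    if is_categorical:
        return {"categories": sorted(set(values))}
    if not values:
        return {"bounds": None}
    if all(isinstance(v, str) for v in values):
        # String columns are bounded by length, not by the strings themselves.
        lengths = [len(v) for v in values]
        return {"bounds": (min(lengths), max(lengths))}
    # Numeric and datetime values compare directly.
    return {"bounds": (min(values), max(values))}
```

For example, a non-categorical string column holding `"ab"` and `"abcd"` yields length bounds of `(2, 4)`.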
Category Detection with DP
For category identification with DP, the following steps are performed:
1. A query is executed to calculate the distinct values and their counts.
2. A Laplace distribution is created using the given DP parameters and max_subject_references.
3. Using the threshold from the Laplace distribution, the distinct values are filtered to those whose counts exceed the threshold.
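The category detection steps can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual implementation: the threshold formula, the addition of Laplace noise to each count before filtering, and the names `laplace_threshold` and `dp_categories` are all assumptions. The input `counts` stands in for the result of the distinct-value query.

```python
import math
import random


def laplace_threshold(epsilon, delta, max_subject_references):
    """Assumed threshold: a value held by one subject passes with probability <= delta."""
    if delta <= 0:
        raise ValueError("delta must be non-zero for category detection")
    scale = max_subject_references / epsilon
    # For Laplace(0, scale), P(noise > t) = 0.5 * exp(-t / scale).
    # Setting this equal to delta gives t = scale * ln(1 / (2 * delta)).
    # One subject contributes at most max_subject_references rows, hence the offset.
    return max_subject_references + scale * math.log(1 / (2 * delta))


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_categories(counts, epsilon, delta, max_subject_references, rng=None):
    """Keep only the values whose noisy count exceeds the threshold."""
    rng = rng or random.Random()
    scale = max_subject_references / epsilon
    threshold = laplace_threshold(epsilon, delta, max_subject_references)
    return sorted(
        value
        for value, count in counts.items()
        if count + laplace_noise(scale, rng) > threshold
    )


# Hypothetical output of the distinct-value query: value -> row count.
counts = {"FR": 5000, "US": 7000, "rare-value": 1}
kept = dp_categories(counts, epsilon=1.0, delta=1e-6, max_subject_references=1,
                     rng=random.Random(0))
```

In this sketch, frequent values such as `FR` and `US` survive, while `rare-value` is dropped with probability at least 1 − δ. A δ of zero raises an error, since no finite threshold exists in that case.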
Bounds Detection with DP
For bounds identification with DP, the following steps are performed:
1. A query is created where all values are mapped into predefined bound buckets.
2. Once the values are mapped, their counts are computed in the same query (equivalent to a histogram with predefined bounds).
3. Laplace noise is added to each count with scale proportional to max_subject_references / epsilon.
4. Using the threshold from the Laplace distribution, results are filtered to those whose counts exceed the threshold.
5. Minimum and maximum values (edges) are then identified from the remaining values.
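The bounds procedure can be illustrated in Python rather than SQL. This is a minimal sketch under several assumptions: the bucket `edges`, the threshold formula, the clamping of out-of-range values, and the function name `dp_bounds` are all hypothetical, and the histogram is built in memory instead of in a database query.

```python
import bisect
import math
import random
from collections import Counter


def dp_bounds(values, edges, epsilon, delta, max_subject_references, rng=None):
    """Estimate (min, max) as the outer edges of buckets with large noisy counts."""
    if delta <= 0:
        raise ValueError("delta must be non-zero")
    rng = rng or random.Random()
    scale = max_subject_references / epsilon
    # Assumed threshold, chosen so that pure noise rarely exceeds it.
    threshold = max_subject_references + scale * math.log(1 / (2 * delta))

    # Map each value to the bucket [edges[i], edges[i + 1]) and count per bucket.
    histogram = Counter()
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        i = min(max(i, 0), len(edges) - 2)  # clamp out-of-range values (assumption)
        histogram[i] += 1

    # Add Laplace noise to each count and keep buckets above the threshold.
    surviving = []
    for i, count in histogram.items():
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if count + noise > threshold:
            surviving.append(i)

    if not surviving:
        return None  # no bucket was populated enough to release
    # Lower edge of the first surviving bucket, upper edge of the last one.
    return edges[min(surviving)], edges[max(surviving) + 1]


edges = [0, 1, 10, 100, 1000]
values = [5] * 1000 + [50] * 1000
bounds = dp_bounds(values, edges, epsilon=1.0, delta=1e-6,
                   max_subject_references=1, rng=random.Random(0))
```

Note that the released bounds are bucket edges, not the true minimum and maximum, so their precision depends on how finely the predefined buckets are spaced.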