Differentially Private Metadata Extraction

Overview

Metadata from multiple database tables can be extracted with or without Differential Privacy (DP). When DP is enabled, both value bounds and categorical information are extracted in a privacy-preserving way.

Categorical Detection:

Columns are identified as categorical using DP techniques. If detected, category values are also extracted with DP guarantees.

Value Bounds:

Numeric and datetime columns have their value ranges extracted with DP. String columns have their length bounds extracted similarly.

Differential Privacy Parameters

Epsilon (ε):

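Controls the privacy-accuracy tradeoff. A smaller ε gives a stronger privacy guarantee but adds more noise, so the extracted metadata is less accurate; a larger ε gives more accurate results with a weaker guarantee.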

Delta (δ):

Represents the probability of a privacy failure. A smaller δ means a lower chance of privacy leakage.
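
For reference, the standard textbook definition of (ε, δ)-differential privacy shows how the two parameters interact. The symbols M (the randomized mechanism), D and D' (two datasets differing in one subject's data) and S (any set of outputs) are notation introduced here for illustration:

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,] + \delta
```

With δ = 0 this reduces to pure ε-differential privacy.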

Note

For category detection and extraction to function correctly, δ must be set to a non-zero value. Setting δ to zero will disable these features.

The total DP budget is distributed across all columns in all tables and then further divided between categorical detection and bounds calculations. For databases with many columns, allocating a larger DP budget is recommended to maintain accuracy.
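
The effect of this budget distribution can be sketched as follows. This is a minimal illustration that assumes an even split across columns; the function name `split_budget` and the `categorical_share` parameter are hypothetical, and the actual allocation used by the extractor may differ.

```python
def split_budget(epsilon, delta, n_columns, categorical_share=0.5):
    """Split a total (epsilon, delta) budget per column, then per task.

    Assumption: the budget is divided evenly across columns.
    `categorical_share` is a hypothetical knob for the second split.
    """
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    if not 0.0 <= categorical_share <= 1.0:
        raise ValueError("categorical_share must be between 0 and 1")

    eps_col = epsilon / n_columns
    delta_col = delta / n_columns
    bounds_share = 1.0 - categorical_share

    return {
        "categorical": (eps_col * categorical_share, delta_col * categorical_share),
        "bounds": (eps_col * bounds_share, delta_col * bounds_share),
    }


# Total epsilon of 1.0 spread over 20 columns, split equally between tasks.
budget = split_budget(epsilon=1.0, delta=1e-5, n_columns=20)
print(budget["categorical"], budget["bounds"])
```

Under these assumptions each task is left with an ε of about 0.025 per column, which is why a larger total budget helps when the database has many columns.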

Categorical Values and Bounds

If a column is identified as categorical, its set of unique category values is extracted.
Otherwise, bounds are extracted:
- For numeric and datetime columns: minimum and maximum values.
- For string columns: minimum and maximum string lengths.
Only one of these (categories or bounds) is extracted per column, depending on whether the column is detected as categorical.

Category Detection with DP

For category identification with DP, the following steps are performed:

Distinct Value Identification

A query is executed to calculate the distinct values and their counts.

Laplace Distribution

A Laplace distribution is created using the given DP parameters and max_subject_references.

Thresholding Results

Using the threshold from the Laplace distribution, the distinct values are filtered to those whose counts exceed the threshold.
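
The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the extractor's actual code: the function names are hypothetical, the Laplace scale is taken to be max_subject_references / epsilon, the threshold is taken to be the point the noise exceeds with probability δ, and noise is added to each count before filtering.

```python
import math
import random
from collections import Counter


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_threshold(scale, delta):
    # Assumed rule: the value t with P(noise > t) = delta.
    # With delta = 0 the threshold is infinite, so nothing is released.
    if delta <= 0:
        return math.inf
    return scale * math.log(1.0 / (2.0 * delta))


def dp_categories(values, epsilon, delta, max_subject_references, rng=None):
    rng = rng or random.Random()
    scale = max_subject_references / epsilon  # assumed scale
    threshold = laplace_threshold(scale, delta)

    # Step 1: distinct values and their counts (a GROUP BY in SQL terms).
    counts = Counter(values)

    # Steps 2-3: perturb each count and keep values above the threshold.
    return [
        value
        for value, count in counts.items()
        if count + laplace_noise(rng, scale) > threshold
    ]


values = ["red"] * 1000 + ["blue"] * 800 + ["chartreuse"]
print(dp_categories(values, epsilon=1.0, delta=1e-5, max_subject_references=1))
```

In this sketch a frequent value such as "red" comes through, a value seen only once is almost always dropped, and δ = 0 returns an empty list, which is consistent with the note that category extraction is disabled when δ is zero.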

Bounds Detection with DP

For bounds identification with DP, the following steps are performed:

Query Preparation

A query is created where all values are mapped into predefined bound buckets.

Histogram

Once the values are mapped, their counts are computed in the same query (equivalent to a histogram with predefined bounds).

Laplace Noise

Laplace noise is added to each count with scale proportional to max_subject_references / epsilon.

Thresholding Results

Using the threshold from the Laplace distribution, the buckets are filtered to those whose noisy counts exceed the threshold.

Final Results

The minimum and maximum values (bucket edges) are then identified from the remaining buckets.
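
The bounds procedure can be sketched in the same way. This is a rough illustration, not the extractor's implementation: `dp_bounds` and its parameters are hypothetical names, the bucket edges are supplied by the caller, the noise scale is assumed to be exactly max_subject_references / epsilon, and the threshold is passed in explicitly because the text does not specify how it is derived for bounds.

```python
import bisect
import random


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_bounds(values, edges, epsilon, max_subject_references, threshold, rng=None):
    """Return (lower_edge, upper_edge) of the surviving buckets, or None.

    `edges` must be sorted; bucket i covers [edges[i], edges[i + 1]).
    """
    rng = rng or random.Random()
    scale = max_subject_references / epsilon  # assumed noise scale

    # Query preparation + histogram: map each value to its bucket and count.
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        if 0 <= i < len(counts):  # values outside the edges are ignored
            counts[i] += 1

    # Laplace noise + thresholding: keep buckets whose noisy count passes.
    kept = [
        i for i, c in enumerate(counts)
        if c + laplace_noise(rng, scale) > threshold
    ]
    if not kept:
        return None

    # Final results: outer edges of the lowest and highest surviving buckets.
    return edges[kept[0]], edges[kept[-1] + 1]


edges = [0, 10, 100, 1000, 10000]
values = [50] * 500 + [500] * 500 + [5000]  # one outlier at 5000
print(dp_bounds(values, edges, epsilon=1.0, max_subject_references=1, threshold=50.0))
```

Because the single outlier's bucket does not pass the threshold, the reported range covers only the well-populated buckets. The result is therefore a pair of bucket edges, not the exact minimum and maximum of the data.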
