Pandas | Antigranular Enterprise Docs

Requirements

Before starting the guide, finish the First Steps to Use AGENT guide.

Overview

AGENT provides a comprehensive guide to present all the main functionalities of op_pandas. The guide is divided in four parts, to have a better experience, follow the proposed order:

Importing Data

Learn how to create PrivateDataFrames and PrivateSeries from various sources including AGENT server datasets, pandas objects, and local Jupyter sessions.

Managing Data

Explore data viewing, cleaning, and selection methods for PrivateDataFrames while maintaining privacy constraints and applying transformations.

Operations

Perform unary, binary, and bitwise operations on PrivateDataFrames and PrivateSeries with element-wise transformations and arithmetic calculations.

Functions, Joins and Statistical Methods

Master advanced functionality including concatenation, train-test splits, joins, merges, WHERE queries, and differentially private statistical calculations.

Pandas is a widespread open-source data manipulation and analysis library for Python. It provides data structures and functions to efficiently handle and manipulate structured data, such as tables or time series. Pandas offers powerful data cleaning, transformation, filtering, merging, and aggregation tools. It is widely used in data science, machine learning, and other domains for data preprocessing and analysis tasks, making it a valuable tool for working with structured data in Python.

AGENT provides a differentially private version of the Pandas Library (op_pandas), which lets users handle private data frames and series and perform various statistical analyses with differential privacy guarantees. Users familiar with Pandas will find minimal difficulty adjusting to the API methods.

To use AGENT’s op_pandas, you can import the library as presented in the following code block:

%%ag
from op_pandas import PrivateDataFrame, PrivateSeries

Info

Private datasets can be loaded as PrivateDataFrames and PrivateSeries.

Resources

The following are additional resources that will be helpful when using the Pandas library:

Official Pandas Documentation
API Reference: In addition to the comprehensive guide, you can also check reference pages for PrivateDataFrame, PrivateSeries, General Methods

Importing Data

The op_pandas library allows users to import datasets efficiently. This page will showcase some of the available ways to do so.

Requirements

Before continue following the steps described in this page, be sure to have finished the First Steps to Use AGENT

Object Creation

Users can import and load datasets from different sources, such as:

Importing from the AGENT Jupyter Server

The load_dataset() lets users obtain a dataset and the required data structures from the Antigranular server.

info

Private data structures cannot be exported to the local environment. Unless a differentially private measure is applied to obtain a non-private data frame.

You can use the load_dataset() function to load any dataset, as shown in the following code block:

%%ag
from op_pandas import PrivateDataFrame, PrivateSeries

# Obtaining the dictionary containing private objects
response = load_dataset("<dataset_name>", "<team_name>")

# Response will be a PDF, and will be using the budget allocated to the user from "<team_name>" team.

Importing from a pandas.Series

When creating a PrivateSeries, it's recommended to set metadata bounds to define the range of valid values for the series. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.

See an example in the following code block:

%%ag
import pandas as pd
s = pd.Series([1,5,8,2,9] , name='Test_series')
priv_s = PrivateSeries(series=s,metadata=(0,10))

Where:

A pandas Series named 'Test_series' is created with the values [1, 5, 8, 2, 9].
Using the PrivateSeries constructor, we create a private series (priv_s) from the regular pandas Series s.
Metadata bounds (0, 10) are set, ensuring that all values in the series fall within the range from 0 to 10.

By setting metadata bounds, you control the valid range of values within the series, enhancing privacy and security while working with sensitive data.

Importing from a pandas.DataFrame

Just as with PrivateSeries, setting metadata bounds when creating a PrivateDataFrame is recommended. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.

See an example in the following code block:

%%ag
import pandas as pd
data = {
    'Age':[20,30,40,25,30,25,26,27,28,29],
    'Salary':[35000,60000,100000,55000, 35000,60000,100000,55000,35000,60000],
    'Sex':['M','F','M','F', 'M','F','M','F', 'M', 'F']
}
metadata = {
    'Age':(18,65),
    'Salary':(20000,200000)
}
categorical_metadata = {
    'Sex':['M','F']
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df , metadata=metadata, categorical_metadata=categorical_metadata)

In the example:

Data for the DataFrame is defined, including columns for 'Age', 'Salary', and 'Sex'.
Metadata bounds are specified for each column:
- For the 'Age' column, the valid range is set from 18 to 65.
- For the 'Salary' column, the valid range is set from 20000 to 200000.
Categorical Metadata is specified for 'Sex' column.
A pandas DataFrame (df) is created using the defined data.
Using the PrivateDataFrame constructor, a privateDataFrame (priv_df) is created from the pandas.DataFrame df, with specified metadata bounds.

By setting metadata bounds, users can ensure that each column in the dataframe contains values within predefined limits, enhancing data integrity and security.

Importing from the local Jupyter session

Users can import external data from their local Jupyter session within the AG environment. This allows seamless data integration into the AG environment while maintaining privacy and security.

See an example below:

Random data is generated to create two pandas DataFrames, df and df_2, representing different datasets.

import pandas as pd
import numpy as np
import string
import random

# Generate random names, ages, and salaries for the DataFrame
arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
    res = ''.join(random.choices(string.ascii_lowercase, k=N))
    arr_name.append(res)

# Create a DataFrame with random data
df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})

# Import the DataFrame 'df' into the AG environment with the name 'imported_df'
session.private_import(data=df, name='imported_df')

Now the second dataset with NaNs is created:

# Randomly distributing NaNs in two columns with a probability of 0.5

choice = [1, 2, np.nan]
a = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
b = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])

# Create a DataFrame 'df_2' with random data and NaNs
df_2 = pd.DataFrame({'a': a, 'b': b})

# Import the DataFrame 'df_2' into the AG environment with the name 'imported_df_2'
session.private_import(data=df_2, name="imported_df_2")

The private_import function imports these DataFrames into the AG environment with specified names (imported_df and imported_df_2).

Now, Metadata bounds are defined for columns 'age' and 'salary' of the DataFrame imported_df, and a PrivateDataFrame priv_df is created from the DataFrame imported_df, ensuring that the data remains private and secure within the AG environment.

# Create a PrivateDataFrame 'priv_df' from the imported DataFrame 'imported_df'
metadata = {
    'age': (0, 80),     # Define metadata bounds for the 'age' column
    'salary': (1, 200000)   # Define metadata bounds for the 'salary' column
}

priv_df = PrivateDataFrame(imported_df, metadata=metadata)

By leveraging the private_import function and creating PrivateDataFrames, users can seamlessly work with external data while maintaining privacy.

Continue the op_pandas guide.

Access the Managing Data to continue following the op_pandas guide.

Managing Data

This section presents the op_pandas library guide and addresses data management tasks.

Viewing Data

Records in PrivateDataFrame and PrivateSeries cannot be viewed directly to protect privacy. However, users can still analyze and obtain statistical information about the data using methods that offer differential privacy guarantees.

Printing details about the data

Inspect PrivateDataFrame structure using ag_print

To print details about the data, such as columns, metadata, and data types, within the AG environment, users can use the ag_print function. It can quickly inspect the details of their data within the AG environment, facilitating data analysis and exploration.

See the following example:

%%ag

ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)

When executed:

>>> Columns:
Index(['name', 'age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
name      object
age        int64
salary     int64
dtype: object

Generating quick statistics

Generate differentially private statistical summaries using describe() method with epsilon budget

Users can obtain quick statistics about your dataset using the describe() method in pandas. By spending some epsilon, you can get an idea about the statistical details of the dataset.

See the following example:

%%ag

priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)

When executed:

>>>
                age         salary
count  10011.000000   10011.000000
mean      39.430439   49728.643224
std       23.065769   28405.891863
min        0.000000     583.443063
25%       19.904847   24461.587152
50%       35.801599   50159.255936
75%       60.452492   74499.712660
max       77.701005  147580.823075

Statistics can be viewed by exporting the non-private result to the local Jupyter server:

%%ag

export(priv_describe, name='priv_describe')

Setting up exported variable in local environment: priv_describe

print(priv_describe)

>>>
                age         salary
count  10011.000000   10011.000000
mean      39.430439   49728.643224
std       23.065769   28405.891863
min        0.000000     583.443063
25%       19.904847   24461.587152
50%       35.801599   50159.255936
75%       60.452492   74499.712660
max       77.701005  147580.823075

Cleaning Data

Users can use the dropna method to remove any records that contain NaN values in any of its features.

%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})

export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')

>>> Setting up exported variable in local environment: result

/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._df[col].clip(lower=self._metadata[col][0], upper=self._metadata[col][1], inplace=True)

print(result)

>>>
                 a            b
count  2490.000000  2490.000000
mean      1.495172     1.502561
std       0.496467     0.483317
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.264635     1.668942
75%       1.985666     1.975653
max       1.997052     1.993214

print(df_2.dropna(axis=0).describe())

>>>
                 a            b
count  2490.000000  2490.000000
mean      1.492369     1.503213
std       0.500042     0.500090
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.000000     2.000000
75%       2.000000     2.000000
max       2.000000     2.000000

Columns from the PrivateDataFrame can also be dropped using the drop functionality, just like this:

%%ag
ag_print("Columns before: ", priv_df.columns)

temp_df = priv_df.drop(['name'], inplace = False)

ag_print("Columns After Dropping name: ", temp_df.columns)

Response:

>>> Columns before:  Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name:  Index(['age', 'salary'], dtype='object')

Selecting Data

Setting values that affect a particular set of records is not allowed in PrivateDataFrame or PrivateSeries. However, transformation functions can be applied using PrivateDataFrame.ApplyMap or PrivateSeries.Map.

Getting Data Objects from a single column

Select individual columns from PrivateDataFrame to create PrivateSeries objects with preserved metadata

To select a single column from a DataFrame and obtain a PrivateSeries, equivalent to df["A"], use the following approach:

%%ag

# Select the 'age' column from the PrivateDataFrame 'priv_df' and obtain a PrivateSeries
priv_s = priv_df['age']

# Print quick statistics and metadata of the PrivateSeries 'priv_s'
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)

When executed:

>>> Describe:
count    9988.000000
mean       39.472473
std        22.735391
min         0.000000
25%        21.384668
50%        37.840602
75%        55.273188
max        78.730881
Name: series, dtype: float64
Metadata:
(0, 80)

In this example:

The priv_df['age'] syntax is used to select the 'age' column from the PrivateDataFrame priv_df, resulting in a PrivateSeries priv_s.
The describe() method is applied to the PrivateSeries priv_s with an epsilon value of 1 to obtain quick statistics about the data.
The output displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for the 'age' column.
The metadata bounds for the 'age' column are also printed, indicating that valid values range from 0 to 80.

Getting data objects from various columns

Select multiple columns from PrivateDataFrame to create filtered PrivateDataFrame subsets

To select a collection of columns from a DataFrame and obtain a PrivateDataFrame, you can use the following approach:

%%ag

# Select the 'age' and 'salary' columns from the PrivateDataFrame 'priv_df' and obtain a PrivateDataFrame 'priv_df_filtered'
priv_df_filtered = priv_df[['age', 'salary']]

# Print the columns and metadata of the PrivateDataFrame 'priv_df_filtered'
ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)

When executed:

>>> Columns:
Index(['age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}

In this example:

The priv_df[['age', 'salary']] syntax is used to select the 'age' and 'salary' columns from the PrivateDataFrame priv_df, resulting in a PrivateDataFrame priv_df_filtered.
The columns attribute of the PrivateDataFrame priv_df_filtered is printed to display the selected columns.
The metadata attribute of the PrivateDataFrame priv_df_filtered is printed to show the metadata bounds for each selected column.

Applying transformation functions

Transform PrivateDataFrame and PrivateSeries data using applymap and map functions with custom transformation logic

To use Applymap on a PrivateDataFrame, use the following approach:

%%ag

# Define a transformation function 'func' to map strings to their lengths and numerical values to their halves
def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0

# Apply the transformation function 'func' to each element of the PrivateDataFrame 'priv_df' using applymap
result = priv_df.applymap(func, eps=1)

# Print the metadata and quick statistics of the resulting PrivateDataFrame 'result'
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps=1))

When executed:

>>> Metadata:
{'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}

Describe:
                name           age        salary
count  10001.000000  10001.000000  10001.000000
mean       9.979915     19.763231  25068.695612
std        0.274668     10.477926  16333.537130
min        8.000000      0.000000    686.474307
25%       10.000000      8.740750  12489.945476
50%       10.000000     18.864365  24417.320448
75%       10.000000     28.283437  37265.078587
max       10.000000     35.126989  60705.413286

In this example:

The func function is defined to map strings to their lengths and numerical values to their halves.
The applymap() function is used to apply the transformation function func to each element of the PrivateDataFrame priv_df.
The resulting PrivateDataFrame result contains transformed values.
The metadata of the resulting PrivateDataFrame result is printed, showing the updated metadata bounds.
Quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the PrivateDataFrame result are printed, providing insights into the transformed data.

To use applymap on a PrivateSeries. The mapping can be done using a dictionary for 1:1 mapping or via a callable method.

%%ag

# Define a mapping function 'series_map' to halve numerical values
def series_map(x: int) -> float:
    return x / 2

# Apply the mapping function 'series_map' to the 'age' column of the PrivateDataFrame 'priv_df'
priv_df['age'] = priv_df['age'].map(series_map, eps=1)

# Print the metadata of the updated 'age' column in 'priv_df'
ag_print("Metadata:\n", priv_df.metadata)

When executed:

>>> Metadata:
{'age': (0.0, 64.0), 'salary': (1, 200000)}

In this example:

The series_map function is defined to halve numerical values.
The map() function is used to apply the mapping function series_map to the 'age' column of the PrivateDataFrame priv_df.
The metadata of the 'age' column in the updated PrivateDataFrame priv_df is printed, showing the updated metadata bounds.

Continue the op_pandas guide.

See the Operations page to continue following the op_pandas guide.

Operations

This section continues by presenting the op_pandas library guide and addressing some of the available operations you can perform on PrivateDataFrame and PrivateSeries objects.

Unary Ops

Users can perform unary operations such as ~, -, +, and abs on PrivateDataFrames and PrivateSeries. These operations apply element-wise to the data.

Operator	Description
`~`	The `~` operator performs the bitwise negation operation.
`-`	The `-` operator performs the arithmetic negation operation.
`+`	The `+` operator performs the arithmetic addition operation.
`abs()`	The `abs()` function calculates the absolute value of each element.

See the following example:

%%ag

# Export the quick statistics of the original PrivateDataFrame 'priv_df_2' and its negative counterpart
export(priv_df_2.describe(eps=2), 'original')
export((-priv_df_2).describe(eps=2), 'negative')

When executed:

>>>
Setting up exported variable in local environment: original
Setting up exported variable in local environment: negative

# Rename columns of the negative DataFrame for clarity
negative.columns = ["a_neg", "b_neg"]

# Join the original and negative DataFrames and print the result
print(original.join(negative, how="left"))

Output:

>>>
                  a             b         a_neg         b_neg
count  10000.000000  10000.000000  10000.000000  10000.000000
mean       1.498597      1.504388     -1.500690     -1.503697
std        0.494197      0.498858      0.498402      0.499788
min        1.000000      1.000000     -1.000000     -1.000000
25%        1.000000      1.000000     -1.001932     -1.001540
50%        1.635791      1.891447     -1.678784     -1.161680
75%        1.991538      1.997409     -1.996417     -1.996500
max        1.992424      1.997140     -1.999776     -1.999894

Where:

The quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the original PrivateDataFrame priv_df_2 and its negative counterpart are exported to the local environment.
The negative DataFrame is created by applying the unary - operator to the original PrivateDataFrame priv_df_2.
The columns of the negative DataFrame are renamed for clarity.
The original and negative DataFrames are joined together, and the result is printed, showing the element-wise application of the unary - operator.

Binary Ops

Users can apply binary operations using scalars and PrivateDataFrames against PrivateDataFrames. See the example below:

%%ag

# Select the 'age' and 'salary' columns from the PrivateDataFrame 'priv_df' and obtain a PrivateDataFrame 'pdf'
pdf = priv_df[['age', 'salary']]

# Perform binary operations on 'pdf' with a mix of scalars and 'pdf' itself
result1 = pdf + (10 * pdf)  # Expected min-max: Age: (0, 704), Salary: (11, 2200000)
result2 = result1 / 1000    # Expected min-max: Age: (0, 0.704), Salary: (0.011, 2200)

# Print the metadata of the resulting PrivateDataFrames 'result1' and 'result2'
ag_print("Result1 metadata: \n", result1.metadata)
ag_print("Result2 metadata: \n", result2.metadata)

When executed:

>>>
Result1 metadata:
 {'age': (0.0, 704.0), 'salary': (11, 2200000)}
Result2 metadata:
 {'age': (0.0, 0.704), 'salary': (0.011, 2200.0)}

In it:

The 'age' and 'salary' columns are selected from the PrivateDataFrame priv_df to create a new PrivateDataFrame pdf.
Binary operations are performed on pdf using a mix of scalars and pdf.
result1 is obtained by adding pdf with 10 times pdf, and result2 is obtained by dividing result1 by 1000.
The resulting PrivateDataFrames result1 and result2 metadata are printed, showing the updated metadata bounds after the binary operations.

Bitwise Ops

Users can apply bitwise operations using scalars and PrivateDataFrames against PrivateDataFrames. These operations apply element-wise to the data.

See the following example:

%%ag
import numpy as np
import pandas as pd

# Create two PrivateSeries with randomly sampled integer data containing values in the range (0,1)
priv_ser_1 = PrivateSeries(pd.Series(np.random.randint(0, 2, 10000)), metadata=(0, 1))
priv_ser_2 = PrivateSeries(pd.Series(np.random.randint(0, 2, 10000)), metadata=(0, 1))

# Print the description of the first PrivateSeries
ag_print("Describe of private Series 1: \n", priv_ser_1.describe(eps=1))

# Print the description of the second PrivateSeries
ag_print("Describe of private Series 2: \n", priv_ser_2.describe(eps=1))

# Apply the bitwise AND operation between priv_ser_1 and priv_ser_2 and store the result in 'result'
result = priv_ser_1 & priv_ser_2

# Print the description of the resulting PrivateSeries
ag_print("Describe of the result: \n", result.describe(eps=1))

When executed:

>>>
Describe of private Series 1:
 count    9.998000e+03
mean     1.571300e-03
std      1.998231e-02
min      0.000000e+00
25%      4.656613e-10
50%      4.656613e-10
75%      4.656613e-10
max      4.656613e-10
Name: series, dtype: float64

Describe of private Series 2:
 count    1.000500e+04
mean     5.608570e-04
std      4.612496e-02
min      0.000000e+00
25%      4.656613e-10
50%      4.656613e-10
75%      4.656613e-10
max      4.656613e-10
Name: series, dtype: float64

Describe of the result:
 count    1.000700e+04
mean     3.952059e-04
std      2.277582e-02
min      0.000000e+00
25%      4.656613e-10
50%      4.656613e-10
75%      4.656613e-10
max      4.656613e-10
Name: series, dtype: float64

In it:

Two PrivateSeries priv_ser_1 and priv_ser_2 are created with randomly sampled integer data containing values in the range (0,1).
The descriptions of both PrivateSeries are printed, displaying the count, mean, std, min, 25%, 50%, 75%, and max values.
The bitwise AND operation (&) is applied between priv_ser_1 and priv_ser_2, and the result is stored in result.
The description of the resulting PrivateSeries result is printed, showing the statistics of the element-wise bitwise AND operation.

Continue the op_pandas guide.

See the Functions, Joins and Statistical Methods page to continue following the op_pandas guide.

Functions, Joins and Statistical Methods

This section continue presenting the op_pandas library guide, addressing functions, joins and statistical methods.

General Functions

op_pandas comes packaged with some useful functions such as:

`concat`

Concatenate multiple PrivateDataFrames vertically over indices with same column datatypes

Users can concatenate PrivateDataFrames, by creating a copy of the dataframe and then concatenating it with the original. See the example below:

First, create a DataFrame df with random data:

arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
res = ''.join(random.choices(string.ascii_lowercase, k=N))
arr_name.append(res)

df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})

session.private_import(data=df, name='imported_df_copy')

>>> dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel

Next, create a PrivateDataFrame priv_df_copy using the imported DataFrame:

%%ag
metadata = {
    'age': (0, 80),
    'salary': (1, 200000)
}

priv_df_copy = PrivateDataFrame(imported_df_copy, metadata=metadata)

Then, print the counts of records in the original copied PrivateDataFrames, and concatenate them:

%%ag
ag_print("First DF count:\n", priv_df.count(eps=1))
ag_print("Second DF count:\n", priv_df_copy.count(eps=1))

import op_pandas
concat_df = op_pandas.concat([priv_df, priv_df_copy])
ag_print("Concat DF count: \n", concat_df.count(eps=1))

The resulting output should be as follows:

>>>
First DF count:
name      10000
age        9999
salary     9988
dtype: int64

Second DF count:
name       9998
age        9999
salary    10001
dtype: int64

Concat DF count:
name      19997
age       20000
salary    20004
dtype: int64

concat is only used to concatenate PrivateDataFrames over indices.

Datatypes

The datatypes of the same columns must be the same across all data frames.

`train_test_split`

Split a PrivateDataFrame into training and testing subsets for machine learning workflows

Users can utilize op_pandas.train_test_split to divide a PrivateDataFrame into train and test splits. As in the example below:

%%ag

priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)

ag_print("Count of train split:\n", priv_df_train.count(eps=1))
ag_print("Count of test split:\n", priv_df_test.count(eps=1))

Output when executed:

>>
Count of train split:
name      7501
age       7503
salary    7500
dtype: int64

Count of test split:
name      2492
age       2499
salary    2491
dtype: int64

In this example, op_pandas.train_test_split splits the PrivateDataFrame priv_df into train and test sets. The counts of records in each split are printed, showing the distribution of data between the training and testing subsets.

Join and Where

This section explores different operations related to joining PrivateDataFrame and PrivateSeries objects and using the WHERE and JOIN method for filtering data.

`JOIN` on Private objects

Joining two PrivateDataframes

%%ag
"""
pdf: {randomly sampled}
    columns = ['A','B','C']
    metadata = (0,100) for all columns
    size = 1000

pdf2: {randomly sampled}
    columns = ['E','F']
    metadata = (-100,100) for all columns
    size = 1000

ps: {randomly sampled series}
    name = 'series'
    metadata = (0,100)
    size = 1000
"""

pdf = PrivateDataFrame(pd.DataFrame({"A": np.random.randint(0,100,1000), "B": np.random.randint(0,100,1000), "C": np.random.randint(0,100,1000)}), metadata = {"A": (0, 100), "B": (0, 100), "C": (0, 100)} , foreign_keys={"A": (2, "")} )
pdf2 = PrivateDataFrame(pd.DataFrame({"E": np.random.randint(-100,100,1000), "F": np.random.randint(-100,100,1000)}), metadata = {"E": (-100, 100), "F": (-100, 100)} , foreign_keys={"E": (1, "")} )
ps = PrivateSeries(pd.Series(np.random.randint(0,100,1000)), metadata=(0,100) , foreign_keys={"A": (2, "")} )

*foregin keys should be specified on the metadata with maximum frequency attribute.
*on parameter must be provided in join , if you want to join to dataframe with different column name use merge.

%%ag #join with dataframe
result = pdf.join(pdf2,on="A" , how="outer")
ag_print("Outer Join of two PDFs Describe:\n",result.describe(eps=1))

>>> Outer Join of two PDFs Describe:
                A           B           C           A           F
count  978.000000  978.000000  978.000000  978.000000  978.000000
mean    45.078161   48.373313   47.395728    2.918509    0.379322
std     41.659230   39.782047    3.978732   55.174294   76.570248
min      0.000000    0.000000    0.000000 -100.000000 -100.000000
25%     25.835038   16.814188   13.689315  -55.716763  -79.960320
50%     45.534085   55.195066   57.023430    3.266460   -5.226773
75%     62.600983   63.868197   68.137527   40.907116   58.650339
max     93.493320   94.355744   79.755250   58.467962   89.288805

Joining PrivateDataframe with PrivateSeries:

%%ag # join with series
result = pdf.join(ps, on="A" , how="inner")
ag_print("Inner Join of PDF and Series Describe:\n",result.describe(eps=1))

>>> Inner Join of PDF and Series Describe:
                A           B           C      series
count  995.000000  995.000000  995.000000  995.000000
mean    50.461822   50.737590   44.956548   52.693788
std     28.974945   35.573564   28.876807   39.639266
min      0.000000    0.000000    0.000000    0.000000
25%     11.820072   43.155183   48.520076    8.472153
50%     65.126290   35.547530   28.150633   60.640127
75%     89.179010   85.479518   50.462041   68.068637
max     97.001126   79.362787   92.063356   75.326474

`WHERE` Query

Filter PrivateDataFrames based on conditional expressions

%%ag

result = pdf.where(pdf > 0)
ag_print("Where Query Result:\n", result.describe(eps=1))

>>> Where Query Result:
                A            B            C
count  1001.000000  1001.000000  1001.000000
mean     48.506532    50.754384    53.904313
std      42.434438    24.036048    27.686005
min       0.000000     0.000000     0.000000
25%      23.367422    21.512664     4.402222
50%      33.757446    52.456891    22.643254
75%      65.482272    81.311059    74.171133
max      92.626234    84.918578    98.803625

Merge on Private objects

Merging two PrivateDataFrames:

%%ag
"""
pdf1: {randomly sampled}
    columns = ['A','B','C']
    metadata = (0,100) for all columns
    size = 1000

pdf2: {randomly sampled}
    columns = ['E','D','C']
    metadata = (0,100) for all columns
    size = 1000
"""

pdf1 = PrivateDataFrame(pd.DataFrame({"A": np.random.randint(0,100,1000), "B": np.random.randint(0,100,1000), "C": np.random.choice(['p', 'q', 'r'], 1000)}), metadata = {"A": (0, 100), "B": (0, 100), "C": (0, 100)} , foreign_keys={"A": (1, "table1")} )
pdf2 = PrivateDataFrame(pd.DataFrame({"E": np.random.randint(0,100,1000), "D": np.random.randint(0,100,1000), "C": np.random.choice(['r', 's', 'p'], 1000)}), metadata = {"E": (0, 100), "D": (0, 100), "C": (0, 100)} , foreign_keys={"E": (3, "table2")} )

%%ag #merge with dataframe
merged_df = pdf1.merge(pdf2, how="right", left_on="A", right_on="E")
ag_print("Merged DataFrame Describe:\n", merged_df.describe(eps=1))

>>> Merged DataFrame Describe:
                 A           B           C           E           D
count  978.000000  978.000000  978.000000  978.000000  978.000000
mean    45.078161   48.373313   47.395728    2.918509    0.379322
std     41.659230   39.782047    3.978732   55.174294   76.570248
min      0.000000    0.000000    0.000000 -100.000000 -100.000000
25%     25.835038   16.814188   13.689315  -55.716763  -79.960320
50%     45.534085   55.195066   57.023430    3.266460   -5.226773
75%     62.600983   63.868197   68.137527   40.907116   58.650339
max     93.493320   94.355744   79.755250   58.467962   89.288805

In this example, the merge method is used to combine pdf1 and pdf2 on columns A and E with a right join. The resulting merged DataFrame is described using the describe method with an epsilon value of 1 to ensure differential privacy.

Statistical Methods

PrivateDataframe and PrivateSeries support various statistical methods with APIs similar to pandas.DataFrame and pandas. Series, such as:

sum, mean, count
std, var
quantile, percentile , median
bounded min and max can be achieved using percentile(0) and percentile(100) respectively.
hist and hist2d.
cov (covariance), skew (skewness), corr (correlation).

Basic Statistics

Calculate basic differentially private statistics like variance, count, and percentiles on PrivateSeries

Users can generate statistical insights about the data while preserving privacy using differentially private methods such as variance, count, percentile on a PrivateSeries. See the following example:

%%ag
var = ps.var(eps=1)
count = ps.count(eps=1)

ag_print(f"variance = {var} , count = {count}")

>>>
variance = 806.7200376847192 , count = 100

In it, the differentially private variance (var) and count (count) of the PrivateSeries ps are calculated with an epsilon value of 1. These values provide statistical insights about the data while preserving differential privacy.

It is possible to find the bounded max and min values in the following way:

%%ag
min = ps.percentile(eps=0.1, p=0)
max = ps.percentile(eps=0.1, p=100)

ag_print(f"min = {min} , max = {max}") ### Actual min, max : (0,100)

>>>
min = 3.7421055945821973 , max = 99.0959021484033

In this example, the differentially private bounded minimum (min) and maximum (max) of the PrivateSeries ps are calculated using the percentile method with an epsilon value of 0.1 and percentiles 0 and 100, respectively. These values provide bounds on the minimum and maximum values in the dataset while preserving differential privacy.

Advanced Statistics

Perform complex statistical calculations including covariance, correlation, and skewness with differential privacy

Users can perform complex differentially private statistical calculations such as:

Covariance
Skewness
Correlation

Below, we demonstrate finding the correlation matrix of a PrivateDataFrame and compare the results with the expected initial result:

The correlation matrix of the PrivateDataFrame priv_df is calculated with an epsilon value of 3 using the corr method.

The resulting correlation matrix is exported to the local environment and printed.

%%ag

# Calculate the correlation matrix of the PrivateDataFrame priv_df with an epsilon value of 3
result = priv_df.corr(eps=3)

# Export the result to the local environment
export(result, 'private_result')

>>>
Setting up exported variable in local environment: private_result

# Print the calculated correlation matrix
print(private_result)

>>>            age    salary
age          1.0  0.032151
salary  0.032151       1.0

The correlation matrix of the original DataFrame df[['age', 'salary']] is calculated using pandas.

# Print the correlation matrix of the original DataFrame
print(df[['age', 'salary']].corr())

>>>             age    salary
age     1.000000  0.006333
salary  0.006333  1.000000

The calculated correlation matrices from both the private and original datasets are compared.

Histograms

Generate differentially private histograms from PrivateDataFrames for data visualization and distribution analysis

Users can generate differentially private histograms using PrivateDataFrame.hist().

%%ag

hist_data = priv_df.hist(column='salary',eps=0.1)
export(hist_data , 'hist_data')

>>> Setting up exported variable in local environment: hist_data

Users can use Matplotlib or any other plotting library to visualize the histogram locally.

import matplotlib.pyplot as plt
dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins)*0.8, align='edge')

# Display the plot
plt.show()

Overview​

Importing Data​

Object Creation​

Importing from the AGENT Jupyter Server​

Importing from a pandas.Series​

Importing from a pandas.DataFrame​

Importing from the local Jupyter session​

Managing Data​

Viewing Data​

Printing details about the data​

Generating quick statistics​

Cleaning Data​

Selecting Data​

Getting Data Objects from a single column​

Getting data objects from various columns​

Applying transformation functions​

Operations​

Unary Ops​

Binary Ops​

Bitwise Ops​

Functions, Joins and Statistical Methods​

General Functions​

concat​

train_test_split​

Join and Where​

JOIN on Private objects​

WHERE Query​

Merge on Private objects​

Statistical Methods​

Basic Statistics​

Advanced Statistics​

Histograms​

Overview

Importing Data

Object Creation

Importing from the AGENT Jupyter Server

Importing from a pandas.Series

Importing from a pandas.DataFrame

Importing from the local Jupyter session

Managing Data

Viewing Data

Printing details about the data

Generating quick statistics

Cleaning Data

Selecting Data

Getting Data Objects from a single column

Getting data objects from various columns

Applying transformation functions

Operations

Unary Ops

Binary Ops

Bitwise Ops

Functions, Joins and Statistical Methods

General Functions

`concat`

`train_test_split`

Join and Where

`JOIN` on Private objects

`WHERE` Query

Merge on Private objects

Statistical Methods

Basic Statistics

Advanced Statistics

Histograms