Skip to main content

Pandas

Requirements

Before starting the guide, finish the First Steps to Use AGENT guide.

Overview

AGENT provides a comprehensive guide to present all the main functionalities of op_pandas. The guide is divided in four parts, to have a better experience, follow the proposed order:

Pandas is a widespread open-source data manipulation and analysis library for Python. It provides data structures and functions to efficiently handle and manipulate structured data, such as tables or time series. Pandas offers powerful data cleaning, transformation, filtering, merging, and aggregation tools. It is widely used in data science, machine learning, and other domains for data preprocessing and analysis tasks, making it a valuable tool for working with structured data in Python.

AGENT provides a differentially private version of the Pandas Library (op_pandas), which lets users handle private data frames and series and perform various statistical analyses with differential privacy guarantees. Users familiar with Pandas will find minimal difficulty adjusting to the API methods.

To use AGENT’s op_pandas, you can import the library as presented in the following code block:

%%ag
from op_pandas import PrivateDataFrame, PrivateSeries
Info

Private datasets can be loaded as PrivateDataFrames and PrivateSeries.

Resources

The following are additional resources that will be helpful when using the Pandas library:

Importing Data

The op_pandas library allows users to import datasets efficiently. This page will showcase some of the available ways to do so.

Requirements

Before continue following the steps described in this page, be sure to have finished the First Steps to Use AGENT

Object Creation

Users can import and load datasets from different sources, such as:

Importing from the AGENT Jupyter Server

The load_dataset() lets users obtain a dataset and the required data structures from the Antigranular server.

info

Private data structures cannot be exported to the local environment. Unless a differentially private measure is applied to obtain a non-private data frame.

You can use the load_dataset() function to load any dataset, as shown in the following code block:

%%ag
from op_pandas import PrivateDataFrame, PrivateSeries

# Obtaining the dictionary containing private objects
response = load_dataset("<dataset_name>", "<team_name>")

# Response will be a PDF, and will be using the budget allocated to the user from "<team_name>" team.

Importing from a pandas.Series

When creating a PrivateSeries, it's recommended to set metadata bounds to define the range of valid values for the series. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.

See an example in the following code block:

%%ag
import pandas as pd
s = pd.Series([1,5,8,2,9] , name='Test_series')
priv_s = PrivateSeries(series=s,metadata=(0,10))

By setting metadata bounds, you control the valid range of values within the series, enhancing privacy and security while working with sensitive data.

Importing from a pandas.DataFrame

Just as with PrivateSeries, setting metadata bounds when creating a PrivateDataFrame is recommended. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.

See an example in the following code block:

%%ag
import pandas as pd
data = {
'Age':[20,30,40,25,30,25,26,27,28,29],
'Salary':[35000,60000,100000,55000, 35000,60000,100000,55000,35000,60000],
'Sex':['M','F','M','F', 'M','F','M','F', 'M', 'F']
}
metadata = {
'Age':(18,65),
'Salary':(20000,200000)
}
categorical_metadata = {
'Sex':['M','F']
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df , metadata=metadata, categorical_metadata=categorical_metadata)

In the example:

  • Data for the DataFrame is defined, including columns for 'Age', 'Salary', and 'Sex'.
  • Metadata bounds are specified for each column:
    • For the 'Age' column, the valid range is set from 18 to 65.
    • For the 'Salary' column, the valid range is set from 20000 to 200000.
  • Categorical Metadata is specified for 'Sex' column.
  • A pandas DataFrame (df) is created using the defined data.
  • Using the PrivateDataFrame constructor, a privateDataFrame (priv_df) is created from the pandas.DataFrame df, with specified metadata bounds.

By setting metadata bounds, users can ensure that each column in the dataframe contains values within predefined limits, enhancing data integrity and security.

Importing from the local Jupyter session

Users can import external data from their local Jupyter session within the AG environment. This allows seamless data integration into the AG environment while maintaining privacy and security.

See an example below:

  1. Random data is generated to create two pandas DataFrames, df and df_2, representing different datasets.

    import pandas as pd
    import numpy as np
    import string
    import random

    # Generate random names, ages, and salaries for the DataFrame
    arr_name = []
    n_num = 10000
    N = 10
    for i in range(n_num):
    res = ''.join(random.choices(string.ascii_lowercase, k=N))
    arr_name.append(res)

    # Create a DataFrame with random data
    df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})

    # Import the DataFrame 'df' into the AG environment with the name 'imported_df'
    session.private_import(data=df, name='imported_df')

    Now the second dataset with NaNs is created:

    # Randomly distributing NaNs in two columns with a probability of 0.5

    choice = [1, 2, np.nan]
    a = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
    b = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])

    # Create a DataFrame 'df_2' with random data and NaNs
    df_2 = pd.DataFrame({'a': a, 'b': b})

    # Import the DataFrame 'df_2' into the AG environment with the name 'imported_df_2'
    session.private_import(data=df_2, name="imported_df_2")

    The private_import function imports these DataFrames into the AG environment with specified names (imported_df and imported_df_2).

  2. Now, Metadata bounds are defined for columns 'age' and 'salary' of the DataFrame imported_df, and a PrivateDataFrame priv_df is created from the DataFrame imported_df, ensuring that the data remains private and secure within the AG environment.

    # Create a PrivateDataFrame 'priv_df' from the imported DataFrame 'imported_df'
    metadata = {
    'age': (0, 80), # Define metadata bounds for the 'age' column
    'salary': (1, 200000) # Define metadata bounds for the 'salary' column
    }

    priv_df = PrivateDataFrame(imported_df, metadata=metadata)

By leveraging the private_import function and creating PrivateDataFrames, users can seamlessly work with external data while maintaining privacy.

Continue the op_pandas guide.

Access the Managing Data to continue following the op_pandas guide.

Managing Data

This section presents the op_pandas library guide and addresses data management tasks.

Viewing Data

Records in PrivateDataFrame and PrivateSeries cannot be viewed directly to protect privacy. However, users can still analyze and obtain statistical information about the data using methods that offer differential privacy guarantees.

Printing details about the data

Inspect PrivateDataFrame structure using ag_print

To print details about the data, such as columns, metadata, and data types, within the AG environment, users can use the ag_print function. It can quickly inspect the details of their data within the AG environment, facilitating data analysis and exploration.

See the following example:

%%ag

ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)

When executed:

>>> Columns:
Index(['name', 'age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
name object
age int64
salary int64
dtype: object

Generating quick statistics

Generate differentially private statistical summaries using describe() method with epsilon budget

Users can obtain quick statistics about your dataset using the describe() method in pandas. By spending some epsilon, you can get an idea about the statistical details of the dataset.

See the following example:

%%ag

priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)

When executed:

>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075

Statistics can be viewed by exporting the non-private result to the local Jupyter server:

%%ag

export(priv_describe, name='priv_describe')
Setting up exported variable in local environment: priv_describe
print(priv_describe)
>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075


Cleaning Data

Users can use the dropna method to remove any records that contain NaN values in any of its features.

%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})

export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')
>>> Setting up exported variable in local environment: result

/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._df[col].clip(lower=self._metadata[col][0], upper=self._metadata[col][1], inplace=True)
print(result)
>>>
a b
count 2490.000000 2490.000000
mean 1.495172 1.502561
std 0.496467 0.483317
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.264635 1.668942
75% 1.985666 1.975653
max 1.997052 1.993214
print(df_2.dropna(axis=0).describe())
>>>
a b
count 2490.000000 2490.000000
mean 1.492369 1.503213
std 0.500042 0.500090
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 2.000000
75% 2.000000 2.000000
max 2.000000 2.000000

Columns from the PrivateDataFrame can also be dropped using the drop functionality, just like this:

%%ag
ag_print("Columns before: ", priv_df.columns)

temp_df = priv_df.drop(['name'], inplace = False)

ag_print("Columns After Dropping name: ", temp_df.columns)

Response:

>>> Columns before:  Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name: Index(['age', 'salary'], dtype='object')

Selecting Data

Setting values that affect a particular set of records is not allowed in PrivateDataFrame or PrivateSeries. However, transformation functions can be applied using PrivateDataFrame.ApplyMap or PrivateSeries.Map.

Getting Data Objects from a single column

Select individual columns from PrivateDataFrame to create PrivateSeries objects with preserved metadata

To select a single column from a DataFrame and obtain a PrivateSeries, equivalent to df["A"], use the following approach:

%%ag

# Select the 'age' column from the PrivateDataFrame 'priv_df' and obtain a PrivateSeries
priv_s = priv_df['age']

# Print quick statistics and metadata of the PrivateSeries 'priv_s'
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)

When executed:

>>> Describe:
count 9988.000000
mean 39.472473
std 22.735391
min 0.000000
25% 21.384668
50% 37.840602
75% 55.273188
max 78.730881
Name: series, dtype: float64
Metadata:
(0, 80)

In this example:

  • The priv_df['age'] syntax is used to select the 'age' column from the PrivateDataFrame priv_df, resulting in a PrivateSeries priv_s.
  • The describe() method is applied to the PrivateSeries priv_s with an epsilon value of 1 to obtain quick statistics about the data.
  • The output displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for the 'age' column.
  • The metadata bounds for the 'age' column are also printed, indicating that valid values range from 0 to 80.

Getting data objects from various columns

Select multiple columns from PrivateDataFrame to create filtered PrivateDataFrame subsets

To select a collection of columns from a DataFrame and obtain a PrivateDataFrame, you can use the following approach:

%%ag

# Select the 'age' and 'salary' columns from the PrivateDataFrame 'priv_df' and obtain a PrivateDataFrame 'priv_df_filtered'
priv_df_filtered = priv_df[['age', 'salary']]

# Print the columns and metadata of the PrivateDataFrame 'priv_df_filtered'
ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)

When executed:

>>> Columns:
Index(['age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}

In this example:

  • The priv_df[['age', 'salary']] syntax is used to select the 'age' and 'salary' columns from the PrivateDataFrame priv_df, resulting in a PrivateDataFrame priv_df_filtered.
  • The columns attribute of the PrivateDataFrame priv_df_filtered is printed to display the selected columns.
  • The metadata attribute of the PrivateDataFrame priv_df_filtered is printed to show the metadata bounds for each selected column.

Applying transformation functions

Transform PrivateDataFrame and PrivateSeries data using applymap and map functions with custom transformation logic

To use Applymap on a PrivateDataFrame, use the following approach:

%%ag

# Define a transformation function 'func' to map strings to their lengths and numerical values to their halves
def func(x: str | int | float) -> float:
if isinstance(x, str):
return len(x)
elif isinstance(x, (int, float)):
return x / 2
return 0.0

# Apply the transformation function 'func' to each element of the PrivateDataFrame 'priv_df' using applymap
result = priv_df.applymap(func, eps=1)

# Print the metadata and quick statistics of the resulting PrivateDataFrame 'result'
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps=1))

When executed:

>>> Metadata:
{'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}

Describe:
name age salary
count 10001.000000 10001.000000 10001.000000
mean 9.979915 19.763231 25068.695612
std 0.274668 10.477926 16333.537130
min 8.000000 0.000000 686.474307
25% 10.000000 8.740750 12489.945476
50% 10.000000 18.864365 24417.320448
75% 10.000000 28.283437 37265.078587
max 10.000000 35.126989 60705.413286

In this example:

  • The func function is defined to map strings to their lengths and numerical values to their halves.
  • The applymap() function is used to apply the transformation function func to each element of the PrivateDataFrame priv_df.
  • The resulting PrivateDataFrame result contains transformed values.
  • The metadata of the resulting PrivateDataFrame result is printed, showing the updated metadata bounds.
  • Quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the PrivateDataFrame result are printed, providing insights into the transformed data.

To use applymap on a PrivateSeries. The mapping can be done using a dictionary for 1:1 mapping or via a callable method.

%%ag

# Define a mapping function 'series_map' to halve numerical values
def series_map(x: int) -> float:
return x / 2

# Apply the mapping function 'series_map' to the 'age' column of the PrivateDataFrame 'priv_df'
priv_df['age'] = priv_df['age'].map(series_map, eps=1)

# Print the metadata of the updated 'age' column in 'priv_df'
ag_print("Metadata:\n", priv_df.metadata)

When executed:

>>> Metadata:
{'age': (0.0, 64.0), 'salary': (1, 200000)}

In this example:

  • The series_map function is defined to halve numerical values.
  • The map() function is used to apply the mapping function series_map to the 'age' column of the PrivateDataFrame priv_df.
  • The metadata of the 'age' column in the updated PrivateDataFrame priv_df is printed, showing the updated metadata bounds.
Continue the op_pandas guide.

See the Operations page to continue following the op_pandas guide.

Operations

This section continues by presenting the op_pandas library guide and addressing some of the available operations you can perform on PrivateDataFrame and PrivateSeries objects.

Unary Ops

Users can perform unary operations such as ~, -, +, and abs on PrivateDataFrames and PrivateSeries. These operations apply element-wise to the data.

OperatorDescription
~The ~ operator performs the bitwise negation operation.
-The - operator performs the arithmetic negation operation.
+The + operator performs the arithmetic addition operation.
abs()The abs() function calculates the absolute value of each element.

See the following example:

%%ag

# Export the quick statistics of the original PrivateDataFrame 'priv_df_2' and its negative counterpart
export(priv_df_2.describe(eps=2), 'original')
export((-priv_df_2).describe(eps=2), 'negative')

When executed:

>>>
Setting up exported variable in local environment: original
Setting up exported variable in local environment: negative
# Rename columns of the negative DataFrame for clarity
negative.columns = ["a_neg", "b_neg"]

# Join the original and negative DataFrames and print the result
print(original.join(negative, how="left"))

Output:

>>>
a b a_neg b_neg
count 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.498597 1.504388 -1.500690 -1.503697
std 0.494197 0.498858 0.498402 0.499788
min 1.000000 1.000000 -1.000000 -1.000000
25% 1.000000 1.000000 -1.001932 -1.001540
50% 1.635791 1.891447 -1.678784 -1.161680
75% 1.991538 1.997409 -1.996417 -1.996500
max 1.992424 1.997140 -1.999776 -1.999894

Where:

  • The quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the original PrivateDataFrame priv_df_2 and its negative counterpart are exported to the local environment.
  • The negative DataFrame is created by applying the unary - operator to the original PrivateDataFrame priv_df_2.
  • The columns of the negative DataFrame are renamed for clarity.
  • The original and negative DataFrames are joined together, and the result is printed, showing the element-wise application of the unary - operator.

Binary Ops

Users can apply binary operations using scalars and PrivateDataFrames against PrivateDataFrames. See the example below:

%%ag

# Select the 'age' and 'salary' columns from the PrivateDataFrame 'priv_df' and obtain a PrivateDataFrame 'pdf'
pdf = priv_df[['age', 'salary']]

# Perform binary operations on 'pdf' with a mix of scalars and 'pdf' itself
result1 = pdf + (10 * pdf) # Expected min-max: Age: (0, 704), Salary: (11, 2200000)
result2 = result1 / 1000 # Expected min-max: Age: (0, 0.704), Salary: (0.011, 2200)

# Print the metadata of the resulting PrivateDataFrames 'result1' and 'result2'
ag_print("Result1 metadata: \n", result1.metadata)
ag_print("Result2 metadata: \n", result2.metadata)

When executed:

>>>
Result1 metadata:
{'age': (0.0, 704.0), 'salary': (11, 2200000)}
Result2 metadata:
{'age': (0.0, 0.704), 'salary': (0.011, 2200.0)}

In it:

  • The 'age' and 'salary' columns are selected from the PrivateDataFrame priv_df to create a new PrivateDataFrame pdf.
  • Binary operations are performed on pdf using a mix of scalars and pdf.
  • result1 is obtained by adding pdf with 10 times pdf, and result2 is obtained by dividing result1 by 1000.
  • The resulting PrivateDataFrames result1 and result2 metadata are printed, showing the updated metadata bounds after the binary operations.

Bitwise Ops

Users can apply bitwise operations using scalars and PrivateDataFrames against PrivateDataFrames. These operations apply element-wise to the data.

See the following example:

%%ag
import numpy as np
import pandas as pd

# Create two PrivateSeries with randomly sampled integer data containing values in the range (0,1)
priv_ser_1 = PrivateSeries(pd.Series(np.random.randint(0, 2, 10000)), metadata=(0, 1))
priv_ser_2 = PrivateSeries(pd.Series(np.random.randint(0, 2, 10000)), metadata=(0, 1))

# Print the description of the first PrivateSeries
ag_print("Describe of private Series 1: \n", priv_ser_1.describe(eps=1))

# Print the description of the second PrivateSeries
ag_print("Describe of private Series 2: \n", priv_ser_2.describe(eps=1))

# Apply the bitwise AND operation between priv_ser_1 and priv_ser_2 and store the result in 'result'
result = priv_ser_1 & priv_ser_2

# Print the description of the resulting PrivateSeries
ag_print("Describe of the result: \n", result.describe(eps=1))

When executed:

>>>
Describe of private Series 1:
count 9.998000e+03
mean 1.571300e-03
std 1.998231e-02
min 0.000000e+00
25% 4.656613e-10
50% 4.656613e-10
75% 4.656613e-10
max 4.656613e-10
Name: series, dtype: float64

Describe of private Series 2:
count 1.000500e+04
mean 5.608570e-04
std 4.612496e-02
min 0.000000e+00
25% 4.656613e-10
50% 4.656613e-10
75% 4.656613e-10
max 4.656613e-10
Name: series, dtype: float64

Describe of the result:
count 1.000700e+04
mean 3.952059e-04
std 2.277582e-02
min 0.000000e+00
25% 4.656613e-10
50% 4.656613e-10
75% 4.656613e-10
max 4.656613e-10
Name: series, dtype: float64

In it:

  • Two PrivateSeries priv_ser_1 and priv_ser_2 are created with randomly sampled integer data containing values in the range (0,1).
  • The descriptions of both PrivateSeries are printed, displaying the count, mean, std, min, 25%, 50%, 75%, and max values.
  • The bitwise AND operation (&) is applied between priv_ser_1 and priv_ser_2, and the result is stored in result.
  • The description of the resulting PrivateSeries result is printed, showing the statistics of the element-wise bitwise AND operation.
Continue the op_pandas guide.

See the Functions, Joins and Statistical Methods page to continue following the op_pandas guide.

Functions, Joins and Statistical Methods

This section continue presenting the op_pandas library guide, addressing functions, joins and statistical methods.

General Functions

op_pandas comes packaged with some useful functions such as:

concat

Concatenate multiple PrivateDataFrames vertically over indices with same column datatypes

Users can concatenate PrivateDataFrames, by creating a copy of the dataframe and then concatenating it with the original. See the example below:

1
First, create a DataFrame df with random data:
arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
res = ''.join(random.choices(string.ascii_lowercase, k=N))
arr_name.append(res)

df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})

session.private_import(data=df, name='imported_df_copy')
>>> dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel
2
Next, create a PrivateDataFrame priv_df_copy using the imported DataFrame:
%%ag
metadata = {
'age': (0, 80),
'salary': (1, 200000)
}

priv_df_copy = PrivateDataFrame(imported_df_copy, metadata=metadata)
3
Then, print the counts of records in the original copied PrivateDataFrames, and concatenate them:
%%ag
ag_print("First DF count:\n", priv_df.count(eps=1))
ag_print("Second DF count:\n", priv_df_copy.count(eps=1))

import op_pandas
concat_df = op_pandas.concat([priv_df, priv_df_copy])
ag_print("Concat DF count: \n", concat_df.count(eps=1))

The resulting output should be as follows:

>>>
First DF count:
name 10000
age 9999
salary 9988
dtype: int64

Second DF count:
name 9998
age 9999
salary 10001
dtype: int64

Concat DF count:
name 19997
age 20000
salary 20004
dtype: int64

concat is only used to concatenate PrivateDataFrames over indices.

Datatypes

The datatypes of the same columns must be the same across all data frames.

train_test_split

Split a PrivateDataFrame into training and testing subsets for machine learning workflows

Users can utilize op_pandas.train_test_split to divide a PrivateDataFrame into train and test splits. As in the example below:

%%ag

priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)

ag_print("Count of train split:\n", priv_df_train.count(eps=1))
ag_print("Count of test split:\n", priv_df_test.count(eps=1))

Output when executed:

>>
Count of train split:
name 7501
age 7503
salary 7500
dtype: int64

Count of test split:
name 2492
age 2499
salary 2491
dtype: int64

In this example, op_pandas.train_test_split splits the PrivateDataFrame priv_df into train and test sets. The counts of records in each split are printed, showing the distribution of data between the training and testing subsets.


Join and Where

This section explores different operations related to joining PrivateDataFrame and PrivateSeries objects and using the WHERE and JOIN method for filtering data.

JOIN on Private objects

Joining two PrivateDataframes
%%ag
"""
pdf: {randomly sampled}
columns = ['A','B','C']
metadata = (0,100) for all columns
size = 1000

pdf2: {randomly sampled}
columns = ['E','F']
metadata = (-100,100) for all columns
size = 1000

ps: {randomly sampled series}
name = 'series'
metadata = (0,100)
size = 1000
"""

pdf = PrivateDataFrame(pd.DataFrame({"A": np.random.randint(0,100,1000), "B": np.random.randint(0,100,1000), "C": np.random.randint(0,100,1000)}), metadata = {"A": (0, 100), "B": (0, 100), "C": (0, 100)} , foreign_keys={"A": (2, "")} )
pdf2 = PrivateDataFrame(pd.DataFrame({"E": np.random.randint(-100,100,1000), "F": np.random.randint(-100,100,1000)}), metadata = {"E": (-100, 100), "F": (-100, 100)} , foreign_keys={"E": (1, "")} )
ps = PrivateSeries(pd.Series(np.random.randint(0,100,1000)), metadata=(0,100) , foreign_keys={"A": (2, "")} )

*foregin keys should be specified on the metadata with maximum frequency attribute.
*on parameter must be provided in join , if you want to join to dataframe with different column name use merge.

%%ag #join with dataframe
result = pdf.join(pdf2,on="A" , how="outer")
ag_print("Outer Join of two PDFs Describe:\n",result.describe(eps=1))

>>> Outer Join of two PDFs Describe:
A B C A F
count 978.000000 978.000000 978.000000 978.000000 978.000000
mean 45.078161 48.373313 47.395728 2.918509 0.379322
std 41.659230 39.782047 3.978732 55.174294 76.570248
min 0.000000 0.000000 0.000000 -100.000000 -100.000000
25% 25.835038 16.814188 13.689315 -55.716763 -79.960320
50% 45.534085 55.195066 57.023430 3.266460 -5.226773
75% 62.600983 63.868197 68.137527 40.907116 58.650339
max 93.493320 94.355744 79.755250 58.467962 89.288805

Joining PrivateDataframe with PrivateSeries:

%%ag # join with series
result = pdf.join(ps, on="A" , how="inner")
ag_print("Inner Join of PDF and Series Describe:\n",result.describe(eps=1))

>>> Inner Join of PDF and Series Describe:
A B C series
count 995.000000 995.000000 995.000000 995.000000
mean 50.461822 50.737590 44.956548 52.693788
std 28.974945 35.573564 28.876807 39.639266
min 0.000000 0.000000 0.000000 0.000000
25% 11.820072 43.155183 48.520076 8.472153
50% 65.126290 35.547530 28.150633 60.640127
75% 89.179010 85.479518 50.462041 68.068637
max 97.001126 79.362787 92.063356 75.326474

WHERE Query

Filter PrivateDataFrames based on conditional expressions
%%ag

result = pdf.where(pdf > 0)
ag_print("Where Query Result:\n", result.describe(eps=1))

>>> Where Query Result:
A B C
count 1001.000000 1001.000000 1001.000000
mean 48.506532 50.754384 53.904313
std 42.434438 24.036048 27.686005
min 0.000000 0.000000 0.000000
25% 23.367422 21.512664 4.402222
50% 33.757446 52.456891 22.643254
75% 65.482272 81.311059 74.171133
max 92.626234 84.918578 98.803625


Merge on Private objects

Merging two PrivateDataFrames:

%%ag
"""
pdf1: {randomly sampled}
columns = ['A','B','C']
metadata = (0,100) for all columns
size = 1000

pdf2: {randomly sampled}
columns = ['E','D','C']
metadata = (0,100) for all columns
size = 1000
"""

pdf1 = PrivateDataFrame(pd.DataFrame({"A": np.random.randint(0,100,1000), "B": np.random.randint(0,100,1000), "C": np.random.choice(['p', 'q', 'r'], 1000)}), metadata = {"A": (0, 100), "B": (0, 100), "C": (0, 100)} , foreign_keys={"A": (1, "table1")} )
pdf2 = PrivateDataFrame(pd.DataFrame({"E": np.random.randint(0,100,1000), "D": np.random.randint(0,100,1000), "C": np.random.choice(['r', 's', 'p'], 1000)}), metadata = {"E": (0, 100), "D": (0, 100), "C": (0, 100)} , foreign_keys={"E": (3, "table2")} )

%%ag #merge with dataframe
merged_df = pdf1.merge(pdf2, how="right", left_on="A", right_on="E")
ag_print("Merged DataFrame Describe:\n", merged_df.describe(eps=1))

>>> Merged DataFrame Describe:
A B C E D
count 978.000000 978.000000 978.000000 978.000000 978.000000
mean 45.078161 48.373313 47.395728 2.918509 0.379322
std 41.659230 39.782047 3.978732 55.174294 76.570248
min 0.000000 0.000000 0.000000 -100.000000 -100.000000
25% 25.835038 16.814188 13.689315 -55.716763 -79.960320
50% 45.534085 55.195066 57.023430 3.266460 -5.226773
75% 62.600983 63.868197 68.137527 40.907116 58.650339
max 93.493320 94.355744 79.755250 58.467962 89.288805

In this example, the merge method is used to combine pdf1 and pdf2 on columns A and E with a right join. The resulting merged DataFrame is described using the describe method with an epsilon value of 1 to ensure differential privacy.


Statistical Methods

PrivateDataframe and PrivateSeries support various statistical methods with APIs similar to pandas.DataFrame and pandas. Series, such as:

  • summeancount
  • stdvar
  • quantilepercentile , median
  • bounded min and max can be achieved using percentile(0) and percentile(100) respectively.
  • hist and hist2d.
  • cov (covariance), skew (skewness), corr (correlation).

Basic Statistics

Calculate basic differentially private statistics like variance, count, and percentiles on PrivateSeries

Users can generate statistical insights about the data while preserving privacy using differentially private methods such as variance, count, percentile on a PrivateSeries. See the following example:

%%ag
var = ps.var(eps=1)
count = ps.count(eps=1)

ag_print(f"variance = {var} , count = {count}")
>>>
variance = 806.7200376847192 , count = 100

In it, the differentially private variance (var) and count (count) of the PrivateSeries ps are calculated with an epsilon value of 1. These values provide statistical insights about the data while preserving differential privacy.

It is possible to find the bounded max and min values in the following way:

%%ag
min = ps.percentile(eps=0.1, p=0)
max = ps.percentile(eps=0.1, p=100)

ag_print(f"min = {min} , max = {max}") ### Actual min, max : (0,100)
>>>
min = 3.7421055945821973 , max = 99.0959021484033

In this example, the differentially private bounded minimum (min) and maximum (max) of the PrivateSeries ps are calculated using the percentile method with an epsilon value of 0.1 and percentiles 0 and 100, respectively. These values provide bounds on the minimum and maximum values in the dataset while preserving differential privacy.

Advanced Statistics

Perform complex statistical calculations including covariance, correlation, and skewness with differential privacy

Users can perform complex differentially private statistical calculations such as:

  • Covariance
  • Skewness
  • Correlation

Below, we demonstrate finding the correlation matrix of a PrivateDataFrame and compare the results with the expected initial result:

  1. The correlation matrix of the PrivateDataFrame priv_df is calculated with an epsilon value of 3 using the corr method.

  2. The resulting correlation matrix is exported to the local environment and printed.

    %%ag

    # Calculate the correlation matrix of the PrivateDataFrame priv_df with an epsilon value of 3
    result = priv_df.corr(eps=3)

    # Export the result to the local environment
    export(result, 'private_result')
    >>>
    Setting up exported variable in local environment: private_result
    # Print the calculated correlation matrix
    print(private_result)
    >>>            age    salary
    age 1.0 0.032151
    salary 0.032151 1.0
  3. The correlation matrix of the original DataFrame df[['age', 'salary']] is calculated using pandas.

    # Print the correlation matrix of the original DataFrame
    print(df[['age', 'salary']].corr())
    >>>             age    salary
    age 1.000000 0.006333
    salary 0.006333 1.000000

The calculated correlation matrices from both the private and original datasets are compared.

Histograms

Generate differentially private histograms from PrivateDataFrames for data visualization and distribution analysis

Users can generate differentially private histograms using PrivateDataFrame.hist().

%%ag

hist_data = priv_df.hist(column='salary',eps=0.1)
export(hist_data , 'hist_data')
>>> Setting up exported variable in local environment: hist_data

Users can use Matplotlib or any other plotting library to visualize the histogram locally.

import matplotlib.pyplot as plt
dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins)*0.8, align='edge')

# Display the plot
plt.show()