Importing Data
Pandas
Before starting the guide, finish the First Steps to Use AGENT guide.
Overview
AGENT provides a comprehensive guide that presents the main functionalities of op_pandas.
The guide is divided into four parts. For the best experience, follow them in the proposed order:
Importing Data
Managing Data
Operations
Functions, Joins and Statistical Methods
Pandas is a widespread open-source data manipulation and analysis library for Python. It provides data structures and functions to efficiently handle and manipulate structured data, such as tables or time series. Pandas offers powerful data cleaning, transformation, filtering, merging, and aggregation tools. It is widely used in data science, machine learning, and other domains for data preprocessing and analysis tasks, making it a valuable tool for working with structured data in Python.
AGENT provides a differentially private version of the Pandas Library (op_pandas), which lets users handle private data frames and series and perform various statistical analyses with differential privacy guarantees. Users familiar with Pandas will find minimal difficulty adjusting to the API methods.
To use AGENT’s op_pandas, you can import the library as presented in the following code block:
%%ag
from op_pandas import PrivateDataFrame, PrivateSeries
Private datasets can be loaded as PrivateDataFrames and PrivateSeries.
The following are additional resources that will be helpful when using the Pandas library:
- Official Pandas Documentation
- API Reference: In addition to the comprehensive guide, you can also check reference pages for PrivateDataFrame, PrivateSeries, General Methods
Importing Data
The op_pandas library allows users to import datasets efficiently. This page will showcase some of the available ways to do so.
Before continuing with the steps described on this page, make sure you have finished the First Steps to Use AGENT guide.
Object Creation
Users can import and load datasets from different sources, such as:
Importing from the AGENT Jupyter Server
The load_dataset() function lets users obtain a dataset and the required data structures from the Antigranular server.
Private data structures cannot be exported to the local environment unless a differentially private measure is applied to obtain a non-private data frame.
You can use the load_dataset() function to load any dataset, as shown in the following code block:
%%ag
from op_pandas import PrivateDataFrame, PrivateSeries
# Obtaining the dictionary containing private objects
response = load_dataset("<dataset_name>", "<team_name>")
# The response is a PrivateDataFrame (PDF) and uses the budget allocated to the user from the "<team_name>" team.
Importing from a pandas.Series
When creating a PrivateSeries, it's recommended to set metadata bounds to define the range of valid values for the series. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.
See an example in the following code block:
%%ag
import pandas as pd
s = pd.Series([1,5,8,2,9] , name='Test_series')
priv_s = PrivateSeries(series=s,metadata=(0,10))
Where:
- A pandas Series named 'Test_series' is created with the values [1, 5, 8, 2, 9].
- Using the PrivateSeries constructor, we create a private series (priv_s) from the regular pandas Series s.
- Metadata bounds (0, 10) are set, ensuring that all values in the series fall within the range from 0 to 10.
By setting metadata bounds, you control the valid range of values within the series, enhancing privacy and security while working with sensitive data.
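The effect of metadata bounds can be pictured with plain pandas. The sketch below is not op_pandas code. It assumes that bounds act as a clipping range (the clip call visible in the warning output later in this guide suggests this) and that omitted bounds fall back to the observed minimum and maximum, as described above. The helper name resolve_bounds is hypothetical.

```python
import pandas as pd


def resolve_bounds(series, bounds=None):
    """Hypothetical helper: use explicit bounds, else fall back to the observed min/max."""
    if bounds is not None:
        return bounds
    return (series.min(), series.max())


s = pd.Series([1, 5, 8, 2, 9], name="Test_series")

# Explicit bounds, as in the example above: every value is forced into [0, 10]
lower, upper = resolve_bounds(s, (0, 10))
clipped = s.clip(lower=lower, upper=upper)

# No bounds given: the range is taken from the data itself
inferred = resolve_bounds(s)

print(clipped.tolist())
print(inferred)
```

Because inferred bounds depend on the records themselves, explicit bounds chosen from domain knowledge are generally the safer option.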
Importing from a pandas.DataFrame
Just as with PrivateSeries, setting metadata bounds when creating a PrivateDataFrame is recommended. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in each column.
See an example in the following code block:
%%ag
import pandas as pd
data = {
'Age':[20,30,40,25,30,25,26,27,28,29],
'Salary':[35000,60000,100000,55000, 35000,60000,100000,55000,35000,60000],
'Sex':['M','F','M','F', 'M','F','M','F', 'M', 'F']
}
metadata = {
'Age':(18,65),
'Salary':(20000,200000)
}
categorical_metadata = {
'Sex':['M','F']
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df , metadata=metadata, categorical_metadata=categorical_metadata)
In the example:
- Data for the DataFrame is defined, including columns for 'Age', 'Salary', and 'Sex'.
- Metadata bounds are specified for each column:
  - For the 'Age' column, the valid range is set from 18 to 65.
  - For the 'Salary' column, the valid range is set from 20000 to 200000.
  - Categorical metadata is specified for the 'Sex' column.
- A pandas DataFrame (df) is created using the defined data.
- Using the PrivateDataFrame constructor, a PrivateDataFrame (priv_df) is created from the pandas.DataFrame df, with the specified metadata bounds.
By setting metadata bounds, users can ensure that each column in the dataframe contains values within predefined limits, enhancing data integrity and security.
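As an illustration of what per-column bounds and categorical metadata mean, here is a plain pandas sketch. It is not the op_pandas implementation: it assumes numeric bounds are enforced by clipping and that categorical metadata lists the allowed values. The sample rows are made up for the illustration.

```python
import pandas as pd

# Made-up rows, two of which fall outside the declared bounds
df = pd.DataFrame({
    "Age": [20, 30, 70, 25],
    "Salary": [35000, 60000, 250000, 15000],
    "Sex": ["M", "F", "M", "F"],
})
metadata = {"Age": (18, 65), "Salary": (20000, 200000)}
categorical_metadata = {"Sex": ["M", "F"]}

# Assumption: numeric columns are clipped into their declared range
bounded = df.copy()
for col, (lower, upper) in metadata.items():
    bounded[col] = bounded[col].clip(lower=lower, upper=upper)

# Assumption: categorical columns may only contain the listed values
all_valid = all(
    bounded[col].isin(allowed).all()
    for col, allowed in categorical_metadata.items()
)

print(bounded)
print(all_valid)
```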
Importing from the local Jupyter session
Users can import external data from their local Jupyter session within the AG environment. This allows seamless data integration into the AG environment while maintaining privacy and security.
See an example below:
- Random data is generated to create two pandas DataFrames, df and df_2, representing different datasets.
import pandas as pd
import numpy as np
import string
import random
# Generate random names, ages, and salaries for the DataFrame
arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
    res = ''.join(random.choices(string.ascii_lowercase, k=N))
    arr_name.append(res)
# Create a DataFrame with random data
df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})
# Import the DataFrame 'df' into the AG environment with the name 'imported_df'
session.private_import(data=df, name='imported_df')
Now the second dataset with NaNs is created:
# Randomly distributing NaNs in two columns with a probability of 0.5
choice = [1, 2, np.nan]
a = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
b = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
# Create a DataFrame 'df_2' with random data and NaNs
df_2 = pd.DataFrame({'a': a, 'b': b})
# Import the DataFrame 'df_2' into the AG environment with the name 'imported_df_2'
session.private_import(data=df_2, name="imported_df_2")
The private_import function imports these DataFrames into the AG environment with the specified names (imported_df and imported_df_2).
- Next, metadata bounds are defined for the 'age' and 'salary' columns of the DataFrame imported_df, and a PrivateDataFrame priv_df is created from imported_df, ensuring that the data remains private and secure within the AG environment.
# Create a PrivateDataFrame 'priv_df' from the imported DataFrame 'imported_df'
metadata = {
'age': (0, 80), # Define metadata bounds for the 'age' column
'salary': (1, 200000) # Define metadata bounds for the 'salary' column
}
priv_df = PrivateDataFrame(imported_df, metadata=metadata)
By leveraging the private_import function and creating PrivateDataFrames, users can seamlessly work with external data while maintaining privacy.
Access the Managing Data page to continue following the op_pandas guide.
Managing Data
This section of the op_pandas guide addresses data management tasks.
Viewing Data
Records in PrivateDataFrame and PrivateSeries cannot be viewed directly to protect privacy. However, users can still analyze and obtain statistical information about the data using methods that offer differential privacy guarantees.
Printing details about the data
Inspect PrivateDataFrame structure using ag_print
To print details about the data, such as columns, metadata, and data types, users can call the ag_print function inside the AG environment. This gives a quick view of the data's structure and supports analysis and exploration.
See the following example:
%%ag
ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)
When executed:
>>> Columns:
Index(['name', 'age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
name object
age int64
salary int64
dtype: object
Generating quick statistics
Generate differentially private statistical summaries using describe() method with epsilon budget
Users can obtain quick statistics about their dataset using the describe() method. By spending some epsilon, they can get an idea of the statistical details of the dataset.
See the following example:
%%ag
priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)
When executed:
>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075
Statistics can be viewed by exporting the non-private result to the local Jupyter server:
%%ag
export(priv_describe, name='priv_describe')
Setting up exported variable in local environment: priv_describe
print(priv_describe)
>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075
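The summary above is noisy by design, which is likely why the count reads 10011 rather than 10000. This guide does not describe the mechanism op_pandas uses internally, so the following is only a textbook sketch of the Laplace mechanism for a count and a bounded mean, written with NumPy. The function names and the even split of epsilon are assumptions made for the illustration.

```python
import numpy as np


def dp_count(values, eps, rng):
    """Noisy count: adding or removing one record changes the count by at most 1."""
    return len(values) + rng.laplace(scale=1.0 / eps)


def dp_bounded_sum(values, lower, upper, eps, rng):
    """Noisy sum of values clipped to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    # One record can shift the clipped sum by at most the largest absolute bound
    sensitivity = max(abs(lower), abs(upper))
    return clipped.sum() + rng.laplace(scale=sensitivity / eps)


rng = np.random.default_rng(0)
ages = rng.integers(0, 80, size=10_000)

# Assumption: split the total budget evenly between the count and the sum
eps = 1.0
noisy_count = dp_count(ages, eps / 2, rng)
noisy_sum = dp_bounded_sum(ages, 0, 80, eps / 2, rng)
noisy_mean = noisy_sum / noisy_count

print(round(noisy_count), round(noisy_mean, 2))
```

The bounds matter here: the wider the declared range, the larger the noise needed for the same epsilon.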
Cleaning Data
Users can use the dropna method to remove any records that contain NaN values in any of their features.
%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})
export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')
>>> Setting up exported variable in local environment: result
/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._df[col].clip(lower=self._metadata[col][0], upper=self._metadata[col][1], inplace=True)
print(result)
>>>
a b
count 2490.000000 2490.000000
mean 1.495172 1.502561
std 0.496467 0.483317
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.264635 1.668942
75% 1.985666 1.975653
max 1.997052 1.993214
print(df_2.dropna(axis=0).describe())
>>>
a b
count 2490.000000 2490.000000
mean 1.492369 1.503213
std 0.500042 0.500090
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 2.000000
75% 2.000000 2.000000
max 2.000000 2.000000
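The expected count of roughly 2500 follows from the NaN probabilities: a row survives dropna only if both columns are non-NaN, which happens with probability 0.5 * 0.5 = 0.25. The plain pandas sketch below checks this arithmetic on freshly generated local data. The seed and variable names are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 10_000
choice = [1, 2, np.nan]
probabilities = [0.25, 0.25, 0.5]  # NaN is drawn with probability 0.5

df_check = pd.DataFrame({
    "a": rng.choice(choice, n_rows, p=probabilities),
    "b": rng.choice(choice, n_rows, p=probabilities),
})

# A row is kept only when neither column is NaN
p_keep = (1 - 0.5) * (1 - 0.5)
expected_rows = n_rows * p_keep
actual_rows = len(df_check.dropna(axis=0))

print(expected_rows, actual_rows)
```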
Columns can also be dropped from a PrivateDataFrame using the drop method:
%%ag
ag_print("Columns before: ", priv_df.columns)
temp_df = priv_df.drop(['name'], inplace = False)
ag_print("Columns After Dropping name: ", temp_df.columns)
Response:
>>> Columns before: Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name: Index(['age', 'salary'], dtype='object')
Selecting Data
Setting values that affect a particular set of records is not allowed in PrivateDataFrame or PrivateSeries. However, transformation functions can be applied using PrivateDataFrame.applymap or PrivateSeries.map.
Getting Data Objects from a single column
Select individual columns from PrivateDataFrame to create PrivateSeries objects with preserved metadata
To select a single column from a PrivateDataFrame and obtain a PrivateSeries, equivalent to df["A"] in pandas, use the following approach:
%%ag
# Select the 'age' column from the PrivateDataFrame 'priv_df' and obtain a PrivateSeries
priv_s = priv_df['age']
# Print quick statistics and metadata of the PrivateSeries 'priv_s'
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)
When executed:
>>> Describe:
count 9988.000000
mean 39.472473
std 22.735391
min 0.000000
25% 21.384668
50% 37.840602
75% 55.273188
max 78.730881
Name: series, dtype: float64
Metadata:
(0, 80)
In this example:
- The priv_df['age'] syntax is used to select the 'age' column from the PrivateDataFrame priv_df, resulting in a PrivateSeries priv_s.
- The describe() method is applied to the PrivateSeries priv_s with an epsilon value of 1 to obtain quick statistics about the data.
- The output displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for the 'age' column.
- The metadata bounds for the 'age' column are also printed, indicating that valid values range from 0 to 80.
Getting data objects from various columns
Select multiple columns from PrivateDataFrame to create filtered PrivateDataFrame subsets
To select a collection of columns from a PrivateDataFrame and obtain a new PrivateDataFrame, use the following approach:
%%ag
# Select the 'age' and 'salary' columns from the PrivateDataFrame 'priv_df' and obtain a PrivateDataFrame 'priv_df_filtered'
priv_df_filtered = priv_df[['age', 'salary']]
# Print the columns and metadata of the PrivateDataFrame 'priv_df_filtered'
ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)
When executed:
>>> Columns:
Index(['age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
In this example:
- The priv_df[['age', 'salary']] syntax is used to select the 'age' and 'salary' columns from the PrivateDataFrame priv_df, resulting in a PrivateDataFrame priv_df_filtered.
- The columns attribute of the PrivateDataFrame priv_df_filtered is printed to display the selected columns.
- The metadata attribute of the PrivateDataFrame priv_df_filtered is printed to show the metadata bounds for each selected column.
Applying transformation functions
Transform PrivateDataFrame and PrivateSeries data using applymap and map functions with custom transformation logic
To use applymap on a PrivateDataFrame, use the following approach:
%%ag
# Define a transformation function 'func' to map strings to their lengths and numerical values to their halves
def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0
# Apply the transformation function 'func' to each element of the PrivateDataFrame 'priv_df' using applymap
result = priv_df.applymap(func, eps=1)
# Print the metadata and quick statistics of the resulting PrivateDataFrame 'result'
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps=1))
When executed:
>>> Metadata:
{'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}
Describe:
name age salary
count 10001.000000 10001.000000 10001.000000
mean 9.979915 19.763231 25068.695612
std 0.274668 10.477926 16333.537130
min 8.000000 0.000000 686.474307
25% 10.000000 8.740750 12489.945476
50% 10.000000 18.864365 24417.320448
75% 10.000000 28.283437 37265.078587
max 10.000000 35.126989 60705.413286
In this example:
- The func function is defined to map strings to their lengths and numerical values to their halves.
- The applymap() function is used to apply the transformation function func to each element of the PrivateDataFrame priv_df.
- The resulting PrivateDataFrame result contains transformed values.
- The metadata of the resulting PrivateDataFrame result is printed, showing the updated metadata bounds.
- Quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the PrivateDataFrame result are printed, providing insights into the transformed data.
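For comparison, the same element-wise transformation can be tried on an ordinary pandas DataFrame, where the values are visible. This is a plain pandas sketch with made-up rows, not op_pandas code. In op_pandas the call also takes eps, and the metadata printed above is not simply the original bounds halved, so the library evidently derives the new bounds in its own way, which this sketch does not attempt to reproduce.

```python
import numbers

import pandas as pd


def func(x):
    """Map strings to their length and numbers to half their value."""
    if isinstance(x, str):
        return float(len(x))
    if isinstance(x, numbers.Real):
        return x / 2
    return 0.0


# Made-up, non-private rows for illustration
df_plain = pd.DataFrame({
    "name": ["abcdefghij", "klmnopqrst"],
    "age": [40, 64],
    "salary": [1000, 200000],
})

# Apply func to every element, one column at a time
transformed = df_plain.apply(lambda col: col.map(func))

print(transformed)
```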
To transform a PrivateSeries, use map. The mapping can be done using a dictionary for 1:1 mapping or via a callable:
%%ag
# Define a mapping function 'series_map' to halve numerical values
def series_map(x: int) -> float:
    return x / 2
# Apply the mapping function 'series_map' to the 'age' column of the PrivateDataFrame 'priv_df'
priv_df['age'] = priv_df['age'].map(series_map, eps=1)
# Print the metadata of the updated 'age' column in 'priv_df'
ag_print("Metadata:\n", priv_df.metadata)
When executed:
>>> Metadata:
{'age': (0.0, 64.0), 'salary': (1, 200000)}
In this example:
- The series_map function is defined to halve numerical values.
- The map() function is used to apply the mapping function series_map to the 'age' column of the PrivateDataFrame priv_df.
- The metadata of the 'age' column in the updated PrivateDataFrame priv_df is printed, showing the updated metadata bounds.
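The two mapping styles mentioned above, a dictionary for 1:1 mapping and a callable, behave as follows on an ordinary pandas Series. This is a plain pandas sketch with made-up values. The op_pandas version additionally takes eps and is not reproduced here.

```python
import pandas as pd

sex = pd.Series(["M", "F", "M"])
age = pd.Series([20, 30, 40])

# Dictionary mapping: each value is replaced by its dictionary entry
sex_codes = sex.map({"M": 0, "F": 1})

# Callable mapping: the function is applied to each element
half_age = age.map(lambda x: x / 2)

print(sex_codes.tolist())
print(half_age.tolist())
```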
See the Operations page to continue following the op_pandas guide.
Operations
This section continues the op_pandas guide and covers some of the operations you can perform on PrivateDataFrame and PrivateSeries objects.
Unary Ops
Users can perform unary operations such as ~, -, +, and abs on PrivateDataFrames and PrivateSeries. These operations apply element-wise to the data.
| Operator | Description |
|---|---|
| ~ | The ~ operator performs the bitwise negation operation. |
| - | The - operator performs the arithmetic negation operation. |
| + | The + operator performs the unary plus operation, which leaves each element unchanged. |
| abs() | The abs() function calculates the absolute value of each element. |
See the following example:
%%ag
# Export the quick statistics of the original PrivateDataFrame 'priv_df_2' and its negative counterpart
export(priv_df_2.describe(eps=2), 'original')
export((-priv_df_2).describe(eps=2), 'negative')
When executed:
>>>
Setting up exported variable in local environment: original
Setting up exported variable in local environment: negative
# Rename columns of the negative DataFrame for clarity
negative.columns = ["a_neg", "b_neg"]
# Join the original and negative DataFrames and print the result
print(original.join(negative, how="left"))
Output:
>>>
a b a_neg b_neg
count 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.498597 1.504388 -1.500690 -1.503697
std 0.494197 0.498858 0.498402 0.499788
min 1.000000 1.000000 -1.000000 -1.000000
25% 1.000000 1.000000 -1.001932 -1.001540
50% 1.635791 1.891447 -1.678784 -1.161680
75% 1.991538 1.997409 -1.996417 -1.996500
max 1.992424 1.997140 -1.999776 -1.999894
Where:
- The quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the original PrivateDataFrame priv_df_2 and its negative counterpart are exported to the local environment.
- The negative DataFrame is created by applying the unary - operator to the original PrivateDataFrame priv_df_2.
- The columns of the negative DataFrame are renamed for clarity.
- The original and negative DataFrames are joined together, and the result is printed, showing the element-wise application of the unary - operator.
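The element-wise behaviour of the four unary operations can be checked on ordinary pandas objects, where individual values are visible. This is a plain pandas sketch with made-up values. The last two lines show how a bound pair would naturally flip under negation, which is an assumption about bound tracking rather than documented op_pandas behaviour.

```python
import pandas as pd

df_plain = pd.DataFrame({"a": [1, -2, 2], "b": [2, 1, -1]})
flags = pd.Series([True, False, True])

negated = -df_plain      # arithmetic negation of every element
unchanged = +df_plain    # unary plus leaves every element as it is
absolute = abs(df_plain) # absolute value of every element
inverted = ~flags        # element-wise inversion of boolean values

# Assumption: negating data bounded by (lower, upper) gives bounds (-upper, -lower)
bounds = (1, 2)
negated_bounds = (-bounds[1], -bounds[0])

print(negated["a"].tolist(), absolute["a"].tolist(), inverted.tolist())
print(negated_bounds)
```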
Binary Ops
Users can apply binary operations using scalars and PrivateDataFrames against PrivateDataFrames. See the example below:
%%ag
# Select the 'age' and 'salary' columns from the PrivateDataFrame 'priv_df' and obtain a PrivateDataFrame 'pdf'
pdf = priv_df[['age', 'salary']]
# Perform binary operations on 'pdf' with a mix of scalars and 'pdf' itself
result1 = pdf + (10 * pdf) # Expected min-max: Age: (0, 704), Salary: (11, 2200000)
result2 = result1 / 1000 # Expected min-max: Age: (0, 0.704), Salary: (0.011, 2200)
# Print the metadata of the resulting PrivateDataFrames 'result1' and 'result2'
ag_print("Result1 metadata: \n", result1.metadata)
ag_print("Result2 metadata: \n", result2.metadata)
When executed:
>>>
Result1 metadata:
{'age': (0.0, 704.0), 'salary': (11, 2200000)}
Result2 metadata:
{'age': (0.0, 0.704), 'salary': (0.011, 2200.0)}
In this example:
- The 'age' and 'salary' columns are selected from the PrivateDataFrame priv_df to create a new PrivateDataFrame pdf.
- Binary operations are performed on pdf using a mix of scalars and pdf itself.
- result1 is obtained by adding pdf to 10 times pdf, and result2 is obtained by dividing result1 by 1000.
- The metadata of the resulting PrivateDataFrames result1 and result2 is printed, showing the updated metadata bounds after the binary operations.
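The metadata printed above can be reproduced by hand with simple interval arithmetic. The sketch below is an illustration in plain Python, not the op_pandas implementation, and the helper names add, multiply and divide are hypothetical. It starts from the bounds at this point in the guide: 'age' is (0.0, 64.0) after the earlier map step and 'salary' is (1, 200000).

```python
def add(b1, b2):
    """Interval sum: lower bounds add up, upper bounds add up."""
    return (b1[0] + b2[0], b1[1] + b2[1])


def multiply(bounds, k):
    """Scale an interval by a scalar; min/max keeps the order right for negative k."""
    lo, hi = bounds[0] * k, bounds[1] * k
    return (min(lo, hi), max(lo, hi))


def divide(bounds, k):
    """Divide an interval by a non-zero scalar."""
    lo, hi = bounds[0] / k, bounds[1] / k
    return (min(lo, hi), max(lo, hi))


metadata = {"age": (0.0, 64.0), "salary": (1, 200000)}

# result1 = pdf + (10 * pdf)
result1 = {col: add(b, multiply(b, 10)) for col, b in metadata.items()}
# result2 = result1 / 1000
result2 = {col: divide(b, 1000) for col, b in result1.items()}

print(result1)
print(result2)
```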
Bitwise Ops
Users can apply bitwise operations using scalars and PrivateDataFrames against PrivateDataFrames. These operations apply element-wise to the data.
See the following example:
%%ag
import numpy as np
import pandas as pd
# Create two PrivateSeries with randomly sampled integer data containing values in the range (0,1)
priv_ser_1 = PrivateSeries(pd.Series(np.random.randint(0, 2, 10000)), metadata=(0, 1))
priv_ser_2 = PrivateSeries(pd.Series(np.random.randint(0, 2, 10000)), metadata=(0, 1))
# Print the description of the first PrivateSeries
ag_print("Describe of private Series 1: \n", priv_ser_1.describe(eps=1))
# Print the description of the second PrivateSeries
ag_print("Describe of private Series 2: \n", priv_ser_2.describe(eps=1))
# Apply the bitwise AND operation between priv_ser_1 and priv_ser_2 and store the result in 'result'
result = priv_ser_1 & priv_ser_2
# Print the description of the resulting PrivateSeries
ag_print("Describe of the result: \n", result.describe(eps=1))
When executed:
>>>
Describe of private Series 1:
count 9.998000e+03
mean 1.571300e-03
std 1.998231e-02
min 0.000000e+00
25% 4.656613e-10
50% 4.656613e-10
75% 4.656613e-10
max 4.656613e-10
Name: series, dtype: float64
Describe of private Series 2:
count 1.000500e+04
mean 5.608570e-04
std 4.612496e-02
min 0.000000e+00
25% 4.656613e-10
50% 4.656613e-10
75% 4.656613e-10
max 4.656613e-10
Name: series, dtype: float64
Describe of the result:
count 1.000700e+04
mean 3.952059e-04
std 2.277582e-02
min 0.000000e+00
25% 4.656613e-10
50% 4.656613e-10
75% 4.656613e-10
max 4.656613e-10
Name: series, dtype: float64
In this example:
- Two PrivateSeries, priv_ser_1 and priv_ser_2, are created with randomly sampled integer data containing values in the range (0,1).
- The descriptions of both PrivateSeries are printed, displaying the count, mean, std, min, 25%, 50%, 75%, and max values.
- The bitwise AND operation (&) is applied between priv_ser_1 and priv_ser_2, and the result is stored in result.
- The description of the resulting PrivateSeries result is printed, showing the statistics of the element-wise bitwise AND operation.
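To see what the element-wise AND does to 0/1 data, the same operation can be run on ordinary pandas Series. This is a plain pandas sketch with an arbitrary seed, not op_pandas code. For values restricted to 0 and 1, the AND is 1 only where both inputs are 1, so for two independent, evenly split series about a quarter of the results are expected to be 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two series of independent random bits (0 or 1)
ser_1 = pd.Series(rng.integers(0, 2, 10_000))
ser_2 = pd.Series(rng.integers(0, 2, 10_000))

# Element-wise bitwise AND
both = ser_1 & ser_2

# For 0/1 values, AND coincides with element-wise multiplication
same_as_product = bool((both == ser_1 * ser_2).all())

print(both.mean(), same_as_product)
```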
See the Functions, Joins and Statistical Methods page to continue following the op_pandas guide.
Functions, Joins and Statistical Methods
This section continues the op_pandas guide, addressing functions, joins and statistical methods.
General Functions
op_pandas comes packaged with some useful functions such as: