Pandas

Overview

op_pandas is AGENT’s implementation of the pandas library. It lets you import datasets and work with their data efficiently.

Navigate to the following reference sections for the API:

PrivateDataFrame

Differentially private DataFrame implementation with methods for data manipulation, statistical analysis, grouping, and transformations.

PrivateSeries

Privacy-preserving Series API for handling one-dimensional labeled arrays with differential privacy guarantees for statistical operations.

General Methods

Common utility functions and data preprocessing methods for op_pandas.

PrivateDataFrame

The PrivateDataFrame API is based on pandas.DataFrame, but in this case, all the methods are differentially private.

Constructor

The constructor for a PrivateDataFrame is as follows:

Constructor:

class op_pandas.PrivateDataFrame(
    df : pandas.DataFrame,
    metadata = None,
    categorical_metadata = None
)

Parameters:

df: pandas.DataFrame

A pandas DataFrame, with data consisting of only strings, integers, floats, booleans, and datetime objects.

metadata: Dict[str, Tuple[float, float]]

Metadata containing the bounds of the given DataFrame. The metadata should be a dictionary that maps each column name to its (lower, upper) bounds. Only columns that contain purely numerical data should appear as keys.

{ 'Age': (18,65), 'Salary': (10000, 200000), 'Gender': (0,1) }

categorical_metadata: Dict[str, List]

Metadata containing information about the categorical data of the given DataFrame. The categorical_metadata should be a dictionary with column names as keys mapped to a list containing all the categories in the column. The data types for all the elements in the list must be identical.

{ 'Income' : [">50k", "<=50k"], 'Sex': ["M", "F"] }
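
As a minimal sketch of how the two dictionaries fit together, the helper below checks the shapes described above before the constructor is called. The name check_metadata is hypothetical and used only for illustration; it is not part of op_pandas.

```python
from typing import Dict, List, Tuple


def check_metadata(
    metadata: Dict[str, Tuple[float, float]],
    categorical_metadata: Dict[str, List],
) -> None:
    """Hypothetical pre-flight check; not part of op_pandas."""
    for column, bounds in metadata.items():
        low, high = bounds  # each entry is a (lower, upper) pair
        if low > high:
            raise ValueError(f"{column}: lower bound exceeds upper bound")
    for column, categories in categorical_metadata.items():
        # All categories in one column must share a single data type.
        if len({type(c) for c in categories}) > 1:
            raise ValueError(f"{column}: mixed category types")


metadata = {"Age": (18, 65), "Salary": (10000, 200000), "Gender": (0, 1)}
categorical_metadata = {"Income": [">50k", "<=50k"], "Sex": ["M", "F"]}
check_metadata(metadata, categorical_metadata)  # passes silently
```

A reversed pair such as (65, 18) or a category list such as ["M", 1] would raise a ValueError in this sketch.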

General Functions

`applymap`

The applymap() method allows you to apply one or more functions to the DataFrame object, enabling the modification of each element independently.

Function Signature:

PrivateDataFrame.applymap(
    func,
    eps = 0,
    output_bounds = None,
    categorical_output_bounds = None
) -> PrivateDataFrame

Parameters:

func: callable

A Python function that returns a single value from a single value. It must meet the following constraints:

- func can take only one argument: the individual element to which the function is applied.
- The function must carry appropriate type annotations. To annotate with datetime or regex types, import datetime and re.

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0. It’s used to calculate bounds.

output_bounds: Dict[str, Tuple[float, float]]

The output bounds, if already known. Providing them avoids spending epsilon on estimating the bounds of the applied function's output.

categorical_output_bounds: Dict[str, List]

The categorical output bounds (if already known). If categorical output bounds for a specific column are not given, it will be calculated automatically using the function provided.

Returns:

PrivateDataFrame: A new DataFrame with the function applied to each element.
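
The sketch below shows what a function meeting these constraints might look like, and how known output bounds can be derived by hand when the function is increasing. It uses only the standard library; the column name Age and its bounds are taken from the constructor example, and nothing here calls op_pandas itself.

```python
import inspect


def double(x: float) -> float:
    """Takes exactly one element and returns exactly one value."""
    return 2.0 * x


# The function is increasing, so input bounds map directly to output bounds.
input_bounds = {"Age": (18, 65)}
output_bounds = {
    col: (double(low), double(high)) for col, (low, high) in input_bounds.items()
}

signature = inspect.signature(double)
print(len(signature.parameters))    # number of arguments the function accepts
print(signature.return_annotation)  # annotated return type
print(output_bounds)
```

Bounds computed this way could be passed as output_bounds so that no epsilon is spent on estimating them.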

`all`

The all() method returns whether all elements are True, potentially over an axis.

Function Signature:

PrivateDataFrame.all(
    axis: int = 0,
    bool_only: bool = False,
    skipna: bool = False
) -> PrivateSeries:

Parameters:

axis: int, default 0

The axis to use. 0 is for rows, and 1 is for columns.

bool_only: bool, default False

Include only boolean columns. If False, all columns are included.

skipna: bool, default False

Exclude NA/null values when computing the result.

Returns:

PrivateSeries: A Series indicating whether all elements along a specified axis are True.

`categorical_metadata`

This property returns the metadata of the categorical columns in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.categorical_metadata -> dict

Returns:

dict: A dictionary containing metadata about the categorical columns in the DataFrame.

`columns`

This property returns the column names of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.columns -> list

Example:

>> priv_df.columns

['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
    'marital-status', 'occupation', 'relationship', 'race', 'gender',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']

Returns:

list: A list containing the names of the columns in the DataFrame.

`copy`

This method returns a copy of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.copy() -> PrivateDataFrame

Returns:

PrivateDataFrame: A new instance of PrivateDataFrame that is a copy of the original.

`describe`

The describe() method returns a statistical description of the data in the DataFrame, using differentially private calculations.

Function Signature:

PrivateDataFrame.describe(
    eps,
    percentiles = None,
    include = None,
    exclude = None
)-> pandas.DataFrame

Parameters:

eps: float

The epsilon provided to the differentially private calculation. eps must be >=0.

percentiles: list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include: ‘all’, list-like of dtypes or None (default), optional

all: All input columns will be included in the output.
A list-like of dtypes: Limits the results to the provided data types.
- To limit the result to numeric types, submit numpy.number.
- To limit the list to object columns submit the numpy.object data type.
- Strings can also be used in the select_dtypes style.
- To select pandas categorical columns, use category.
None (default): The result will include all numeric columns.

exclude: list-like of dtypes or None (default), optional

A list-like of dtypes : Excludes the provided data types from the result.
- To exclude numeric types submit numpy.number.
- To exclude object columns submit the data type numpy.object.
- Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])).
- To exclude pandas’ categorical columns, use category.
None (default): No result will be excluded.

Returns:

pandas.DataFrame: A DataFrame object with the statistical description of the DataFrame’s columns, adjusted for privacy concerns.

`drop`

The drop() method removes the specified row or column from the PrivateDataFrame.

Function Signature:

PrivateDataFrame.drop(
    columns=None,
    inplace=True,
    errors='raise'
)

Parameters:

columns: single label or list-like

Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

inplace: boolean

Whether to operate in place on the data.

errors: {‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, errors are suppressed and only existing labels are dropped.

`dropna`

The dropna() method removes the rows that contain NULL values from the PrivateDataFrame.

Function Signature:

PrivateDataFrame.dropna(
    axis=0,
    how=_NoDefault.no_default,
    thresh=_NoDefault.no_default
)

Parameters:

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

how: str {‘any’, ‘all’}, default ‘any’

Determines whether a row or column is removed when it has at least one NA value or only NA values.

- any: If any NA values are present, drop that row or column.
- all: If all values are NA, drop that row or column.

thresh: int, optional

The minimum number of non-NA values a row or column must have to be kept. It cannot be combined with how.
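
The difference between how and thresh is easiest to see on a small frame. The sketch below uses plain pandas, on the assumption that PrivateDataFrame.dropna follows the same semantics as pandas.DataFrame.dropna; this document does not confirm that.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, None], "b": [2.0, 3.0, None]})

any_dropped = df.dropna(how="any")  # drop rows with at least one NA
all_dropped = df.dropna(how="all")  # drop rows where every value is NA
thresh_kept = df.dropna(thresh=2)   # keep rows with at least 2 non-NA values

print(len(any_dropped), len(all_dropped), len(thresh_kept))
```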

`dtypes`

The dtypes property returns the data type of each column in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.dtypes

`fillna`

The fillna() method is used to replace missing values (NaNs) in a PrivateDataFrame. This method provides various options for filling missing data, either by specifying a static value or by using a method like 'forward fill' or 'backward fill'.

Function Signature:

PrivateDataFrame.fillna(
    value=None,
    limit: int = None,
    method=None,
    inplace: bool = False
):

Parameters:

value : scalar, dict, Series, DataFrame, or None, default None

The value used to fill missing entries. It can be a scalar, dictionary, Series, or DataFrame, providing great flexibility in how replacements are handled. If value is None and method is specified, it will perform the specified method of filling.

limit : int, optional

The maximum number of consecutive NaN values to forward/backward fill. The limit applies to the number of filled values.

method : {'backfill', 'bfill', 'pad', 'ffill', or None}, optional

The method to use when filling holes in reindexed Series:

'pad' or 'ffill': propagate last valid observation forward to next valid
'backfill' or 'bfill': use NEXT valid observation to fill gap

inplace : bool, default False

If True, fill in-place. Note that this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

Returns:

PrivateDataFrame or None: Depending on the value of inplace, it either returns a new DataFrame with missing values filled or modifies the original DataFrame and returns None.
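
The fill strategies can be compared side by side with plain pandas. This is only an illustration of the semantics described above: it assumes op_pandas mirrors pandas here, and it uses ffill() and bfill(), which are the pandas equivalents of method='ffill' and method='bfill'.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

constant = s.fillna(0.0)      # static value
forward = s.ffill()           # same idea as method='ffill' / 'pad'
forward_1 = s.ffill(limit=1)  # fill at most one consecutive NaN
backward = s.bfill()          # same idea as method='bfill' / 'backfill'

print(constant.tolist())
print(forward.tolist())
print(forward_1.tolist())
print(backward.tolist())
```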

`groupby`

The groupby() method on a PrivateDataFrame is crucial for data analysis, allowing data to be grouped based on specific criteria and operations like sum, mean, and count to be executed on these groups.

Function Signature:

PrivateDataFrame.groupby(
    by=None,
    sort: bool=True,
    dropna: bool=True
) -> PrivateDataFrameGroupby

Parameters:

by : str | List | pd.Series | op_pandas.PrivateSeries

Determines the groups for the groupby operation. Options include:

Column: Group by one or more categorical columns. The columns should be specified in the categorical_metadata.
Boolean Series / PrivateSeries: A series of boolean values. Non-boolean series will be converted to boolean before grouping.
List: A combination of column names and series.

sort : bool, default True

Controls whether the group keys are sorted. If set to False, the groups will appear in the order they are found in the original DataFrame.

dropna : bool, default True

If True, rows with NA values in the group keys are dropped. If False, NA values are included as a group key.

Allowed Operations:

After grouping, the following operations can be applied to compute statistics for each group:

Operation	Description
sum	Calculate the sum of group values.
mean	Compute the average of group values.
std	Standard deviation of the group values.
var	Variance of the group values.
count	Count of non-NA cells for each group.
quantile	Compute quantiles for each group.
median	Median of the group values.
percentile	Specific percentiles of group values.

Returns:

PrivateDataFrameGroupby: A specialized view of the DataFrame that supports further operations specific to groups.

Usage:

import op_pandas as opd

# Create a PrivateDataFrame with metadata
pdf = opd.PrivateDataFrame(
    df,
    metadata={"age": (0,100)},
    categorical_metadata={"groups": ['a', 'b', 'c']}
)

# Group the data by 'groups' column
grouped = pdf.groupby("groups")

# Print the sum of the 'age' column for each group, ensuring differential privacy
print(grouped.sum(eps=1))

Output Example:

>>>              age
    a  837986.678085
    b  817237.487139
    c  827334.458893

This example demonstrates how to group data by categories and apply a privacy-preserving sum operation, providing insights into the dataset while maintaining the confidentiality of the data.
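
To give an intuition for why the metadata bounds matter for a private sum, here is a minimal standard-library sketch of one common approach: clip each value to its bounds, sum per group, then add Laplace noise scaled by the bounds and epsilon. This document does not state which mechanism op_pandas uses, so the Laplace mechanism and the helper names below are assumptions made for illustration.

```python
import random
from collections import defaultdict


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two exponential draws is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_group_sum(rows, bounds, eps, rng):
    low, high = bounds
    sums = defaultdict(float)
    for group, value in rows:
        sums[group] += min(max(value, low), high)  # clip to the declared bounds
    # One row can change a sum by at most max(|low|, |high|).
    scale = max(abs(low), abs(high)) / eps
    return {g: total + laplace_noise(scale, rng) for g, total in sums.items()}


rows = [("a", 31), ("b", 47), ("a", 250), ("c", 19)]  # 250 is clipped to 100
result = noisy_group_sum(rows, bounds=(0, 100), eps=1.0, rng=random.Random(0))
print(result)
```

In this sketch a larger eps gives less noise, and tighter bounds reduce the noise needed for the same eps.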

`info`

The info() method provides a concise summary of a PrivateDataFrame, detailing attributes like column names, their data types, and additional metadata concerning bounds and categorical distinctions.

Usage:

private_df.info()

`isnull`

The isnull() method detects missing values for an array-like object.

Function Signature:

PrivateDataFrame.isnull() -> PrivateDataFrame:

`isna`

The isna() method detects missing values for an array-like object.

Function Signature:

PrivateDataFrame.isna() -> PrivateDataFrame:

`isin`

The isin() method checks if each element in a DataFrame is contained in the specified values.

Function Signature:

PrivateDataFrame.isin(values):

Parameters:

values: PrivateDataFrame

The result will only be True at a location if all the labels match.

`join`

The join() method inserts columns from another DataFrame or Series.

Function Signature:

PrivateDataFrame.join(
    other,
    on = None,
    how = "left",
    lsuffix = "",
    rsuffix = "",
    sort = False,
    validate = None
) -> PrivateDataFrame

Parameters:

other: PrivateDataFrame, PrivateSeries

Index should be similar to one of the columns in this one. If a PrivateSeries is provided, its name attribute will be used as the column name in the resulting joined DataFrame.

on: str, list of str, or array-like, optional

The column or index level name(s) in the calling frame to join on the index of other. If omitted, the join is performed index-on-index.

how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

left: use the calling frame’s index (or column if on is specified).
right: use the other’s index.
outer: form the union of the calling frame’s index (or column if on is specified) with the other’s index, and sort it lexicographically.
inner: form the intersection of the calling frame’s index (or column if on is specified) with the other’s index, preserving the order of the calling frame’s index.

lsuffix: str, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffix: str, default ‘’

Suffix to use from right frame’s overlapping columns.

sort: bool, default False

Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

validate: str, optional

If specified, check if the join is of the specified type. The options are:

one_to_one or 1:1: Check if join keys are unique in both left and right datasets.
one_to_many or 1:m: Check if join keys are unique in the left dataset.
many_to_one or m:1: Check if join keys are unique in the right dataset.
many_to_many or m:m: This option is allowed but doesn’t result in checks.
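
The following sketch contrasts how='left' with how='inner' and shows a validate check. It uses plain pandas, on the assumption that PrivateDataFrame.join behaves like pandas.DataFrame.join for these parameters; the frames and column names are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])

# how="left": every row of the calling frame is kept; "c" has no match.
# validate="m:1" checks that the join keys are unique in the right dataset.
left_join = left.join(right, on="key", how="left", validate="m:1")

# how="inner": only keys present on both sides survive.
inner_join = left.join(right, on="key", how="inner")

print(left_join)
print(inner_join)
```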

`make_column_categorical`

The make_column_categorical() method converts a non-categorical column to a categorical one.

Function Signature:

PrivateDataFrame.make_column_categorical(
    column,
    categories,
    inplace=False
):

Parameters:

column: str

Column to be converted to categorical.

categories: List

List of categories to be used for the column.

inplace: bool

If True, the operation is done in place.

`make_column_non_categorical`

The make_column_non_categorical() method converts a categorical column to a non-categorical one.

Function Signature:

PrivateDataFrame.make_column_non_categorical(
    columns: str | List[str],
    output_bounds: dict = None,
    eps: float = 0.0
)

Parameters:

columns: str | List[str]

Column or a list of columns.

output_bounds: dict

If a column contains numerical values, but is categorical, you need to provide output bounds for it. If output bounds for a numerical column are absent, epsilon will be spent to estimate the bounds.

eps: float

Epsilon to estimate the output bounds of a numerical column.

`metadata`

The metadata property returns the bounds of the numerical columns present in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.metadata -> dict

`notnull`

The notnull() method detects non-missing values for an array-like object.

Function Signature:

PrivateDataFrame.notnull() -> PrivateDataFrame:

`notna`

The notna() method detects existing (non-missing) values.

Function Signature:

PrivateDataFrame.notna() -> PrivateDataFrame:

`one_hot_encoding`

The one_hot_encoding() method encodes the categorical columns of the PrivateDataFrame into one-hot vectors.

Function Signature:

PrivateDataFrame.one_hot_encoding(
    cols,
    prefix=None,
    prefix_sep="_"
) -> PrivateDataFrame:

Parameters:

cols: str | List[str]

Column or list of columns to be encoded.

prefix: str

Prefix to be used for the column names in the resulting PrivateDataFrame.

prefix_sep: str

Separator to be used between the prefix and the column name.
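
For a feel of how prefix and prefix_sep shape the resulting column names, the sketch below uses pandas.get_dummies. Whether op_pandas names its encoded columns the same way is an assumption; the example only illustrates the general one-hot pattern.

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["M", "F", "M"], "Age": [30, 41, 25]})

# Each category becomes its own indicator column named <prefix><sep><category>.
encoded = pd.get_dummies(df, columns=["Sex"], prefix="sex", prefix_sep="_")
print(list(encoded.columns))
```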

`rename`

This method renames a specific set of columns in the PrivateDataFrame. The rename method uses a dictionary, which should contain a key-value pair of the one-to-one mapping needed for the column replacement.

PrivateDataFrame.rename(dict) -> PrivateDataFrame

`sample_with_sensitivity`

The sample_with_sensitivity() method returns a random sample of items from the PrivateDataFrame, so that the sensitivity (how many times a user can be present in the dataset) is capped.

Function Signature:

PrivateDataFrame.sample_with_sensitivity(max_sensitivity) -> PrivateDataFrame:

Parameters:

max_sensitivity: int

The maximum number of times a user can be present in the dataset.
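
The idea of capping sensitivity can be sketched with the standard library: group rows by user and keep a random subset of at most max_sensitivity rows per user. This document does not say how op_pandas identifies users, so the user_id field and the helper name below are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cap_rows_per_user(rows, max_sensitivity, rng):
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)  # "user_id" is an assumed field
    sampled = []
    for user_rows in by_user.values():
        k = min(max_sensitivity, len(user_rows))
        sampled.extend(rng.sample(user_rows, k))  # random subset per user
    return sampled


rows = [{"user_id": u, "amount": i} for i, u in enumerate("aaaabbc")]
capped = cap_rows_per_user(rows, max_sensitivity=2, rng=random.Random(0))
counts = Counter(row["user_id"] for row in capped)
print(counts)
```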

`size`

The size method returns the differentially private number of elements in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.size(eps: float = 0) -> int:

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

`unique`

The unique() method returns the unique values in a column of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.unique(column: str) -> PrivateSeries:

Parameters:

column: str

The column for which you want to find the unique values.

`where`

The where() method replaces the values of the rows where the condition evaluates to False.

Function Signature:

PrivateDataFrame.where(
    cond,
    other = None,
    inplace = False,
    axis = None,
    level = None
)

Parameters:

cond: bool PrivateSeries/PrivateDataFrame, Series/DataFrame, or array-like

If True, keep the original value.
If False, replace it with the corresponding value from the other.

other: None

Changing other is not currently supported.

inplace: bool, default False

Whether to operate in place on the data.

axis: int, default None

Alignment axis if needed. For Series, this parameter is unused and defaults to 0.

level: int, default None

Alignment level if needed.

Returns:

PrivateDataFrame: A DataFrame with the result, or None if the inplace parameter is set to True.
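
A short plain-pandas sketch of the masking behavior follows. It assumes PrivateDataFrame.where matches pandas.DataFrame.where when other is left at its default, so entries where the condition is False become missing values.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})

positive_only = df.where(df > 0)  # entries failing the condition become NaN
print(positive_only)
```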

Basic statistical methods

`count`

The count() method counts the non-empty values in each column, or in each row if you specify the axis parameter as axis='columns'.

Function Signature:

PrivateDataFrame.count(
    eps = 0,
    axis=0,
    numeric_only=False
)

Parameters:

eps : float, default = 0

The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

numeric_only: bool, default None

Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True, otherwise, you must specify a value.

Returns:

Series: A Series object with the count result for each row/column.

`mean`

The mean() method returns the mean value of each column.

Function Signature:

PrivateDataFrame.mean(
    eps = 0,
    axis=0,
    skipna=True,
    numeric_only=None,
    **kwargs
)

Parameters:

eps : float, default = 0

The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

numeric_only: bool, default None

Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True, otherwise, you must specify a value.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:

Series: A Series with the mean values.

`median`

The median() method returns a Series with the median value of each column.

Function Signature:

PrivateDataFrame.median(eps)

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

`percentile`

It is a differentially private implementation of the percentile method.

Function Signature:

PrivateDataFrame.percentile(eps, p)

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

p: float or array-like

A value between 0 <= p <= 100. The percentile(s) to compute.

`quantile`

The quantile() method calculates the quantile of the values in a given axis. The default axis is row.

Function Signature:

PrivateDataFrame.quantile(eps, q = 0.5)

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

q: float or array-like, default 0.5 (50% quantile)

A value between 0 <= q <= 1, the quantile(s) to compute.
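
percentile and quantile differ only in scale: p runs from 0 to 100 and q from 0 to 1, so q = p / 100. The non-private NumPy sketch below illustrates that relationship; it does not show the differentially private computation performed by op_pandas.

```python
import numpy as np

data = np.array([3.0, 7.0, 8.0, 12.0, 20.0])

p = 25       # percentile scale: 0..100
q = p / 100  # quantile scale: 0..1

print(np.percentile(data, p), np.quantile(data, q))
```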

Standard deviation `std`

The standard deviation method, std(), returns the sample's standard deviation over a requested axis.

Function Signature:

PrivateDataFrame.std(
    eps = 0,
    axis=0,
    skipna=True,
    ddof=1,
    numeric_only=None,
    **kwargs
)

Parameters:

eps : float, default = 0

The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is $N - ddof$ , where N represents the number of elements. If axis = 0, ddof must be equal to 1.

numeric_only: bool, default None

Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.

**kwargs

Additional keyword arguments to be passed to the function.
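
The effect of ddof on the divisor N - ddof can be checked with a small non-private calculation. The helper std_with_ddof is written for this illustration only; it compares the two common choices against the standard library.

```python
import math
import statistics


def std_with_ddof(values, ddof):
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))  # divisor is N - ddof


data = [2, 4, 4, 4, 5, 5, 7, 9]
sample_std = std_with_ddof(data, ddof=1)      # matches statistics.stdev
population_std = std_with_ddof(data, ddof=0)  # matches statistics.pstdev
print(sample_std, population_std)
```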

`sum`

The sum() method adds all values in each column and returns the sum for each one.

Function Signature:

PrivateDataFrame.sum(
    eps = 0,
    axis=0,
    skipna=True,
    numeric_only=None,
    min_count=0,
    **kwargs
)

Parameters:

eps: float, default = 0

The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/Null values when computing the result.

numeric_only: bool, default None

Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.

min_count: int, default 0

The required number of valid values to operate.

If fewer than min_count non-NA values are present, the result will be NA.
If axis = 0, min_count is always assumed to be 0. Otherwise, you must specify a value.

**kwargs

Additional keyword arguments to be passed to the function.

Variance `var`

The var() method calculates the variance for each column.

Function Signature:

PrivateDataFrame.var(eps = 0, axis=0, skipna=True, ddof=1, numeric_only=None, **kwargs)

Parameters:

eps : float, default = 0

The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/Null values when computing the result.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is $N - ddof$ , where N represents the number of elements. If axis = 0, ddof must be equal to 1.

numeric_only: bool, default None

Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.

**kwargs

Additional keyword arguments to be passed to the function.

Advanced statistical methods

Correlation `corr`

The corr() method computes the pairwise correlation between the columns of a PrivateDataFrame.

Function Signature:

PrivateDataFrame.corr(eps: float, method: str = "pearson", min_periods: int = 1, numeric_only = True)

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

method: str, {‘pearson’ or ‘spearman’}, default 'pearson'

Define the method used to calculate the correlation. The available options are:

pearson : standard correlation coefficient.
spearman : Spearman rank correlation.

min_periods: int, optional

Assumed to be 1. Currently, min_periods tweaking is not supported.

numeric_only: bool, default True

Include only float, int, or boolean data. Currently, numeric_only tweaking is not allowed.

Returns:

Pandas DataFrame: A DataFrame object with the correlation results.
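
The two methods answer different questions: Pearson measures linear association, while Spearman is Pearson applied to ranks and so measures monotonic association. The non-private, standard-library sketch below makes the distinction concrete; the helper functions are written for this example and ignore ties.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    # Rank 1 for the smallest value; assumes no ties for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    return pearson(ranks(x), ranks(y))  # Pearson on the ranks


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear in x
print(pearson(x, y), spearman(x, y))
```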

Covariance `cov`

The cov() method computes the pairwise covariance between the columns of a PrivateDataFrame.

Function Signature:

PrivateDataFrame.cov(
    eps: float,
    min_periods,
    ddof = 1,
    numeric_only = True
)

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

min_periods: int, optional

Assumed to be 1. Currently, min_periods tweaking is not supported.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is $N - ddof$ , where N represents the number of elements. Currently, ddof tweaking is not supported.

numeric_only: bool, default True

Include only float, int, or boolean data. Currently, numeric_only tweaking is not allowed.

Returns:

Pandas DataFrame: A DataFrame object with the covariance results.

`skew`

The skew() method calculates the skew for each column.

Function Signature:

PrivateDataFrame.skew(
    eps,
    axis = 0,
    skipna = True,
    numeric_only = True
)

Parameters:

eps : float, default = 0

The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/Null values when computing the result.

numeric_only: bool, default True

Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.

Returns:

Pandas DataFrame: A DataFrame object with the skew results.

Histograms

`hist`

This method draws a histogram of the PrivateDataFrame’s columns.

Function Signature:

PrivateDataFrame.hist(
    column,
    eps,
    bins = 10
)

Parameters:

column: str

Column in the PrivateDataFrame to group by.

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

bins: int, default 10

Number of histogram bins to be used.
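
One common way to build a private histogram is to derive equal-width bins from the column's declared bounds, count the clipped values per bin, and add noise to each count. The sketch below follows that pattern with Laplace noise; the mechanism used by op_pandas is not described in this document, so treat the helper and its noise model as assumptions.

```python
import random


def noisy_histogram(values, bounds, bins, eps, rng):
    low, high = bounds
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        v = min(max(v, low), high)  # clip to the declared bounds
        index = min(int((v - low) / width), bins - 1)  # top edge joins last bin
        counts[index] += 1
    scale = 1.0 / eps  # adding or removing one row changes one count by 1

    def noise():
        # The difference of two exponential draws is Laplace-distributed.
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

    return [c + noise() for c in counts]


ages = [18, 22, 25, 31, 39, 40, 58, 64, 65]
hist = noisy_histogram(ages, bounds=(18, 65), bins=10, eps=1.0, rng=random.Random(0))
print(hist)
```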

`hist2d`

This method creates a 2D histogram of two columns of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.hist2d(eps, x, y, bins = 10):

Parameters:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

x: str

The first column of the PrivateDataFrame to group by.

y: str

The second column of the PrivateDataFrame to group by.

bins: int, default 10

Number of histogram bins to be used.

PrivateSeries

The PrivateSeries API is based on pandas.Series, but in this case, all the methods are differentially private. PrivateSeries is available as part of op_pandas library in Antigranular.

Constructor

The PrivateSeries constructor is as follows:

class op_pandas.PrivateSeries(series = None, metadata = None, categorical_metadata = None)

The PrivateSeries parameters are described below:

series: pandas.Series

A Pandas Series, with data consisting of only strings, integers, floats, and booleans.

metadata: Tuple[float, float]

Metadata containing the bounds of the given Series. The metadata should be in the following form: (bound_low, bound_hi).

Note

If the Series contains string data, the metadata should not be provided.

categorical_metadata: List

Metadata containing information about the categorical data of the given Series. The categorical_metadata should be a list containing all the categories in the Series. The data types for all the elements in the list must be the same.

The code blocks below present two distinct examples of PrivateSeries:

    Series : [10, 20, 30, 40, 10, 42, 54]
    metadata : (0, 60)
    categorical_metadata : None

    Series : ["a", "b", "a", "b", "a", "a"]
    metadata : None
    categorical_metadata: ["a", "b"]

General Functions

PrivateSeries provides several internal functions you can use when working with series. The PrivateSeries general functions include:

`categorical_metadata`

This method returns the categorical_metadata of the PrivateSeries

PrivateSeries.categorical_metadata -> List

`copy`

The copy() method returns a copy of the PrivateSeries.

PrivateSeries.copy() -> PrivateSeries

`describe`

The describe() method returns a statistical description of the data in the Series.

PrivateSeries.describe(eps, percentiles = None, include = None, exclude = None)

The available parameters of describe() are the following:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

percentiles: list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include: ‘all’, list-like of dtypes or None (default), optional

This option is ignored for Series.

exclude: list-like of dtypes or None (default), optional

A blocked list of data types to omit from the result. The available options are as follows:

A list-like of dtypes : Excludes the provided data types from the result.
- To exclude numeric types submit numpy.number.
- To exclude object columns submit the data type numpy.object.
- Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])).
- To exclude pandas’ categorical columns, use category.
None (default): The result will exclude nothing.

`dropna`

The dropna() method removes missing values from a PrivateSeries.

PrivateSeries.dropna(axis=0)

The available parameters of dropna() are the following:

axis: boolean {index (0), columns (1)}, default = 0

Not applied in Series.

`dtypes`

The dtypes property returns the data type information of the PrivateSeries.

PrivateSeries.dtypes

`isnull`

The isnull() method detects missing values for an array-like object.

PrivateSeries.isnull() -> PrivateSeries:

`isna`

The isna() method detects missing values for an array-like object.

PrivateSeries.isna() -> PrivateSeries:

`isin`

The isin() method checks if each element in the Series is contained in values.

PrivateSeries.isin(values):

The available parameters of isin() are the following:

values: PrivateDataFrame

The PrivateDataFrame against which each element in the Series is checked for containment.

`make_categorical`

This method makes the series categorical.

PrivateSeries.make_categorical(categories, inplace=False):

The available parameters of make_categorical() are the following:

categories: List

The categories to be used in the categorical metadata.

inplace: bool, default = False

If True, the operation will modify the data in place.

`make_series_non_categorical`

This method makes the series noncategorical.

PrivateSeries.make_series_non_categorical(output_bounds: tuple = None, eps: float = 0.0)

The available parameters of make_series_non_categorical() are the following:

output_bounds: tuple

When a series contains numerical values but is categorical, this parameter provides output bounds for it. In cases where output bounds for a numerical series aren’t provided, epsilon will be spent to estimate the bounds.

eps: float

The epsilon used to estimate the output bounds of a numerical series.

`map`

This method maps values of a PrivateSeries according to an input mapping or function.

PrivateSeries.map(arg, eps = 0, output_bounds = None, output_categories = None)

The available parameters of map() are the following:

arg: callable, mapping, pd.Series or PrivateSeries

If a mapping (dictionary) and the series have categorical data, all the categories in the metadata must have a mapping.

eps: float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0. It's used to calculate the bounds.

output_bounds: Tuple[float, float]

The output bounds. If not provided, epsilon will be spent to estimate the bounds of the applied function's output.

output_categories: List

The output categories, used when the current series is categorical. If not provided, they will be calculated using arg.

info

If the input is a callable, it should return a single value when applied to each element. The output of the callable should be string, int, float, boolean, or datetime.

It's important to note that if the callable is a function, it will execute within an isolated environment with mypy strict mode enabled. The function must adhere to the following constraints:

The function can only accept one argument, which would be the individual element the function is being applied on.
Proper type annotations should be present within the function definition. To utilize datetime and regex, import datetime and re to enable their type annotations. For additional examples, access the Pandas quickstart guide.
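As an illustration of the constraints above, the following sketch shows callables that accept a single argument and carry full type annotations. The function names and sample values are hypothetical; only the shape of the functions reflects the rules listed here, and the final line is a plain Python check rather than an op_pandas call.

```python
import re
from datetime import datetime


def age_bucket(age: int) -> str:
    """Map a numeric age to a coarse category."""
    return "senior" if age >= 65 else "adult"


def year_of(ts: datetime) -> int:
    """Extract the year from a datetime element."""
    return ts.year


def has_digits(text: str) -> bool:
    """Report whether a string element contains any digit."""
    return re.search(r"\d", text) is not None


# Plain-Python check of the callables on individual elements.
print(age_bucket(70), year_of(datetime(2024, 1, 15)), has_digits("abc1"))
```

A function such as year_of could then be passed as arg together with output_bounds, for example series.map(year_of, output_bounds=(1900, 2100)), where series is assumed to be a PrivateSeries of datetime values.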

`metadata`

The metadata property returns the metadata (bounds) of a numerical series.

PrivateSeries.metadata -> tuple

The code block below presents an example of how to use metadata:

>>> train_x.metadata

(0, 60)

`notnull`

The notnull() method detects non-missing values for an array-like object.

PrivateSeries.notnull() -> PrivateSeries:

`notna`

The notna() method detects existing (non-missing) values.

PrivateSeries.notna() -> PrivateSeries:

`one_hot_encoding`

This method performs one-hot encoding on the PrivateSeries.

PrivateSeries.one_hot_encoding(prefix=None, prefix_sep="_") -> PrivateDataFrame:

The available parameters of one_hot_encoding() are the following:

prefix: str, default None

Prefix to use for the column names.

prefix_sep: str, default '_'

Separator to use between the prefix and the column name.
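To make the role of prefix and prefix_sep concrete, here is a minimal non-private sketch of how one-hot column names can be built. It assumes the naming follows the pandas.get_dummies convention (prefix, separator, category); the helper one_hot_columns is hypothetical and is not part of op_pandas.

```python
from typing import Dict, List, Optional


def one_hot_columns(
    values: List[str],
    categories: List[str],
    prefix: Optional[str] = None,
    prefix_sep: str = "_",
) -> Dict[str, List[int]]:
    """Build one 0/1 indicator column per category (non-private sketch)."""
    columns: Dict[str, List[int]] = {}
    for category in categories:
        # With a prefix the column is "<prefix><sep><category>"; without one
        # the category itself is used as the column name.
        name = f"{prefix}{prefix_sep}{category}" if prefix is not None else category
        columns[name] = [1 if value == category else 0 for value in values]
    return columns


encoded = one_hot_columns(["M", "F", "F"], ["M", "F"], prefix="Sex")
print(encoded)
```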

`rename`

This method renames the column name of the PrivateSeries.

PrivateSeries.rename(name:str) -> PrivateSeries

`size`

The size method returns the differentially private number of elements in the PrivateSeries.

PrivateSeries.size(eps: float = 0) -> int:

The available parameters of size() are the following:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

`sample_with_sensitivity`

The sample_with_sensitivity() method returns a random sample of items from the PrivateSeries, so that the sensitivity (how many times a user can be present in the dataset) is capped.

PrivateSeries.sample_with_sensitivity(max_sensitivity) -> PrivateSeries:

The available parameters of sample_with_sensitivity() are the following:

max_sensitivity: int

The maximum number of times a user can be present in the dataset.
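The idea of capping sensitivity can be sketched in plain Python. The example below assumes each row carries a user identifier, which is an assumption made for illustration: the source does not describe how op_pandas tracks users internally, and cap_rows_per_user is a hypothetical helper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def cap_rows_per_user(
    rows: List[Tuple[str, float]], max_sensitivity: int, seed: int = 0
) -> List[Tuple[str, float]]:
    """Keep at most max_sensitivity randomly chosen rows for each user id."""
    rng = random.Random(seed)
    by_user: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    sampled: List[Tuple[str, float]] = []
    for user_rows in by_user.values():
        # Users under the cap keep all rows; the others are randomly subsampled.
        keep = min(max_sensitivity, len(user_rows))
        sampled.extend(rng.sample(user_rows, keep))
    return sampled


rows = [("u1", 1.0), ("u1", 2.0), ("u1", 3.0), ("u2", 4.0)]
capped = cap_rows_per_user(rows, max_sensitivity=2)
print(capped)
```

After capping, no user contributes more than max_sensitivity rows, which bounds how much a single user can change any downstream statistic.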

`unique`

The unique() method returns the unique values in the PrivateSeries.

PrivateSeries.unique() -> PrivateSeries:

`where`

The where() method replaces the values of the rows where the condition evaluates to False.

PrivateSeries.where(cond, other = None, inplace = False, axis = None, level = None)

The available parameters of where() are the following:

cond: bool PrivateSeries/PrivateDataFrame, Series/DataFrame, array-like

Defines the condition, which should return True or False.

If True, keep the original value.
If False, replace it with the corresponding value from other.

other: None

Currently, setting other isn't supported.

inplace: bool, default False

Indicates whether the operation should modify the data in place.

axis: int, default None

This parameter isn’t used for Series. Defaults to 0.

level: int, default None

Alignment level if needed.

The method returns a PrivateSeries with the result, or None if the inplace parameter is set to True.

Basic statistical methods

`count`

The count() method returns the number of non-missing values in the Series.

PrivateSeries.count(eps = 0)

The available parameters of count() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.
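For intuition about what eps controls, the sketch below implements a noisy count with the Laplace mechanism, a standard way to make a count differentially private. This is an illustration under stated assumptions, not a description of the op_pandas internals: the helper names are hypothetical, and the sketch requires eps > 0, whereas the handling of eps = 0 in op_pandas is not covered here.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values: list, eps: float, rng: random.Random) -> float:
    """Noisy count of non-missing values; a count has sensitivity 1."""
    if eps <= 0:
        raise ValueError("this sketch needs eps > 0")
    true_count = sum(1 for v in values if v is not None)
    # Smaller eps means a larger noise scale, hence stronger privacy.
    return true_count + laplace_noise(1.0 / eps, rng)


rng = random.Random(7)
data = [3, None, 8, 1, None, 5]
print(dp_count(data, eps=1.0, rng=rng))
```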

`mean`

The mean() method returns the mean value of the Series.

PrivateSeries.mean(eps = 0)

The available parameters of mean() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.
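One common way to build a differentially private mean is to clamp the values to the known bounds, then divide a noisy sum by a noisy count, splitting eps between the two. The sketch below follows that construction as an assumption; the source does not state which mechanism op_pandas uses, and dp_mean and laplace_noise are hypothetical helpers.

```python
import random
from typing import List, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(
    values: List[float], bounds: Tuple[float, float], eps: float, rng: random.Random
) -> float:
    """Noisy mean from a noisy clamped sum and a noisy count."""
    if eps <= 0:
        raise ValueError("this sketch needs eps > 0")
    lo, hi = bounds
    clamped = [min(max(v, lo), hi) for v in values]
    half = eps / 2.0  # split the budget between the sum and the count
    # Adding or removing one record changes the sum by at most max(|lo|, |hi|).
    noisy_sum = sum(clamped) + laplace_noise(max(abs(lo), abs(hi)) / half, rng)
    noisy_count = len(clamped) + laplace_noise(1.0 / half, rng)
    estimate = noisy_sum / max(noisy_count, 1.0)
    # Clamping the result is post-processing and costs no extra budget.
    return min(max(estimate, lo), hi)


ages = [20.0, 30.0, 40.0, 50.0]
estimate = dp_mean(ages, bounds=(18.0, 65.0), eps=1.0, rng=random.Random(0))
print(estimate)
```

The bounds play the same role as the metadata of a numerical series: without them the sensitivity of the sum would be unbounded.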

`median`

The median() method returns the median of the values in the Series.

PrivateSeries.median(eps = 0)

The available parameters of median() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

`percentile`

This method is a differentially private implementation of the percentile method.

PrivateSeries.percentile(p, eps)

The available parameters of percentile() are the following:

p: float

The percentile to compute. You must provide a value between 0 and 100.

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

`quantile`

This method is a differentially private implementation of the quantile method.

PrivateSeries.quantile(q, eps)

The available parameters of quantile() are the following:

q: float

The quantile to compute. You must provide a value between 0 and 1.

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.
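The percentile and quantile methods differ only in scale: a percentile p in [0, 100] corresponds to the quantile q = p / 100. The non-private sketch below shows the relation using a nearest-rank rule, which is an assumption chosen for simplicity; the interpolation rule and the noise mechanism used by op_pandas are not described in this reference.

```python
import math
from typing import List


def quantile(values: List[float], q: float) -> float:
    """Nearest-rank quantile for q in [0, 1] (non-private)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    # Smallest value with at least a fraction q of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def percentile(values: List[float], p: float) -> float:
    """Percentile for p in [0, 100], expressed through quantile()."""
    if not 0.0 <= p <= 100.0:
        raise ValueError("p must be between 0 and 100")
    return quantile(values, p / 100.0)


data = [15.0, 20.0, 35.0, 40.0, 50.0]
print(percentile(data, 40), quantile(data, 0.4))
```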

standard deviation `std`

The std() method returns the standard deviation of the sample data.

PrivateSeries.std(eps = 0, ddof = 1)

The available parameters of std() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is $N - ddof$, where N represents the number of elements. Currently, changing ddof is not supported.
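The effect of the divisor $N - ddof$ can be checked with the Python standard library, which offers both variants without any privacy noise: statistics.stdev divides by N - 1 (ddof = 1) and statistics.pstdev divides by N (ddof = 0). The data values are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_std = statistics.stdev(data)       # divisor N - 1, i.e. ddof = 1
population_std = statistics.pstdev(data)  # divisor N, i.e. ddof = 0

print(sample_std, population_std)
```

With the default ddof = 1, the divisor is N - 1, so the first variant is the one that matches the signature above.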

`sum`

The sum() method adds all values in the Series.

PrivateSeries.sum(eps = 0)

The available parameters of sum() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

`variance`

The var() method calculates the variance of the Series.

PrivateSeries.var(eps = 0, ddof = 1)

The available parameters of var() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is $N - ddof$, where N represents the number of elements. Currently, changing ddof is not supported.

Advanced statistical methods

The PrivateSeries advanced statistical methods include:

covariance `cov`

The cov() method finds the covariance of two PrivateSeries.

PrivateSeries.cov(other, eps: float, min_periods, ddof = 1)

The available parameters of cov() are the following:

other: PrivateSeries

The second PrivateSeries.

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

min_periods: int, optional

By default, 1 is used. Currently, changing min_periods is not supported.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is $N - ddof$, where N represents the number of elements. Currently, changing ddof is not supported.
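As a reference point for what cov() estimates, the following non-private sketch computes the covariance of two equally long sequences with the divisor N - ddof. The helper covariance and the sample values are hypothetical; the differentially private version would add noise governed by eps.

```python
from typing import List


def covariance(x: List[float], y: List[float], ddof: int = 1) -> float:
    """Covariance with divisor N - ddof (non-private)."""
    if len(x) != len(y):
        raise ValueError("both sequences must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the respective means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - ddof)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(covariance(x, y))
```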

`skew`

The skew() method calculates the skew for the PrivateSeries.

PrivateSeries.skew(eps, axis = 0, skipna = True, numeric_only = True)

The available parameters of skew() are the following:

eps : float, default = 0

The epsilon provided to the differentially private calculation. The eps value must be >=0.

axis: boolean {index (0), columns (1)}, default = 0

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/Null values when computing the result.

numeric_only: bool, default None

Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.

Histograms

`hist`

This method draws a histogram of the PrivateSeries.

PrivateSeries.hist(eps, bins = 10)

The available parameters of hist() are the following:

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

bins: int, default 10

Number of histogram bins to be used.
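A differentially private histogram is commonly built by counting values per bin and adding independent Laplace noise to each count. The sketch below follows that approach as an assumption about the general technique, not as a description of op_pandas; the bounds argument stands in for the series metadata, and dp_histogram is a hypothetical helper that requires eps > 0.

```python
import random
from typing import List, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(
    values: List[float],
    bounds: Tuple[float, float],
    bins: int,
    eps: float,
    rng: random.Random,
) -> List[float]:
    """Equal-width histogram over the bounds with Laplace noise per bin."""
    if eps <= 0:
        raise ValueError("this sketch needs eps > 0")
    lo, hi = bounds
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        clamped = min(max(v, lo), hi)
        # The upper bound itself falls into the last bin.
        index = min(int((clamped - lo) / width), bins - 1)
        counts[index] += 1
    # Each record lands in exactly one bin, so one record changes one count by 1.
    return [c + laplace_noise(1.0 / eps, rng) for c in counts]


ages = [18.0, 25.0, 25.0, 64.0, 65.0]
noisy = dp_histogram(ages, bounds=(18.0, 65.0), bins=10, eps=1.0, rng=random.Random(5))
print(noisy)
```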

`hist2d`

This method creates a 2D histogram of two PrivateSeries.

PrivateSeries.hist2d(other, eps, bins = 10)

The available parameters of hist2d() are the following:

other: PrivateSeries

The second PrivateSeries.

eps: float

The epsilon provided to the differentially private calculation. The eps value must be >=0.

bins: int, default 10

Number of histogram bins to be used.

Info

The PrivateSeries API is based on pandas.Series, but in this case, all the methods are differentially private. PrivateSeries is available as part of the op_pandas library in Antigranular.

General Methods

This page showcases some of the most commonly used pandas methods available in op_pandas and their parameters.

`concat`

The concat() function is used to concatenate pandas objects, such as PrivateSeries and PrivateDataFrames, along a specified axis. This function also supports creating a hierarchical index on the concatenation axis if needed, and handles the set logic of the indexes on the non-concatenation axes through optional union or intersection.

def concat(
    objs,
    *,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=False,
    copy=None,
)->PrivateData:

Parameters:

objs : array of PrivateSeries | PrivateDataFrame

An array that includes PrivateDataFrames or PrivateSeries for concatenation. If any element within the array is None, it will be silently dropped unless all elements are None, in which case a ValueError will be raised.

axis : {0}, default 0

Specifies the axis along which to concatenate the objects. Currently, only concatenation along axis=0 is allowed.

join : {'inner', 'outer'}, default 'outer'

Dictates how to handle the indexes on the axes other than the concatenation axis.

'outer': Uses the union of indexes.
'inner': Uses the intersection of indexes.

ignore_index : bool, default False

If set to True, the index values along the concatenation axis will be ignored. The resulting axis will be labeled from 0 to n - 1. This is particularly useful when the original index does not carry meaningful information for the concatenated result.

keys : sequence, default None

Used to create a hierarchical index on the concatenation axis, with the elements of the sequence forming the outermost level.

levels : list of sequences, default None

Specifies the levels to use for constructing a MultiIndex, if not inferred from the keys.

names : list, default None

Provides names for the levels in the resulting hierarchical index.

verify_integrity : bool, default False

Verification of integrity during concatenation is not supported in this function.

sort : bool, default False

Determines whether to sort the non-concatenation axis if it is not already aligned.

copy : True

The copy parameter is not supported in this version of the function.

Usage:

combined_df = op_pandas.concat([df1, df2], ignore_index=True, join='inner')

Note

The datatypes along a single column must be the same, or the concatenation won't happen.
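The difference between join='inner' and join='outer' is easiest to see on ordinary pandas objects. The sketch below uses plain pandas.concat, on the assumption that op_pandas.concat mirrors its column-alignment behaviour for axis=0; the frames and column names are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [25, 40], "salary": [30000, 52000]})
df2 = pd.DataFrame({"age": [31, 58], "city": ["Lyon", "Porto"]})

# join="inner" keeps only the columns present in every input ("age").
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

# join="outer" keeps the union of columns and fills the gaps with NaN.
outer = pd.concat([df1, df2], join="outer", ignore_index=True)

print(list(inner.columns), list(inner.index))
print(list(outer.columns))
```

With ignore_index=True the result is relabeled from 0 to n - 1, so the original row labels of df1 and df2 do not collide.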

`merge`

The merge() function facilitates the merging of PrivateDataFrame or named PrivateSeries objects, mimicking database-style joins. This function allows for various types of joins, handling indexes and columns differently based on the type of merge specified.

def merge(
    left,
    right,
    how="inner",
    on=None,
    *args, **kwargs
)-> PrivateData:

Parameters:

left : PrivateDataFrame or named PrivateSeries

The left object in the merge. A named PrivateSeries is treated as a PrivateDataFrame with a single column.

right : PrivateDataFrame or named PrivateSeries

The right object in the merge. Similarly, a named PrivateSeries is treated as a PrivateDataFrame with a single column.

how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'

Specifies the type of merge to perform:

'left': Perform a left outer join, using only keys from the left frame. The order of keys is preserved.
'right': Perform a right outer join, using only keys from the right frame. The order of keys is preserved.
'outer': Perform a full outer join, using the union of keys from both frames. Keys are sorted lexicographically.
'inner': Perform an inner join, using the intersection of keys from both frames. The order of the left keys is preserved.
'cross': Create a Cartesian product of both frames, preserving the order of the left keys. Note: No columns to merge on can be specified in a cross join.

Usage:

When columns are specified for a join, index information of the PrivateDataFrames is ignored. However, when joining on indexes, whether with each other or with columns, index information is preserved, which is crucial for alignments where index continuity is necessary.

result = op_pandas.merge(left_df, right_df, how='inner', on='key_column')
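The join types listed above can be compared side by side on ordinary pandas frames. The sketch uses plain pandas.merge, on the assumption that op_pandas.merge follows the same key-matching rules; the frames and the column name key are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, how="inner", on="key")     # keys in both frames
left_join = pd.merge(left, right, how="left", on="key")  # every key from left
outer = pd.merge(left, right, how="outer", on="key")     # union of keys
cross = pd.merge(left, right, how="cross")               # every pairing, no "on"

print(len(inner), len(left_join), len(outer), len(cross))
```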

`to_datetime`

The to_datetime() function converts an input scalar, array-like, PrivateSeries, or PrivateDataFrame into a pandas datetime object, handling a wide range of datetime formats and providing various options for customization and error handling.

def to_datetime(
    arg,
    errors="ignore",
    dayfirst=False,
    yearfirst=False,
    utc=False,
    format=None,
    exact=_NoDefault.no_default,
    unit=None,
    infer_datetime_format=_NoDefault.no_default,
    origin="unix",
    cache=True,
)-> PrivateData:

Parameters:

arg : PrivateSeries

The data to convert to datetime format. For DataFrames, it should contain the columns "year", "month", and "day", with years in a four-digit format.

errors : str, default 'ignore'

'ignore': If parsing fails, return the original input.
'raise': Raise an error if parsing fails.
'coerce': Set unparsable entries to NaT (Not a Time).

dayfirst : bool, default False

Influences parsing order if arg is string-like. If True, interprets the first number in a date string as the day (e.g., 10/11/12 becomes 2012-11-10).

yearfirst : bool, default False

Influences parsing order if arg is string-like. If True, interprets the first number in a date string as the year (e.g., 10/11/12 becomes 2010-11-12).

Note

If both dayfirst and yearfirst are True, yearfirst takes precedence, similar to the behavior in dateutil.

utc : bool, default False

If True, returns a UTC-localized Timestamp, Series, or DatetimeIndex.
If False, returns data without timezone conversion, maintaining original time offsets where present.

format : str, default None

The format string to use for parsing dates, like %d/%m/%Y. Special options include:

'ISO8601': Parse any ISO8601 formatted string.
'mixed': Infer the format for each element. Antigranular recommends using this option with caution.

exact : bool, default True

If True, the format string must be precisely matched.
If False, allows the format to match anywhere in the target string.
Note: Incompatible with format='ISO8601' or format='mixed'.

unit : str, default 'ns'

Defines the unit for numeric input based on the origin. Common units include 'D' (days), 's' (seconds), 'ms' (milliseconds), etc.

infer_datetime_format : bool, default False

When True and no format is specified, attempts to infer the datetime format, potentially speeding up parsing significantly.

origin : scalar, default 'unix'

Defines the reference date for numeric inputs. Possible values:
- 'unix': Start from 1970-01-01.
- 'julian': Start from Julian Calendar day zero.
- Timestamp convertible values or numeric offsets relative to 1970-01-01.

cache : bool, default True

Utilizes a cache for converted dates to enhance parsing speed for repeated date strings, especially those with timezone offsets. Not effective for out-of-bounds values.

Example Usage:

datetime_data = op_pandas.to_datetime(series_data, errors='coerce', dayfirst=True, format='%d/%m/%Y')

Note

If both dayfirst and yearfirst are True, yearfirst takes precedence (same as dateutil).
The exact parameter cannot be used alongside format='ISO8601' or format='mixed'.
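The interaction of format and errors can be tried on an ordinary pandas Series. The sketch below uses plain pandas.to_datetime, on the assumption that op_pandas.to_datetime handles these two parameters the same way; the sample strings are made up.

```python
import pandas as pd

raw = pd.Series(["25/12/2023", "01/02/2024", "not a date"])

# The explicit format reads the day first; errors="coerce" turns the
# unparsable entry into NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(parsed)
```

Passing an explicit format removes the ambiguity that dayfirst and yearfirst otherwise have to resolve.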

`train_test_split`

The train_test_split() method is used to split the PrivateDataFrame or PrivateSeries into a training set and a testing set, which is essential for training models in a manner that can evaluate their performance effectively.

def train_test_split(
    df,
    test_size=0.25,
    random_state=None,
    stratify=None
)-> Tuple[PrivateData , PrivateData]:

Parameters:

df : list | PrivateDataFrame | PrivateSeries

Accepts either a single PrivateDataFrame, a PrivateSeries, or a list of these. The list does not need to contain elements of the same size; however, if they are of the same size, they will be split in the same way in terms of indices.

test_size : float, default 0.25

This specifies the proportion of the dataset to include in the test split. It must be between 0 and 1.

random_state : int | None, default None

Provides a seed value to ensure reproducibility of the split.

stratify : None

Currently, stratification is not supported, meaning the data will be split without considering the distribution of outcomes across the training and testing sets.

Example Usage:

train_data, test_data = op_pandas.train_test_split(df, test_size=0.3, random_state=42)
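The statement that equally sized inputs are split the same way in terms of indices can be sketched in plain Python. The helper split_indices is hypothetical, and rounding the test size up is an assumption; the exact rounding used by op_pandas is not stated in this reference.

```python
import math
import random
from typing import List, Optional, Tuple


def split_indices(
    n: int, test_size: float = 0.25, random_state: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Shuffle the positions 0..n-1 and cut them into train and test parts."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    indices = list(range(n))
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(n * test_size)
    return indices[n_test:], indices[:n_test]


features = [10, 11, 12, 13, 14, 15, 16, 17]
labels = ["a", "b", "c", "d", "e", "f", "g", "h"]

# One set of indices is applied to both inputs, so rows stay aligned.
train_idx, test_idx = split_indices(len(features), test_size=0.25, random_state=42)
x_train = [features[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
print(len(train_idx), len(test_idx))
```

Reusing the same random_state reproduces the same split, which is what makes experiments repeatable.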

`standard_scaler`

This function standardizes features by removing the mean and scaling to unit variance, applying differential privacy techniques to ensure the data privacy is maintained.

def standard_scaler(
    data,
    eps
)-> PrivateData:

Parameters:

data : PrivateDataFrame | PrivateSeries

This is the input data, which should be either a PrivateDataFrame or a PrivateSeries. It contains the features that need to be standardized.

eps : float

Represents the epsilon budget for differential privacy. A smaller epsilon value means stronger privacy guarantees but potentially less accuracy in the scaled data.

Returns:

PrivateData: A PrivateDataFrame or PrivateSeries containing the standardized features.

Usage:

scaled_data = op_pandas.standard_scaler(data, eps=0.1)
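The transformation itself is the usual z-score, (x - mean) / std. The non-private sketch below shows it on a small list; it assumes the population standard deviation (divisor N) as the scale, and it omits the noise that a differentially private version would add when estimating the mean and standard deviation under the eps budget. The helper standardize is hypothetical.

```python
import statistics
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Shift to zero mean and scale to unit variance (non-private)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot scale a constant column")
    return [(v - mean) / std for v in values]


scaled = standardize([10.0, 20.0, 30.0, 40.0])
print(scaled)
```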

`label_encoder`

This function performs label encoding on one or more categorical columns of a DataFrame or a Series. It returns a tuple containing the transformed data and a dictionary mapping the original categories to their encoded labels.

def label_encoder(
    df,
    cols = None
) -> Tuple[ PrivateData , dict]:

Parameters:

df : PrivateDataFrame | PrivateSeries

This is the input data which should be of type PrivateDataFrame or PrivateSeries, containing categorical data that needs to be encoded.

cols : List | str | None

Specifies the columns to be label encoded. You can provide a single column name as a string, a list of column names, or None. If None is provided and the input is a DataFrame, no columns are considered for encoding. This parameter is ignored if the input is a PrivateSeries.

Returns:

Tuple[PrivateData, dict]: A tuple where the first element is the label-encoded data (as a PrivateDataFrame or PrivateSeries) and the second element is a dictionary that maps the original categorical values to their respective integer labels.

Usage:

encoded_data, mapping = op_pandas.label_encoder(df, cols=['category_column'])
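The shape of the returned tuple, encoded data plus a category-to-label dictionary, can be illustrated in plain Python. The helper encode_labels is hypothetical, and assigning labels in sorted category order is an assumption; the order used by op_pandas is not stated in this reference.

```python
from typing import Dict, List, Tuple


def encode_labels(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Replace each category with an integer label and return the mapping."""
    # Sorted order is assumed here so that the labels are deterministic.
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


encoded, mapping = encode_labels(["red", "green", "red", "blue"])
print(encoded, mapping)
```

Keeping the mapping makes it possible to translate the integer labels back to the original categories later.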
