General Methods
This page showcases some of the most commonly used pandas methods available in op_pandas, along with their parameters.
concat
The concat() function concatenates pandas objects, such as PrivateSeries and PrivateDataFrames, along a specified axis. It can also create a hierarchical index on the concatenation axis, and it resolves the indexes on the non-concatenation axes by taking either their union or their intersection.
def concat(
    objs,
    *,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=False,
    copy=None,
) -> PrivateData:
Parameters:
objs : array of PrivateSeries | PrivateDataFrame
    An array of PrivateDataFrames or PrivateSeries to concatenate. Any None elements are silently dropped; if all elements are None, a ValueError is raised.
axis : {0}, default 0
    The axis along which to concatenate the objects. Currently, only axis=0 is supported.
join : {'inner', 'outer'}, default 'outer'
    How to handle the indexes on the axes other than the concatenation axis.
    - 'outer': Uses the union of indexes.
    - 'inner': Uses the intersection of indexes.
ignore_index : bool, default False
    If True, the index values along the concatenation axis are ignored and the resulting axis is labeled 0 to n - 1. This is useful when the original index carries no meaningful information for the concatenated result.
keys : sequence, default None
    Used to create a hierarchical index on the concatenation axis, with the elements of the sequence forming the outermost level.
levels : list of sequences, default None
    The levels to use for constructing a MultiIndex, if they are not inferred from the keys.
names : list, default None
    Names for the levels in the resulting hierarchical index.
verify_integrity : bool, default False
    Integrity verification during concatenation is not supported.
sort : bool, default False
    Whether to sort the non-concatenation axis if it is not already aligned.
copy : bool, default None
    Not supported in this version of the function.
Usage:
combined_df = op_pandas.concat([df1, df2], ignore_index=True, join='inner')
Note: A column must have the same data type in every object being concatenated; otherwise the concatenation is not performed.
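The effect of join, ignore_index, and keys can be seen in a small sketch. It uses plain pandas DataFrames as stand-ins, on the assumption that op_pandas.concat follows the same semantics for these parameters; the frames df1 and df2 and their columns are made up for illustration.

```python
import pandas as pd

# Plain pandas stand-ins for PrivateDataFrames (assumption: same join semantics).
df1 = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
df2 = pd.DataFrame({"id": [3, 4], "label": ["a", "b"]})

# join="outer" keeps the union of columns; cells with no source value become NaN.
outer = pd.concat([df1, df2], join="outer", ignore_index=True)

# join="inner" keeps only the columns present in every input.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

# keys adds an outer index level recording which input each row came from.
keyed = pd.concat([df1, df2], keys=["first", "second"], names=["source", "row"])

print(list(outer.columns))   # ['id', 'score', 'label']
print(list(inner.columns))   # ['id']
print(list(inner.index))     # [0, 1, 2, 3]
print(keyed.index.names)
```

With ignore_index=True the original row labels (0, 1 in both inputs) are replaced by a fresh 0 to n - 1 range, which avoids duplicate labels in the result.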
merge
The merge() function facilitates the merging of PrivateDataFrame or named PrivateSeries objects, mimicking database-style joins. This function allows for various types of joins, handling indexes and columns differently based on the type of merge specified.
def merge(
    left,
    right,
    how="inner",
    on=None,
    *args, **kwargs
) -> PrivateData:
Parameters:
left : PrivateDataFrame or named PrivateSeries
    The left object in the merge. A named PrivateSeries is treated as a PrivateDataFrame with a single column.
right : PrivateDataFrame or named PrivateSeries
    The right object in the merge. A named PrivateSeries is likewise treated as a PrivateDataFrame with a single column.
how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
    The type of merge to perform:
    - 'left': Perform a left outer join, using only keys from the left frame. The order of keys is preserved.
    - 'right': Perform a right outer join, using only keys from the right frame. The order of keys is preserved.
    - 'outer': Perform a full outer join, using the union of keys from both frames. Keys are sorted lexicographically.
    - 'inner': Perform an inner join, using the intersection of keys from both frames. The order of the left keys is preserved.
    - 'cross': Create a Cartesian product of both frames, preserving the order of the left keys. Note: No columns to merge on can be specified in a cross join.
Usage:
When columns are specified for a join, index information of the PrivateDataFrames is ignored. However, when joining on indexes, whether with each other or with columns, index information is preserved, which is crucial for alignments where index continuity is necessary.
result = op_pandas.merge(left_df, right_df, how='inner', on='key_column')
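The difference between the how options is easiest to see side by side. The sketch below uses plain pandas DataFrames as stand-ins, assuming op_pandas.merge follows the same join semantics; the frames and the columns x and y are invented for illustration.

```python
import pandas as pd

# Plain pandas stand-ins for PrivateDataFrames (assumption: same join semantics).
left_df = pd.DataFrame({"key_column": ["a", "b", "c"], "x": [1, 2, 3]})
right_df = pd.DataFrame({"key_column": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left_df, right_df, how="inner", on="key_column")
left = pd.merge(left_df, right_df, how="left", on="key_column")
outer = pd.merge(left_df, right_df, how="outer", on="key_column")

# A cross join takes no `on` argument; every left row is paired with every right row.
cross = pd.merge(left_df, right_df, how="cross")

print(list(inner["key_column"]))  # ['b', 'c']
print(list(left["key_column"]))   # ['a', 'b', 'c']
print(list(outer["key_column"]))  # ['a', 'b', 'c', 'd']
print(len(cross))                 # 9
```

Keys with no match on the other side ('a' in the left join, 'a' and 'd' in the outer join) get missing values in the columns that come from the other frame.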
to_datetime
The to_datetime() function converts an input scalar, array-like, PrivateSeries, or PrivateDataFrame into a pandas datetime object. It handles a wide range of datetime formats and provides options for customization and error handling.
def to_datetime(
    arg,
    errors="ignore",
    dayfirst=False,
    yearfirst=False,
    utc=False,
    format=None,
    exact=_NoDefault.no_default,
    unit=None,
    infer_datetime_format=_NoDefault.no_default,
    origin="unix",
    cache=True,
) -> PrivateData:
Parameters:
arg : PrivateSeries
    The data to convert to datetime format. For DataFrames, it should contain the columns "year", "month", and "day", with years in a four-digit format.
errors : str, default 'ignore'
    - 'ignore': If parsing fails, return the original input.
    - 'raise': Raise an error if parsing fails.
    - 'coerce': Set unparsable entries to NaT (Not a Time).
dayfirst : bool, default False
    Influences parsing order if arg is string-like. If True, interprets the first number in a date string as the day (e.g., 10/11/12 becomes 2012-11-10).
yearfirst : bool, default False
    Influences parsing order if arg is string-like. If True, interprets the first number in a date string as the year (e.g., 10/11/12 becomes 2010-11-12).
    If both dayfirst and yearfirst are True, yearfirst takes precedence, similar to the behavior in dateutil.
utc : bool, default False
    - If True, returns a UTC-localized Timestamp, Series, or DatetimeIndex.
    - If False, returns data without timezone conversion, maintaining original time offsets where present.
format : str, default None
    The format string to use for parsing dates, such as %d/%m/%Y. Special options include:
    - 'ISO8601': Parse any ISO8601 formatted string.
    - 'mixed': Infer the format for each element. Use with caution, as recommended by Antigranular.
exact : bool, default True
    - If True, the format string must be matched exactly.
    - If False, the format may match anywhere in the target string.
    - Note: Incompatible with format='ISO8601' or format='mixed'.
unit : str, default 'ns'
    The unit for numeric input, counted from the origin. Common units include 'D' (days), 's' (seconds), and 'ms' (milliseconds).
infer_datetime_format : bool, default False
    When True and no format is specified, attempts to infer the datetime format, potentially speeding up parsing significantly.
origin : scalar, default 'unix'
    The reference date for numeric inputs. Possible values:
    - 'unix': Start from 1970-01-01.
    - 'julian': Start from Julian Calendar day zero.
    - Timestamp-convertible values or numeric offsets relative to 1970-01-01.
cache : bool, default True
    Uses a cache of converted dates to speed up parsing of repeated date strings, especially those with timezone offsets. Not effective for out-of-bounds values.
Example Usage:
datetime_data = op_pandas.to_datetime(series_data, errors='coerce', dayfirst=True, format='%d/%m/%Y')
- If both dayfirst and yearfirst are True, yearfirst takes precedence (same as dateutil).
- exact cannot be used alongside format='ISO8601' or format='mixed'.
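A short sketch of how errors, format, dayfirst, unit, and origin interact. It runs against plain pandas, on the assumption that op_pandas.to_datetime forwards these parameters with the same meaning; the input values are invented for illustration.

```python
import pandas as pd

# Plain pandas stand-in for a PrivateSeries (assumption: same parsing semantics).
series_data = pd.Series(["10/11/2012", "25/12/2020", "not a date"])

# An explicit format removes day/month ambiguity; errors="coerce" turns
# unparsable entries into NaT instead of raising or returning the input.
parsed = pd.to_datetime(series_data, format="%d/%m/%Y", errors="coerce")
print(parsed.iloc[0])           # 2012-11-10 00:00:00
print(parsed.isna().tolist())   # [False, False, True]

# Without a format, dayfirst decides how an ambiguous string is read.
month_first = pd.to_datetime("10/11/2012")
day_first = pd.to_datetime("10/11/2012", dayfirst=True)
print(month_first.date(), day_first.date())  # 2012-10-11 2012-11-10

# Numeric input is counted in `unit` steps from `origin`.
days = pd.to_datetime(pd.Series([0, 1, 365]), unit="D", origin="unix")
print(days.iloc[2])             # 1971-01-01 00:00:00
```

Supplying format is usually the safer choice for columns with a known layout, since dayfirst only guides inference and does not enforce a layout.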
train_test_split
The train_test_split() function splits a PrivateDataFrame or PrivateSeries into a training set and a testing set, so that a model can be evaluated on data it was not trained on.
def train_test_split(
    df,
    test_size=0.25,
    random_state=None,
    stratify=None
) -> Tuple[PrivateData, PrivateData]:
Parameters:
df : list | PrivateDataFrame | PrivateSeries
    A single PrivateDataFrame, a single PrivateSeries, or a list of these. The list elements do not need to be the same size; if they are, they are split on the same indices.
test_size : float, default 0.25
    The proportion of the dataset to include in the test split. Must be between 0 and 1.
random_state : int | None, default None
    A seed value that makes the split reproducible.
stratify : None
    Stratification is currently not supported, so the data is split without considering the distribution of outcomes across the training and testing sets.
Example Usage:
train_data, test_data = op_pandas.train_test_split(df, test_size=0.3, random_state=42)
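The idea of splitting several equally sized inputs on the same indices can be sketched in pure Python. This is an illustration only, not the op_pandas implementation: split_indices and split_all are hypothetical helpers, and rounding the test size up is an assumption.

```python
import math
import random

def split_indices(n_rows, test_size=0.25, random_state=None):
    """Hypothetical helper: return (train_idx, test_idx) for n_rows rows."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    indices = list(range(n_rows))
    # A seeded shuffle makes the split reproducible for a given random_state.
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test size up
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def split_all(columns, test_size=0.25, random_state=None):
    """Split several equal-length lists using one shared set of indices."""
    train_idx, test_idx = split_indices(len(columns[0]), test_size, random_state)
    return [([col[i] for i in train_idx], [col[i] for i in test_idx])
            for col in columns]

features = [10, 11, 12, 13, 14, 15, 16, 17]
labels = ["a", "b", "c", "d", "e", "f", "g", "h"]
(x_train, x_test), (y_train, y_test) = split_all(
    [features, labels], test_size=0.25, random_state=42
)
print(len(x_train), len(x_test))  # 6 2
print(x_test, y_test)             # rows at matching positions stay paired
```

Because both lists are indexed with the same train and test positions, each feature row stays aligned with its label after the split.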
standard_scaler
This function standardizes features by removing the mean and scaling to unit variance, using differential privacy to protect the underlying data.
def standard_scaler(
    data,
    eps
) -> PrivateData:
Parameters:
data : PrivateDataFrame | PrivateSeries
    The input data, containing the features to be standardized.
eps : float
    The epsilon budget for differential privacy. A smaller epsilon gives stronger privacy guarantees but potentially less accuracy in the scaled data.
Returns:
PrivateData: The standardized features. Per the signature this is a PrivateData object, most likely a PrivateDataFrame or PrivateSeries matching the input.
Usage:
scaled_data = op_pandas.standard_scaler(data, eps=0.1)
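For intuition, the sketch below shows ordinary (non-private) standardization and, separately, how the noise scale of a textbook Laplace mechanism grows as eps shrinks. Both helpers are hypothetical; this page does not state which mechanism standard_scaler uses, so the Laplace part is an assumption chosen only to illustrate the privacy and accuracy trade-off.

```python
import statistics

def standardize(values):
    """Non-private z-score scaling: (x - mean) / std, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot scale a constant column")
    return [(x - mean) / std for x in values]

def laplace_noise_scale(sensitivity, eps):
    """Scale b of the noise in a textbook Laplace mechanism: b = sensitivity / eps.

    Assumption for illustration only; not taken from op_pandas.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return sensitivity / eps

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
scaled = standardize(values)
print(scaled)

# Smaller eps means a larger noise scale, so stronger privacy and less accuracy.
print(laplace_noise_scale(1.0, eps=0.1) > laplace_noise_scale(1.0, eps=1.0))  # True
```

In a private setting the mean and standard deviation themselves would be estimated with noise, so the scaled output is only approximately zero-mean and unit-variance.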
label_encoder
This function performs label encoding on one or more categorical columns of a DataFrame or a Series. It returns a tuple containing the transformed data and a dictionary mapping the original categories to their encoded labels.
def label_encoder(
    df,
    cols=None
) -> Tuple[PrivateData, dict]:
Parameters:
df : PrivateDataFrame | PrivateSeries
    The input data, containing the categorical data to be encoded.
cols : List | str | None
    The columns to label encode: a single column name as a string, a list of column names, or None. If None is provided and the input is a DataFrame, no columns are encoded. This parameter is ignored if the input is a PrivateSeries.
Returns:
Tuple[PrivateData, dict]: A tuple where the first element is the label-encoded data (as a PrivateDataFrame or PrivateSeries) and the second element is a dictionary that maps the original categorical values to their respective integer labels.
Usage:
encoded_data, mapping = op_pandas.label_encoder(df, cols=['category_column'])
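The shape of the returned tuple can be illustrated with a pure-Python sketch. encode_labels is a hypothetical helper, and assigning codes in sorted category order is an assumption; the order used by op_pandas is not stated on this page.

```python
def encode_labels(values):
    """Hypothetical helper: map each distinct category to an integer label.

    Returns (codes, mapping), mirroring the (data, dict) tuple described above.
    Codes follow sorted category order (an assumption for this sketch).
    """
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

colors = ["red", "green", "red", "blue"]
codes, mapping = encode_labels(colors)
print(codes)    # [2, 1, 2, 0]
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# Inverting the mapping recovers the original categories from the codes.
inverse = {code: category for category, code in mapping.items()}
print([inverse[c] for c in codes])  # ['red', 'green', 'red', 'blue']
```

Keeping the mapping dictionary is what allows encoded predictions or results to be translated back into the original category names.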