Creating Datasets

Overview

Users can create datasets to perform data analysis. The dataset creation process consists of the following steps:

Step 1
Access the Datasets menu on the sidebar.
Step 2
Click Add a Dataset.

You are directed to the first step of the dataset creation process, the Overview tab, where you can input the dataset's basic information.

Step 3
Overview

This step is where you provide the essential details about your dataset. These details help organize and identify your dataset within the platform, ensuring it is discoverable and properly described for future use.

1
Fill in the following fields:
1
Dataset Name

The name of the dataset.

2
Description

A description of the dataset for all other users to see.

3
Visibility

Defines whether the dataset is Public or Private.

  • Private: The dataset will only be visible to its creator.
  • Public: The dataset will be visible to all users of the platform.
4
Tags

Create topic tags associated with the dataset. These tags let other users see at a glance what the dataset is about.

2
Once all fields are filled in, click Next.
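The Overview fields can be pictured as a small record. The sketch below is only an illustration: the `DatasetOverview` class, its attribute names, and the rule that Dataset Name and Description must be non-empty are assumptions, not part of the platform.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    PRIVATE = "Private"  # visible only to the dataset's creator
    PUBLIC = "Public"    # visible to all users of the platform


@dataclass
class DatasetOverview:
    """Hypothetical record mirroring the four Overview fields."""
    name: str
    description: str
    visibility: Visibility
    tags: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the labels of text fields that are still empty."""
        missing = []
        if not self.name.strip():
            missing.append("Dataset Name")
        if not self.description.strip():
            missing.append("Description")
        return missing


# Example values are invented for illustration.
overview = DatasetOverview(
    name="Hospital admissions 2023",
    description="Daily admission counts per ward.",
    visibility=Visibility.PUBLIC,
    tags=["health", "admissions"],
)
print(overview.missing_fields())
```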
Step 4
Source Information

The Source Information step is where you define the origin of the information for the dataset.

Steps to define Source Information:

1
Choose a source type

Choose the data source to be used. Options are:

  • csv
  • MongoDB
  • json
  • parquet
  • excel
  • mysql
  • snowflake
  • postgresql
  • oracleoci
  • databricks
  • redshift
2
Choose a Connection Type

Define the origin of the data source.

  • For csv, json and parquet, it can be a file from local storage or an HTTP, HTTPS or Amazon S3 URL.
  • In the case of MongoDB, you can input the details and credentials to connect to the server hosting the database.
Local file connection

When File is chosen as the Connection Type, the next step of the creation process allows you to upload the data source file.

3
Once all fields are filled in, click Next.
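For csv, json and parquet sources, the choice between a local file and a URL can be sketched as a simple check on the location string. The function name `classify_location` and the return values `"file"` and `"url"` are assumptions made for this example; the platform's own validation is not described here.

```python
from urllib.parse import urlparse

# Source types that accept either a local file or a URL, per the text above.
FILE_OR_URL_SOURCES = {"csv", "json", "parquet"}
# URL schemes named in the text: HTTP, HTTPS and Amazon S3.
URL_SCHEMES = {"http", "https", "s3"}


def classify_location(source_type: str, location: str) -> str:
    """Return 'url' or 'file' for a csv, json or parquet location."""
    if source_type.lower() not in FILE_OR_URL_SOURCES:
        raise ValueError(f"{source_type!r} does not take a file or URL")
    scheme = urlparse(location).scheme.lower()
    if scheme in URL_SCHEMES:
        return "url"
    if "://" in location:
        # Looks like a URL, but with a scheme the text does not list.
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return "file"


print(classify_location("csv", "data/sales.csv"))
print(classify_location("parquet", "s3://example-bucket/sales.parquet"))
```

Database sources such as MongoDB take connection details and credentials instead, so the sketch rejects them rather than guessing at a format.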
Step 5
File Upload

This step is available only when File was chosen as the Connection Type in the previous step. It allows you to upload the local dataset file.

After uploading the file, click Next.

Step 6
Data Selection

In the Data Selection step, you can configure the dataset’s metadata.

There are two ways to configure dataset metadata: the single-table flow for simple datasets, and the multi-table flow for managing metadata across multiple related tables.

The table and its columns for the metadata can be selected by either choosing them from the dropdown selection or writing a custom query. See the sections below to learn how:

Choosing tables and columns

This option is the simplest way to choose the columns for the metadata. Follow the steps below:

1
Select the Choose Table tab.
2
In the dropdown selections, choose the dataset and the columns to use.

3
Click Preview to display a preview of the chosen columns.
4
The metadata table is displayed below the preview. You can define the Metadata and its Type for the chosen columns.

Calculate with DP

If you wish to calculate the Metadata field using Differential Privacy, click Calculate with DP. A pop-up window is displayed where you can input the Epsilon to be spent on the calculation.

5
Click Next to finish and proceed to the next step.
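To give an intuition for what spending Epsilon on a calculation means, the sketch below applies the Laplace mechanism to a row count. This is a generic textbook illustration, not the platform's implementation: which mechanism Calculate with DP uses and which metadata values it computes are not stated here, and the function name `dp_count` is hypothetical.

```python
import random


def dp_count(values, epsilon: float, rng: random.Random) -> float:
    """Return the number of values plus Laplace noise scaled by 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Adding or removing one row changes a count by at most 1.
    sensitivity = 1.0
    scale = sensitivity / epsilon
    # The difference of two exponential samples with mean `scale`
    # follows a Laplace distribution centred on zero.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return len(values) + noise


# random.Random is used for readability only; it is not a
# cryptographically secure noise source.
rng = random.Random()
ages = [34, 58, 41, 29, 63]  # invented sample column
print(dp_count(ages, epsilon=0.5, rng=rng))
print(dp_count(ages, epsilon=5.0, rng=rng))
```

A smaller Epsilon gives a larger noise scale, so the result is more private but less accurate; a larger Epsilon spends more of the budget for a more accurate value.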

Using a Query

This option is more complex but offers a more configurable approach to selecting tables and columns. Follow the steps below:

1
Select the Query tab.
2
Write the query in the text box. Above it, two dropdown boxes display the table and column information that can be copied and pasted into the query.

3
After writing your query, click Preview to check that it runs correctly and to display a preview of the results.
4
The metadata table is displayed below the preview. You can define the Metadata and its Type for the chosen columns.
Calculate with DP

If you wish to calculate the Metadata field using Differential Privacy, click Calculate with DP, and a pop-up window will be displayed where you can input the Epsilon to be spent on the calculation.

5
Click Next to finish and proceed to the next step.
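Before pasting a query into the text box, it can help to try it against a small local sample. The sketch below assumes the query is SQL, which is not stated here, and uses Python's built-in sqlite3 module with an invented `patients` table; the dialect accepted by your data source may differ.

```python
import sqlite3

# In-memory database holding a tiny, invented sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER, ward TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 34, "A"), (2, 58, "B"), (3, 41, "A")],
)

# The query selects only the columns wanted in the dataset.
query = "SELECT age, ward FROM patients WHERE age >= 40 ORDER BY age"


def preview(connection, sql, limit=5):
    """Run the query and return its column names and the first rows."""
    cursor = connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchmany(limit)


columns, rows = preview(conn, query)
print(columns)
for row in rows:
    print(row)
```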
Step 7
Privacy Budget

The last step of the creation process is the Privacy Budget tab, where you define the dataset's lifetime Epsilon and Delta values. These amounts can be allocated to data scientists and teams, who will then use them to perform queries.

Tip

See more information regarding these parameters on the Differential Privacy page.

This Privacy Budget can later be allocated by Team Admins to the team members using the dataset.
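The relationship between the lifetime budget and later allocations can be sketched as simple bookkeeping. The `PrivacyBudget` class below is hypothetical, and the rule that allocations may not exceed the remaining lifetime Epsilon and Delta is an assumption about how such a budget would typically be enforced.

```python
from dataclasses import dataclass, field


@dataclass
class PrivacyBudget:
    """Hypothetical ledger for a dataset's lifetime Epsilon and Delta."""
    lifetime_epsilon: float
    lifetime_delta: float
    # Maps a team or user name to the (epsilon, delta) allocated so far.
    allocations: dict[str, tuple[float, float]] = field(default_factory=dict)

    def remaining(self) -> tuple[float, float]:
        """Return the (epsilon, delta) not yet allocated to anyone."""
        used_epsilon = sum(e for e, _ in self.allocations.values())
        used_delta = sum(d for _, d in self.allocations.values())
        return (self.lifetime_epsilon - used_epsilon,
                self.lifetime_delta - used_delta)

    def allocate(self, recipient: str, epsilon: float, delta: float) -> None:
        """Give part of the remaining budget to a team or user."""
        if epsilon < 0 or delta < 0:
            raise ValueError("allocations must be non-negative")
        remaining_epsilon, remaining_delta = self.remaining()
        if epsilon > remaining_epsilon or delta > remaining_delta:
            raise ValueError("allocation exceeds the remaining budget")
        previous = self.allocations.get(recipient, (0.0, 0.0))
        self.allocations[recipient] = (previous[0] + epsilon,
                                       previous[1] + delta)


# Example values are invented for illustration.
budget = PrivacyBudget(lifetime_epsilon=10.0, lifetime_delta=1e-5)
budget.allocate("team-a", epsilon=4.0, delta=2e-6)
print(budget.remaining())
```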

Step 8
Publish Dataset

After setting all the values, click Publish Dataset to conclude the creation process.

The dataset will now be available in the datasets list and ready for use by teams or users. The Admin tag indicates you possess the dataset admin role for this specific dataset.
