Multi-Table Metadata Extraction
Column Metadata
Overview
This document describes the key attributes extracted for each column when analyzing database table metadata. The following table summarizes each attribute and its meaning:
Attribute Descriptions
| Attribute | Description |
|---|---|
| Column Name | The name of the column as defined in the database table. |
| Metadata DType | The data type of the column (e.g., Numeric, String, Boolean, DateTime, Bytes). |
| Nullable | Indicates whether the column can contain NULL values. |
| Unique | Specifies if all values in this column must be unique across the table. |
| Autoincrement | Indicates if the column automatically increments its value (commonly used for numeric IDs). |
| Default Value | The default value assigned to the column if no value is provided when inserting a new row. |
| Is Primary Key | A boolean representing if the current column is the primary key for the table. |
| Original Type Meta | Original data type information as received from the database. |
Metadata DTypes
This section documents the structure of the supported metadata dtypes.
Data Type Classes With Attributes
1. Bool
Represents a column containing boolean values. Depending on the database, these may be stored as true/false, 1/0, or similar representations.
1. Bool
Represents a column containing boolean values. Depending on the database, these may be stored astrue/false, 1/0, or similar representations.| Attribute | Description |
|---|---|
| Name | A literal containing the name of the class. |
| True Probability | This is used to determine the probability of a true value. |
2. Bytes
Represents a column containing raw binary data (e.g., files, images, or other non-text data).
2. Bytes
Represents a column containing raw binary data (e.g., files, images, or other non-text data).| Attribute | Description |
|---|---|
| Name | A literal containing the name of the class. |
| Min Size Bytes | The minimum size of the bytes data. |
| Max Size Bytes | The maximum size of the bytes data. |
3. Numeric
Represents a column containing numeric values (e.g., integers, floats, decimals).
3. Numeric
Represents a column containing numeric values (e.g., integers, floats, decimals).| Attribute | Description |
|---|---|
| Name | A literal containing the name of the class. |
| Numeric Type | The numeric subtype: Integer, Float, or Decimal. |
| Categories | A list of values that are the categorical values for the columns. |
| Min Value | Minimum value for the column. Populated if no categories are found. |
| Max Value | Maximum value for the column. Populated if no categories are found. |
4. String
Represents a column containing text or character data.
4. String
Represents a column containing text or character data.| Attribute | Description |
|---|---|
| Name | A literal containing the name of the class. |
| Categories | A list of values that are the categorical values for the columns. |
| Min Length | Minimum length for the column. Populated if no categories are found. |
| Max Length | Maximum length for the column. Populated if no categories are found. |
| Semantic Type | Semantic type for the column. This determines how test data for the column can be generated. |
5. Datetime
Represents a column containing date and/or time values (e.g., timestamps, dates).
5. Datetime
Represents a column containing date and/or time values (e.g., timestamps, dates).| Attribute | Description |
|---|---|
| Name | A literal containing the name of the class. |
| Categories | A list of values that are the categorical values for the columns. |
| Is Date Only | A boolean representing whether the current column is a date only column. |
| Is Time Only | A boolean representing whether the current column is a time only column. |
| Min Value | Minimum value for the column. Populated if no categories are found. |
| Max Value | Maximum value for the column. Populated if no categories are found. |
6. Unsupported
Unsupported Column Type
6. Unsupported
Unsupported Column Type| Attribute | Description |
|---|---|
| Name | A literal containing the name of the class. |
Semantic Types
The semantic type describes what the data represents (for example, an email address, a physical address, contact, sha etc.), in contrast to the underlying data type (for example, string or integer). Semantic types can be used to generate realistic test data for the columns they are associated with.
The following semantic types are available:
Faker Semantic Type
Faker is a python package, used to generate fake strings of specific formats like address, names, contact numbers, uuids, sha etc.
Generation Prompt Semantic Type
This type stores a prompt that, when passed to an LLM, generates values similar to those observed in the column.
Regex Semantic Type
This type stores a regular expression that matches the structure of values in the column.
Supported SQL Types by Database
This section documents the supported SQL types for each database. SQL types are grouped by data type class.
MySQL
MySQL
| Data Type Class | SQL Types |
|---|---|
| Bool | BOOLEAN, BOOL, BIT |
| Bytes | BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, BINARY, VARBINARY |
| DateTime | DATE, DATETIME, TIMESTAMP, TIME, YEAR |
| Numeric | INT, INTEGER, BIGINT, SMALLINT, TINYINT, MEDIUMINT, FLOAT, DOUBLE, DECIMAL |
| String | CHAR, VARCHAR, TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT, ENUM, SET, JSON |
| Unsupported | (none) |
PostgreSQL
PostgreSQL
| Data Type Class | SQL Types |
|---|---|
| Bool | BOOLEAN |
| Bytes | BYTEA |
| DateTime | DATE, TIMESTAMP, TIME, TIMESTAMP WITHOUT TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME WITHOUT TIME ZONE, TIME WITH TIME ZONE |
| Numeric | INT, INTEGER, BIGINT, SMALLINT, FLOAT, DOUBLE, REAL, NUMERIC, DOUBLE PRECISION |
| String | CHAR, VARCHAR, TEXT, CHARACTER |
| Unsupported | TEXT[], INET, CIDR, INTEGER[], INT4RANGE, INT8RANGE, NUMRANGE, DATERANGE, TSRANGE, TSTZRANGE, JSONB, JSON, XML, BOX, CIRCLE, LINE, LSEG, PATH, POINT, POLYGON, INTERVAL, MACADDR, MACADDR8, UUID, ENUM, BIT, BIT VARYING |
Snowflake
Snowflake
| Data Type Class | SQL Types |
|---|---|
| Bool | BOOLEAN, BIT |
| Bytes | BINARY, VARBINARY |
| DateTime | DATE, TIME, TIMESTAMP, TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ |
| Numeric | NUMBER, DECIMAL, NUMERIC, INT, INTEGER, BIGINT, SMALLINT, FLOAT, FLOAT4, FLOAT8, DOUBLE, DOUBLE PRECISION, REAL |
| String | STRING, TEXT, VARCHAR, CHAR, CHARACTER |
| Unsupported | (none) |
Redshift
Redshift
| Data Type Class | SQL Types |
|---|---|
| Bool | BOOLEAN, BOOL |
| Bytes | (none) |
| DateTime | DATE, TIME, TIMESTAMP, TIMESTAMPTZ, TIMETZ, TIME WITHOUT TIME ZONE, TIMESTAMP WITHOUT TIME ZONE |
| Numeric | SMALLINT, INTEGER, INT2, INT4, INT8, BIGINT, DECIMAL, NUMERIC, REAL, DOUBLE PRECISION, FLOAT4, FLOAT8 |
| String | VARCHAR, CHAR, CHARACTER, CHARACTER VARYING, TEXT |
| Unsupported | (none) |
Databricks
Databricks
| Data Type Class | SQL Types |
|---|---|
| Bool | BOOLEAN |
| Bytes | BINARY |
| DateTime | DATE, TIMESTAMP |
| Numeric | BYTE, SHORT, INT, BIGINT, LONG, FLOAT, DOUBLE, DECIMAL |
| String | STRING |
| Unsupported | (none) |
Oracle OCI
Oracle OCI
| Data Type Class | SQL Types |
|---|---|
| Bool | (none) |
| Bytes | RAW, LONG RAW, BLOB |
| DateTime | DATE, TIMESTAMP, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH LOCAL TIME ZONE, INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND |
| Numeric | NUMBER, FLOAT, BINARY_FLOAT, BINARY_DOUBLE |
| String | CHAR, NCHAR, VARCHAR2, NVARCHAR2, LONG |
| Unsupported | CLOB, NCLOB |
MongoDB
MongoDB
| Data Type Class | SQL Types |
|---|---|
| Bool | BIT |
| Bytes | VARBINARY |
| DateTime | TIMESTAMP |
| Numeric | INT, BIGINT, FLOAT, DECIMAL |
| String | VARCHAR, BSON:OBJECTID |
| Unsupported | BSON:ARRAY, BSON:MINKEY, BSON:MAXKEY, BSON:REGEX, BSON:NULL, BSON:CODEWSCOPE, BSON:SYMBOL |
CSV
CSV
| Data Type Class | SQL Types |
|---|---|
| Bool | BIT |
| Bytes | (none) |
| DateTime | DATE, TIME, TIMESTAMP |
| Numeric | INT, FLOAT, BIGINT, DOUBLE, DECIMAL |
| String | VARCHAR |
| Unsupported | (none) |
JSON
JSON
| Data Type Class | SQL Types |
|---|---|
| Bool | BIT |
| Bytes | (none) |
| DateTime | DATE, TIME, TIMESTAMP |
| Numeric | INT, FLOAT, BIGINT, DOUBLE, DECIMAL |
| String | VARCHAR |
| Unsupported | (none) |
EXCEL
EXCEL
| Data Type Class | SQL Types |
|---|---|
| Bool | BIT |
| Bytes | (none) |
| DateTime | DATE, TIME, TIMESTAMP |
| Numeric | INT, FLOAT, BIGINT, DOUBLE, DECIMAL |
| String | VARCHAR |
| Unsupported | (none) |
Parquet
Parquet
| Data Type Class | SQL Types |
|---|---|
| Bool | BIT |
| Bytes | (none) |
| DateTime | DATE, TIME, TIMESTAMP |
| Numeric | INT, FLOAT, BIGINT, DOUBLE, DECIMAL |
| String | VARCHAR |
| Unsupported | (none) |