Skip to main content

Multi-Table Metadata Extraction

Column Metadata

Overview

This document describes the key attributes extracted for each column when analyzing database table metadata. The following table summarizes each attribute and its meaning:

Attribute Descriptions

AttributeDescription
Column NameThe name of the column as defined in the database table.
Metadata DTypeThe data type of the column (e.g., Numeric, String, Boolean, DateTime, Bytes).
NullableIndicates whether the column can contain NULL values.
UniqueSpecifies if all values in this column must be unique across the table.
AutoincrementIndicates if the column automatically increments its value (commonly used for numeric IDs).
Default ValueThe default value assigned to the column if no value is provided when inserting a new row.
Is Primary KeyA boolean representing if the current column is the primary key for the table.
Original Type MetaOriginal data type information as received from the database.

Metadata DTypes

This section documents the structure of the supported metadata dtypes.

Data Type Classes With Attributes

1. Bool

Represents a column containing boolean values. Depending on the database, these may be stored as true/false, 1/0, or similar representations.
AttributeDescription
NameA literal containing the name of the class.
True ProbabilityThis is used to determine the probability of a true value.

2. Bytes

Represents a column containing raw binary data (e.g., files, images, or other non-text data).
AttributeDescription
NameA literal containing the name of the class.
Min Size BytesThe minimum size of the bytes data.
Max Size BytesThe maximum size of the bytes data.

3. Numeric

Represents a column containing numeric values (e.g., integers, floats, decimals).
AttributeDescription
NameA literal containing the name of the class.
Numeric TypeThe numeric subtype: Integer, Float, or Decimal.
CategoriesA list of values that are the categorical values for the columns.
Min ValueMinimum value for the column. Populated if no categories are found.
Max ValueMaximum value for the column. Populated if no categories are found.

4. String

Represents a column containing text or character data.
AttributeDescription
NameA literal containing the name of the class.
CategoriesA list of values that are the categorical values for the columns.
Min LengthMinimum length for the column. Populated if no categories are found.
Max LengthMaximum length for the column. Populated if no categories are found.
Semantic TypeSemantic type for the column. This determines how test data for the column can be generated.

5. Datetime

Represents a column containing date and/or time values (e.g., timestamps, dates).
AttributeDescription
NameA literal containing the name of the class.
CategoriesA list of values that are the categorical values for the columns.
Is Date OnlyA boolean representing whether the current column is a date only column.
Is Time OnlyA boolean representing whether the current column is a time only column.
Min ValueMinimum value for the column. Populated if no categories are found.
Max ValueMaximum value for the column. Populated if no categories are found.

6. Unsupported

Unsupported Column Type
AttributeDescription
NameA literal containing the name of the class.

Semantic Types

The semantic type describes what the data represents (for example, an email address, a physical address, contact, sha etc.), in contrast to the underlying data type (for example, string or integer). Semantic types can be used to generate realistic test data for the columns they are associated with.

The following semantic types are available:

Faker Semantic Type

Faker is a python package, used to generate fake strings of specific formats like address, names, contact numbers, uuids, sha etc.

Generation Prompt Semantic Type

This type stores a prompt that, when passed to an LLM, generates values similar to those observed in the column.

Regex Semantic Type

This type stores a regular expression that matches the structure of values in the column.

Supported SQL Types by Database

This section documents the supported SQL types for each database. SQL types are grouped by data type class.


MySQL

Data Type ClassSQL Types
BoolBOOLEAN, BOOL, BIT
BytesBLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, BINARY, VARBINARY
DateTimeDATE, DATETIME, TIMESTAMP, TIME, YEAR
NumericINT, INTEGER, BIGINT, SMALLINT, TINYINT, MEDIUMINT, FLOAT, DOUBLE, DECIMAL
StringCHAR, VARCHAR, TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT, ENUM, SET, JSON
Unsupported(none)

PostgreSQL

Data Type ClassSQL Types
BoolBOOLEAN
BytesBYTEA
DateTimeDATE, TIMESTAMP, TIME, TIMESTAMP WITHOUT TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME WITHOUT TIME ZONE, TIME WITH TIME ZONE
NumericINT, INTEGER, BIGINT, SMALLINT, FLOAT, DOUBLE, REAL, NUMERIC, DOUBLE PRECISION
StringCHAR, VARCHAR, TEXT, CHARACTER
UnsupportedTEXT[], INET, CIDR, INTEGER[], INT4RANGE, INT8RANGE, NUMRANGE, DATERANGE, TSRANGE, TSTZRANGE, JSONB, JSON, XML, BOX, CIRCLE, LINE, LSEG, PATH, POINT, POLYGON, INTERVAL, MACADDR, MACADDR8, UUID, ENUM, BIT, BIT VARYING

Snowflake

Data Type ClassSQL Types
BoolBOOLEAN, BIT
BytesBINARY, VARBINARY
DateTimeDATE, TIME, TIMESTAMP, TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ
NumericNUMBER, DECIMAL, NUMERIC, INT, INTEGER, BIGINT, SMALLINT, FLOAT, FLOAT4, FLOAT8, DOUBLE, DOUBLE PRECISION, REAL
StringSTRING, TEXT, VARCHAR, CHAR, CHARACTER
Unsupported(none)

Redshift

Data Type ClassSQL Types
BoolBOOLEAN, BOOL
Bytes(none)
DateTimeDATE, TIME, TIMESTAMP, TIMESTAMPTZ, TIMETZ, TIME WITHOUT TIME ZONE, TIMESTAMP WITHOUT TIME ZONE
NumericSMALLINT, INTEGER, INT2, INT4, INT8, BIGINT, DECIMAL, NUMERIC, REAL, DOUBLE PRECISION, FLOAT4, FLOAT8
StringVARCHAR, CHAR, CHARACTER, CHARACTER VARYING, TEXT
Unsupported(none)

Databricks

Data Type ClassSQL Types
BoolBOOLEAN
BytesBINARY
DateTimeDATE, TIMESTAMP
NumericBYTE, SHORT, INT, BIGINT, LONG, FLOAT, DOUBLE, DECIMAL
StringSTRING
Unsupported(none)

Oracle OCI

Data Type ClassSQL Types
Bool(none)
BytesRAW, LONG RAW, BLOB
DateTimeDATE, TIMESTAMP, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH LOCAL TIME ZONE, INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND
NumericNUMBER, FLOAT, BINARY_FLOAT, BINARY_DOUBLE
StringCHAR, NCHAR, VARCHAR2, NVARCHAR2, LONG
UnsupportedCLOB, NCLOB

MongoDB

Data Type ClassSQL Types
BoolBIT
BytesVARBINARY
DateTimeTIMESTAMP
NumericINT, BIGINT, FLOAT, DECIMAL
StringVARCHAR, BSON:OBJECTID
UnsupportedBSON:ARRAY, BSON:MINKEY, BSON:MAXKEY, BSON:REGEX, BSON:NULL, BSON:CODEWSCOPE, BSON:SYMBOL

CSV

Data Type ClassSQL Types
BoolBIT
Bytes(none)
DateTimeDATE, TIME, TIMESTAMP
NumericINT, FLOAT, BIGINT, DOUBLE, DECIMAL
StringVARCHAR
Unsupported(none)

JSON

Data Type ClassSQL Types
BoolBIT
Bytes(none)
DateTimeDATE, TIME, TIMESTAMP
NumericINT, FLOAT, BIGINT, DOUBLE, DECIMAL
StringVARCHAR
Unsupported(none)

EXCEL

Data Type ClassSQL Types
BoolBIT
Bytes(none)
DateTimeDATE, TIME, TIMESTAMP
NumericINT, FLOAT, BIGINT, DOUBLE, DECIMAL
StringVARCHAR
Unsupported(none)

Parquet

Data Type ClassSQL Types
BoolBIT
Bytes(none)
DateTimeDATE, TIME, TIMESTAMP
NumericINT, FLOAT, BIGINT, DOUBLE, DECIMAL
StringVARCHAR
Unsupported(none)