Version: devel



def get_column_type_from_py_arrow(dtype: pyarrow.DataType) -> TColumnType


Returns (data_type, precision, scale) tuple from pyarrow.DataType


def remove_null_columns(item: TAnyArrowItem) -> TAnyArrowItem


Remove all columns of datatype pyarrow.null() from the table or record batch


def remove_columns(item: TAnyArrowItem,
columns: Sequence[str]) -> TAnyArrowItem


Remove columns from Arrow item


def append_column(item: TAnyArrowItem, name: str, data: Any) -> TAnyArrowItem


Appends new column to Table or RecordBatch


def rename_columns(item: TAnyArrowItem,
new_column_names: Sequence[str]) -> TAnyArrowItem


Renames arrow columns on a Table or RecordBatch. Returns the same data with a renamed schema.


def normalize_py_arrow_item(item: TAnyArrowItem,
columns: TTableSchemaColumns,
naming: NamingConvention,
caps: DestinationCapabilitiesContext,
load_id: Optional[str] = None) -> TAnyArrowItem


Normalize arrow item schema according to the columns.

  1. arrow schema field names are normalized according to naming
  2. arrow columns are reordered according to columns
  3. missing columns are inserted as empty columns; their types are generated using caps
  4. arrow columns whose nullability differs from the corresponding schema columns are updated
  5. a _dlt_load_id column is added if it is missing and load_id is provided


def get_normalized_arrow_fields_mapping(schema: pyarrow.Schema,
naming: NamingConvention) -> StrStr


Normalizes schema field names and returns mapping from original to normalized name. Raises on name collisions
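The idea of mapping original names to normalized names while rejecting collisions can be shown without any Arrow dependency. In this sketch both `to_snake_case` and `normalized_names_mapping` are hypothetical stand-ins: the real function takes a `pyarrow.Schema` and a `NamingConvention`, and the exception it raises is not specified here.

```python
import re
from typing import Callable, Dict, Sequence


def to_snake_case(name: str) -> str:
    """Hypothetical normalizer standing in for a naming convention."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camelCase
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)  # replace other separators
    return name.strip("_").lower()


def normalized_names_mapping(
    names: Sequence[str], normalize: Callable[[str], str]
) -> Dict[str, str]:
    """Map each original name to its normalized form, failing on collisions."""
    mapping: Dict[str, str] = {}
    owners: Dict[str, str] = {}  # normalized name -> original name that claimed it
    for name in names:
        normalized = normalize(name)
        if normalized in owners:
            raise ValueError(
                f"{name!r} and {owners[normalized]!r} both normalize to {normalized!r}"
            )
        owners[normalized] = name
        mapping[name] = normalized
    return mapping


mapping = normalized_names_mapping(["UserId", "created at"], to_snake_case)

try:
    normalized_names_mapping(["userId", "user_id"], to_snake_case)
    collision_detected = False
except ValueError:
    collision_detected = True
```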


def py_arrow_to_table_schema_columns(
schema: pyarrow.Schema) -> TTableSchemaColumns


Convert a PyArrow schema to a table schema columns dict.


  • schema pyarrow.Schema - pyarrow schema


  • Returns: TTableSchemaColumns - table schema columns


def get_parquet_metadata(
parquet_file: TFileOrPath) -> Tuple[int, pyarrow.Schema]


Gets parquet file metadata (including row count and schema)


  • parquet_file TFileOrPath - path to the parquet file, or a file object


  • Returns: Tuple[int, pyarrow.Schema] - row count and Arrow schema of the file


def to_arrow_scalar(value: Any, arrow_type: pyarrow.DataType) -> Any


Converts a Python value to a version that arrow compute functions can work with


def from_arrow_scalar(arrow_value: pyarrow.Scalar) -> Any


Converts an arrow scalar into a Python value. Currently attaches the UTC timezone to naive datetimes and converts timezone-aware datetimes to UTC.
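The timezone rule described above can be illustrated with the standard library alone. The helper `ensure_utc` is hypothetical and covers only the datetime handling; extracting the Python value from the Arrow scalar (for example with `pyarrow.Scalar.as_py()`) is assumed to have happened already.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def ensure_utc(value: Any) -> Any:
    """Hypothetical helper: make datetimes UTC-aware, pass other values through."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # naive datetime: interpret the wall-clock time as UTC
            return value.replace(tzinfo=timezone.utc)
        # aware datetime: shift to the equivalent instant in UTC
        return value.astimezone(timezone.utc)
    return value


naive = datetime(2024, 1, 1, 12, 0)
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))

naive_as_utc = ensure_utc(naive)
aware_as_utc = ensure_utc(aware)
```

Note that a naive value keeps its clock time (12:00 stays 12:00), while an aware value keeps its instant (12:00 at UTC+2 becomes 10:00 UTC).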


TNewColumns: sequence of tuples (field index, field, generating function)


def add_constant_column(item: TAnyArrowItem,
name: str,
data_type: pyarrow.DataType,
value: Any = None,
nullable: bool = True,
index: int = -1) -> TAnyArrowItem


Add column with a single value to the table.


  • item - Arrow table or record batch
  • name - The new column name
  • data_type - The data type of the new column
  • nullable - Whether the new column is nullable
  • value - The value to fill the new column with
  • index - The index at which to insert the new column. Defaults to -1 (append)


def pq_stream_with_new_columns(
parquet_file: TFileOrPath,
columns: TNewColumns,
row_groups_per_read: int = 1) -> Iterator[pyarrow.Table]


Add column(s) to the table in batches.

The parquet file is read row_groups_per_read row groups at a time.


  • parquet_file - path or file object to parquet file
  • columns - list of columns to add, each in the form (insertion index, pyarrow.Field, column_value_callback). The callback should accept a pyarrow.Table and return an array of values for the column.
  • row_groups_per_read - number of row groups to read at a time. Defaults to 1.


Yields: pyarrow.Table objects with the new columns added.


def cast_arrow_schema_types(
schema: pyarrow.Schema, type_map: Dict[Callable[[pyarrow.DataType], bool],
Callable[..., pyarrow.DataType]]
) -> pyarrow.Schema


Returns type-casted Arrow schema.

Replaces data types for fields matching a type check in type_map. Type check functions in type_map are assumed to be mutually exclusive, i.e. a data type does not match more than one type check function.
