DataSet (limits and handoff)

rust-data-processing build (see Planning/ADR_P2_E2_LAKE_TABLE_READ.md).

Use Spark, Databricks, Python deltalake, or Trino to COPY / write a Parquet directory or single file, then:

use rust_data_processing::ingestion::{ingest_from_path, IngestionOptions};
// ingest_from_path(..., IngestionOptions::default()) for .parquet

RecordBatch handoff (Rust, --features arrow)

Read batches with your tool of choice, then:

use rust_data_processing::transform::arrow::record_batches_to_dataset;
use rust_data_processing::types::Schema;
// schema must match the logical columns you need (Int64, Float64, Bool, Utf8).
let ds = record_batches_to_dataset(&[batch1, batch2], &schema)?;

See rustdoc on record_batches_to_dataset for schema alignment rules.
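
Because the schema has to match the logical columns you need, it can help to compare the expected columns against what the batches actually carry before the handoff. The following is a minimal sketch using only the standard library. `LogicalType`, `Column`, `SchemaMismatch`, and `check_columns` are hypothetical names introduced for illustration; they are not part of rust_data_processing, and the authoritative alignment rules remain those in the rustdoc. The sketch assumes that columns are matched by name and that extra columns in the batch are acceptable.

```rust
/// The four logical types named in the text. A local stand-in for
/// illustration, not the crate's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogicalType {
    Int64,
    Float64,
    Bool,
    Utf8,
}

/// One column described by name and logical type (hypothetical helper).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Column {
    pub name: String,
    pub ty: LogicalType,
}

impl Column {
    pub fn new(name: &str, ty: LogicalType) -> Self {
        Column { name: name.to_string(), ty }
    }
}

/// A single disagreement between the expected and actual columns.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SchemaMismatch {
    MissingColumn(String),
    TypeMismatch {
        column: String,
        expected: LogicalType,
        found: LogicalType,
    },
}

/// Checks that every expected column is present in `actual` with the same
/// logical type. Columns are matched by name; extra columns in `actual`
/// are ignored (an assumption of this sketch). All problems are collected
/// in the order of `expected` rather than stopping at the first one.
pub fn check_columns(
    expected: &[Column],
    actual: &[Column],
) -> Result<(), Vec<SchemaMismatch>> {
    let mut problems = Vec::new();
    for want in expected {
        match actual.iter().find(|c| c.name == want.name) {
            None => problems.push(SchemaMismatch::MissingColumn(want.name.clone())),
            Some(got) if got.ty != want.ty => problems.push(SchemaMismatch::TypeMismatch {
                column: want.name.clone(),
                expected: want.ty,
                found: got.ty,
            }),
            Some(_) => {}
        }
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

In practice you would build the `actual` list from the Arrow schema of your batches and run the check before calling record_batches_to_dataset, so that a mismatch is reported with column names rather than surfacing later as a conversion error.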

Use deltalake or pyiceberg to scan a table to PyArrow, write Parquet, then call rust_data_processing.ingest_from_path on that Parquet path; or serialize batches and use project-specific glue if you need in-process Arrow.

Use a cluster engine for large tables, ACID maintenance, ZORDER, liquid clustering, Iceberg branching, or governance features. Use this library for local QA, transforms, validation, and smaller extracts you land as files.
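
For extracts landed as files, an engine may write either a single Parquet file or a directory of part files. The sketch below, using only the standard library, lists the `.parquet` files under such a path so each one can be handed to ingestion. `parquet_files` is a hypothetical helper, not part of rust_data_processing, and this section does not say whether ingest_from_path accepts a directory directly, so treat the per-file listing as an assumption about how you might wire it up.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True when the path has a `.parquet` extension (case-insensitive).
fn is_parquet(p: &Path) -> bool {
    p.extension()
        .and_then(|e| e.to_str())
        .map_or(false, |e| e.eq_ignore_ascii_case("parquet"))
}

/// Returns the Parquet files to hand off: the path itself when it is a
/// single `.parquet` file, or the direct children of a directory that end
/// in `.parquet`, sorted by path. Marker and checksum files such as
/// `_SUCCESS` or `*.crc` are skipped because they lack the extension.
/// Subdirectories are not descended into.
pub fn parquet_files(path: &Path) -> io::Result<Vec<PathBuf>> {
    if path.is_file() {
        return Ok(if is_parquet(path) {
            vec![path.to_path_buf()]
        } else {
            Vec::new()
        });
    }

    let mut files = Vec::new();
    for entry in fs::read_dir(path)? {
        let p = entry?.path();
        if p.is_file() && is_parquet(&p) {
            files.push(p);
        }
    }
    // read_dir gives no ordering guarantee, so sort for stable results.
    files.sort();
    Ok(files)
}
```

Sorting keeps the order of part files deterministic across runs, which makes local QA results easier to compare.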