Delta / Iceberg → `DataSet` (limits and handoff)

What this library does not ship (Phase 2)

1. Export to Parquet (simplest)

Use Spark, Databricks, the Python deltalake package, or Trino to export the table as a Parquet directory or a single Parquet file, then ingest it:

use rust_data_processing::ingestion::{ingest_from_path, IngestionOptions};
// ingest_from_path(..., IngestionOptions::default()) for .parquet
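
When the export is a directory rather than a single file, Spark-style writers typically leave several part files next to marker files such as `_SUCCESS` and `.crc` checksums. A minimal sketch, using only the standard library, of listing just the Parquet part files before ingestion. `parquet_part_files` is a hypothetical helper, not part of rust_data_processing, and this document does not state whether ingest_from_path accepts a directory directly.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of rust_data_processing): list the
/// `.parquet` files directly inside `dir`, sorted by path.
fn parquet_part_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut parts = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Skip marker and checksum files such as `_SUCCESS` or `*.crc`.
        let is_parquet = path.extension().map_or(false, |ext| ext == "parquet");
        if path.is_file() && is_parquet {
            parts.push(path);
        }
    }
    // Sort so that repeated runs visit the part files in the same order.
    parts.sort();
    Ok(parts)
}
```

Each returned path could then be handed to ingest_from_path in turn. The helper does not recurse, so partitioned layouts with nested directories would need a directory walk instead.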

2. Arrow RecordBatch handoff (Rust, --features arrow)

Read the table into Arrow RecordBatch values with your tool of choice, then:

use rust_data_processing::transform::arrow::record_batches_to_dataset;
use rust_data_processing::types::Schema;
// schema must match the logical columns you need (Int64, Float64, Bool, Utf8).
let ds = record_batches_to_dataset(&[batch1, batch2], &schema)?;

See rustdoc on record_batches_to_dataset for schema alignment rules.
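
Before the handoff it can help to confirm that the batches actually carry the columns you intend to request. The sketch below is one conservative pre-check under stated assumptions, not the library's alignment rules: `LogicalType`, `from_arrow_name`, and `alignment_problems` are hypothetical names, batch fields are represented as plain (name, Arrow type name) pairs, and extra batch columns are assumed to be ignorable.

```rust
/// Hypothetical stand-in for the four logical types named above.
/// The real schema types live in rust_data_processing::types and may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogicalType {
    Int64,
    Float64,
    Bool,
    Utf8,
}

/// Map an Arrow data type name to a logical type, or None if unsupported.
fn from_arrow_name(name: &str) -> Option<LogicalType> {
    match name {
        "Int64" => Some(LogicalType::Int64),
        "Float64" => Some(LogicalType::Float64),
        "Boolean" => Some(LogicalType::Bool),
        "Utf8" => Some(LogicalType::Utf8),
        _ => None,
    }
}

/// Return one message per wanted column that is missing, unsupported, or of
/// a different type. An empty vector means every wanted column is covered.
fn alignment_problems(
    wanted: &[(&str, LogicalType)],
    batch_fields: &[(&str, &str)],
) -> Vec<String> {
    let mut problems = Vec::new();
    for (name, want) in wanted {
        match batch_fields.iter().find(|(n, _)| n == name) {
            None => problems.push(format!("missing column `{name}`")),
            Some((_, arrow_ty)) => match from_arrow_name(arrow_ty) {
                None => problems.push(format!(
                    "column `{name}` has unsupported Arrow type {arrow_ty}"
                )),
                Some(got) if got != *want => problems.push(format!(
                    "column `{name}`: wanted {want:?}, batch has {got:?}"
                )),
                Some(_) => {}
            },
        }
    }
    problems
}
```

Collecting every problem instead of stopping at the first one makes it easier to fix an export query in a single pass. The rustdoc remains the authority on what record_batches_to_dataset accepts.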

3. Python

Use deltalake or pyiceberg to scan a table into PyArrow, write it out as Parquet, then call rust_data_processing.ingest_from_path on that Parquet path. If you need in-process Arrow instead, serialize the batches and pass them through project-specific glue.

When to use Spark / Databricks

Use a cluster engine for large tables, ACID maintenance, ZORDER, liquid clustering, Iceberg branching, or governance features. Use this library for local QA, transforms, validation, and smaller extracts you land as files.