Module partition

Hive-style partition path discovery, plus helpers to resolve glob patterns or explicit file lists. Single-process only (no distributed coordinator).

§Hive-style layout rules

A common batch layout (e.g. Apache Hive / Spark) stores files under directories whose names are key=value pairs, for example:

warehouse/my_table/dt=2024-01-01/region=us/part-00000.csv

Rules used here:

  • Discovery starts at a root directory you provide.
  • For each file under that root, the parent path relative to root is split into path components. Every directory component must match key=value where both sides are non-empty (split on the first =). The filename itself is not a partition segment.
  • A file placed directly under root (no partition directories) has an empty partition prefix.
  • If any directory component is not of the form key=value, that file is skipped (not returned). This avoids mis-classifying folders like staging/ or _temporary/.
  • This crate does not validate that partition keys match your schema; callers may attach PartitionSegment values as extra columns in a later pipeline step, after ingest.
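
The segment rules above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the `Segment` type and the functions `parse_segment` and `segments_for_relative_parent` are hypothetical stand-ins, and the real signatures of parse_partition_segment and hive_segments_for_relative_parent may differ (for example, they may return a `Result` rather than an `Option`).

```rust
use std::path::{Component, Path};

/// Hypothetical stand-in for the crate's `PartitionSegment`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub key: String,
    pub value: String,
}

/// Split one directory name on the first `=`.
/// Both sides must be non-empty, otherwise the component is rejected.
pub fn parse_segment(component: &str) -> Option<Segment> {
    let (key, value) = component.split_once('=')?;
    if key.is_empty() || value.is_empty() {
        return None;
    }
    Some(Segment {
        key: key.to_string(),
        value: value.to_string(),
    })
}

/// Parse every directory component of a parent path given relative to root.
/// Returns `None` as soon as one component is not `key=value`, which
/// corresponds to the "skip this file" rule. An empty path yields an empty
/// list (a file placed directly under root).
pub fn segments_for_relative_parent(relative_parent: &Path) -> Option<Vec<Segment>> {
    let mut segments = Vec::new();
    for component in relative_parent.components() {
        match component {
            // Ordinary directory name: must parse as key=value (and be valid UTF-8).
            Component::Normal(name) => segments.push(parse_segment(name.to_str()?)?),
            // A leading `.` carries no partition information.
            Component::CurDir => continue,
            // `..`, a root, or a prefix means the path is not relative to root.
            _ => return None,
        }
    }
    Some(segments)
}
```

Under these assumptions, `dt=2024-01-01/region=us` yields two segments, while `dt=2024-01-01/_temporary` yields `None`, so a file under it would be skipped. Because the split happens on the first `=`, a component such as `a=b=c` is read as key `a` with value `b=c`.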

§Glob and explicit lists

  • paths_from_glob expands a filesystem glob (e.g. data/**/*.parquet) to existing files.
  • paths_from_explicit_list checks that each path exists and is a file, then returns them in order (deduplicated while preserving first occurrence).
  • paths_from_directory_scan walks a directory tree and returns matching files in sorted path order (see Deterministic ordering below).
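
The two ordering behaviours described above (input order with first-occurrence deduplication, and sorted path order) can be sketched as follows. The function names here are hypothetical and are not the crate's API; in particular, the exact comparison used for "sorted path order" is an assumption (this sketch relies on the `Ord` implementation of `PathBuf`, which compares paths component by component).

```rust
use std::collections::HashSet;
use std::io;
use std::path::{Path, PathBuf};

/// Check that every path exists and is a regular file (hypothetical helper).
pub fn require_existing_files(paths: &[PathBuf]) -> io::Result<()> {
    for p in paths {
        if !p.is_file() {
            return Err(io::Error::new(
                io::ErrorKind::NotFound,
                format!("not an existing file: {}", p.display()),
            ));
        }
    }
    Ok(())
}

/// Drop repeated paths while keeping the first occurrence of each,
/// so the caller's ordering is preserved.
pub fn dedup_first_occurrence(paths: &[PathBuf]) -> Vec<PathBuf> {
    let mut seen: HashSet<&Path> = HashSet::new();
    let mut out = Vec::with_capacity(paths.len());
    for p in paths {
        // `insert` returns false when the path was already seen.
        if seen.insert(p.as_path()) {
            out.push(p.clone());
        }
    }
    out
}

/// Sort paths so that repeated scans of the same tree yield the same sequence.
/// Assumption: component-wise `PathBuf` ordering is acceptable for this purpose.
pub fn sorted_paths(mut paths: Vec<PathBuf>) -> Vec<PathBuf> {
    paths.sort();
    paths
}
```

Note that this comparison is purely lexical: two different spellings of the same file (for example, with and without a leading `./`) are treated as distinct by the deduplication step unless the caller normalizes paths first.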

§Deterministic ordering (incremental batches)

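For repeatable pipelines, these helpers define a stable sequence:

  • paths_from_glob returns existing regular files sorted by path.
  • paths_from_directory_scan returns matching files in sorted path order.
  • paths_from_explicit_list keeps the order you supply, deduplicated while preserving first occurrence.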

Structs§

PartitionSegment
One hive-style directory segment key=value.
PartitionedFile
A data file discovered under a hive-style tree, with parsed partition segments.

Functions§

discover_hive_partitioned_files
Discover files under root whose parent path (relative to root) consists only of hive-style key=value directory segments.
hive_segments_for_relative_parent
Parse every directory component of relative_parent as hive segments.
parse_partition_segment
Parse a single path component as key=value.
paths_from_directory_scan
Recursively list files under root, optionally filtered by a glob on the path relative to root, then sort for deterministic ordering.
paths_from_explicit_list
Validate and return an explicit list of file paths (must each exist and be a file).
paths_from_glob
Expand a filesystem glob pattern and return existing regular files, sorted by path.