Hive-style partition path discovery, plus helpers to resolve glob patterns or explicit file lists. Single-process only (no distributed coordinator).
§Hive-style layout rules
A common batch layout (e.g. Apache Hive / Spark) stores files under directories whose names are
key=value pairs, for example:
warehouse/my_table/dt=2024-01-01/region=us/part-00000.csv

Rules used here:
- Discovery starts at a root directory you provide.
- For each file under that root, the parent path relative to root is split into path components. Every directory component must match key=value where both sides are non-empty (split on the first =). The filename itself is not a partition segment.
- A file placed directly under root (no partition directories) has an empty partition prefix.
- If any directory component is not of the form key=value, that file is skipped (not returned). This avoids mis-classifying folders like staging/ or _temporary/.
- This crate does not validate that partition keys match your schema; callers may attach PartitionSegment values as extra columns after ingest in a later pipeline step.
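
The parsing rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: the names Segment, parse_segment, and segments_for_parent are hypothetical stand-ins, since the real signatures of parse_partition_segment and hive_segments_for_relative_parent are not shown on this page. The handling of non-UTF-8 components (treated as "not key=value") is also an assumption.

```rust
use std::path::{Component, Path};

/// Hypothetical stand-in for the crate's partition segment type;
/// the real field names may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub key: String,
    pub value: String,
}

/// Parse one path component as `key=value`, splitting on the first `=`.
/// Returns `None` when there is no `=` or when either side is empty.
pub fn parse_segment(component: &str) -> Option<Segment> {
    let (key, value) = component.split_once('=')?;
    if key.is_empty() || value.is_empty() {
        return None;
    }
    Some(Segment {
        key: key.to_string(),
        value: value.to_string(),
    })
}

/// Parse every directory component of a parent path given relative to root.
/// Returns `None` if any component is not `key=value`, which corresponds to
/// the "file is skipped" rule. An empty path yields an empty segment list.
pub fn segments_for_parent(relative_parent: &Path) -> Option<Vec<Segment>> {
    let mut segments = Vec::new();
    for component in relative_parent.components() {
        match component {
            // Assumption: a component that is not valid UTF-8 is rejected.
            Component::Normal(name) => segments.push(parse_segment(name.to_str()?)?),
            Component::CurDir => continue,
            // Root, prefix, or `..` components are not expected in a relative parent.
            _ => return None,
        }
    }
    Some(segments)
}
```

Because the split happens on the first =, a component such as a=b=c parses as key a with value b=c, while =x and k= are rejected.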
§Glob and explicit lists
- paths_from_glob expands a filesystem glob (e.g. data/**/*.parquet) to existing files.
- paths_from_explicit_list checks that each path exists and is a file, then returns them in order (deduplicated while preserving first occurrence).
- paths_from_directory_scan walks a directory tree and returns matching files in sorted path order (see Deterministic ordering below).
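
The explicit-list behaviour (validate each path, then deduplicate while keeping the first occurrence) might look roughly like the sketch below. The function names, the use of io::Error for failures, and comparing paths exactly as given (without canonicalizing) are all assumptions made for illustration; the crate's actual error type and equality rule are not documented here.

```rust
use std::collections::HashSet;
use std::io;
use std::path::PathBuf;

/// Hypothetical helper: keep the first occurrence of each path, in caller order.
/// Assumption: paths are compared as written, with no canonicalization.
pub fn dedup_preserving_order(paths: &[PathBuf]) -> Vec<PathBuf> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for path in paths {
        // `insert` returns true only the first time a value is added.
        if seen.insert(path.clone()) {
            out.push(path.clone());
        }
    }
    out
}

/// Hypothetical helper: every path must exist and be a regular file.
/// Returns the deduplicated list in caller order, or the first error found.
pub fn validate_explicit_list(paths: &[PathBuf]) -> io::Result<Vec<PathBuf>> {
    for path in paths {
        // `metadata` fails if the path does not exist or cannot be read.
        let meta = std::fs::metadata(path)?;
        if !meta.is_file() {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("not a regular file: {}", path.display()),
            ));
        }
    }
    Ok(dedup_preserving_order(paths))
}
```

A HashSet tracks membership while a separate Vec carries the order, so the caller's sequence survives deduplication.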
§Deterministic ordering (incremental batches)
For repeatable pipelines, these helpers define a stable sequence:
- paths_from_directory_scan, paths_from_glob, and discover_hive_partitioned_files sort results by PathBuf (lexicographic / component-wise per the standard library).
- paths_from_explicit_list preserves caller order (deduplicates while keeping first occurrence).
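
Component-wise ordering is not always the same as comparing the paths as plain strings, which matters if a downstream step re-sorts by string. The short sketch below shows the standard library behaviour; sorted_paths is a hypothetical helper name, not part of this crate.

```rust
use std::path::PathBuf;

/// Sort paths using `PathBuf`'s `Ord` implementation, which compares
/// one path component at a time rather than the raw string.
pub fn sorted_paths(mut paths: Vec<PathBuf>) -> Vec<PathBuf> {
    paths.sort();
    paths
}

/// Returns true when component-wise order and plain string order disagree
/// for the two given relative paths.
pub fn orders_disagree(a: &str, b: &str) -> bool {
    let by_path = PathBuf::from(a).cmp(&PathBuf::from(b));
    let by_str = a.cmp(b);
    by_path != by_str
}
```

For example, a/b sorts before a-b as a path (the first component a is a prefix of a-b), but after it as a string, because the byte for / is greater than the byte for -.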
§Structs
- PartitionSegment: One hive-style directory segment key=value.
- PartitionedFile: A data file discovered under a hive-style tree, with parsed partition segments.
§Functions
- discover_hive_partitioned_files: Discover files under root whose parent path (relative to root) consists only of hive-style key=value directory segments.
- hive_segments_for_relative_parent: Parse every directory component of relative_parent as hive segments.
- parse_partition_segment: Parse a single path component as key=value.
- paths_from_directory_scan: Recursively list files under root, optionally filtered by a glob on the path relative to root, then sort for deterministic ordering.
- paths_from_explicit_list: Validate and return an explicit list of file paths (must each exist and be a file).
- paths_from_glob: Expand a filesystem glob pattern and return existing regular files, sorted by path.
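
To show how discovery, partition parsing, and sorting fit together, here is a minimal end-to-end sketch using only the standard library. It is not the crate's implementation of discover_hive_partitioned_files: the names Discovered, parse_parent, walk, and discover are hypothetical, segments are modelled as plain (key, value) tuples, and skipping symlinks is an assumption of this sketch.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical result type: a file plus its parsed `(key, value)` pairs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Discovered {
    pub path: PathBuf,
    pub segments: Vec<(String, String)>,
}

/// Parse each component of the relative parent as `key=value`
/// (split on the first `=`, both sides non-empty). `None` means "skip".
fn parse_parent(relative_parent: &Path) -> Option<Vec<(String, String)>> {
    relative_parent
        .iter()
        .map(|component| {
            let (key, value) = component.to_str()?.split_once('=')?;
            if key.is_empty() || value.is_empty() {
                None
            } else {
                Some((key.to_string(), value.to_string()))
            }
        })
        .collect()
}

/// Recursively collect regular files under `dir`.
/// Assumption: symlinks are ignored, because `DirEntry::file_type`
/// does not follow them.
fn walk(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            walk(&entry.path(), out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}

/// Walk `root`, sort by path for a stable sequence, and keep only files
/// whose relative parent consists entirely of `key=value` directories.
pub fn discover(root: &Path) -> io::Result<Vec<Discovered>> {
    let mut files = Vec::new();
    walk(root, &mut files)?;
    files.sort();

    let mut found = Vec::new();
    for path in files {
        let parent = path.parent().unwrap_or(root);
        let relative = match parent.strip_prefix(root) {
            Ok(relative) => relative,
            Err(_) => continue,
        };
        if let Some(segments) = parse_parent(relative) {
            found.push(Discovered { path, segments });
        }
    }
    Ok(found)
}
```

With this sketch, a file directly under root is returned with an empty segment list, and anything below a directory such as staging/ is left out of the result.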