This document describes behavior shared by
processing::reduce, processing::multi, and
Polars-backed pipeline::DataFrame helpers
(reduce, group_by,
feature_wise_mean_std).

- ReduceOp::Count: counts rows in the dataset (includes nulls in that column).
- CountNotNull / non-null counts: count only non-null cells.
- CountDistinctNonNull: distinct values among non-null cells only (null is not a distinct category).

- Value::Null (not 0).
- 0, std dev is 0.
- Value::Null (undefined).
- Value::Null.

group_by note: some engines
return 0 for
sum over an all-null group; mean /
std in Polars typically stay null for all-null
groups. Prefer mean / std when you need to distinguish “no data” from
“zero total”.

In-memory numeric reductions accumulate in f64 (Welford for
variance). Very large Int64 values
converted to f64 can differ slightly from Polars at the ULP
level; integration tests allow a small relative
tolerance in those cases.

Casting (CastMode): cast /
cast_with_mode control strict vs lossy casts into
our logical types; feature_wise_mean_std
and scalar DataFrame::reduce use
strict cast to Polars Float64 for numeric
stats, consistent with explicit
CastMode::Strict behavior for those
expressions.

pipeline::Agg supports per-group aggregation.

ReduceOp::Median (in-memory) and
Agg::Median (group-by): nulls ignored;
result is always Value::Float64. For an even
element count, the result is the average of the two middle order statistics (aligned with
typical SQL / Polars median behavior).

Combine multiple Agg variants in one
DataFrame::group_by call for feature summaries keyed by
categorical columns.
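The median rule described above (nulls ignored, floating-point result, two middle values averaged for an even count) can be sketched in plain Rust. This is an illustration only, not the crate's implementation: `median_ignoring_nulls` is a hypothetical helper, `Option<i64>` stands in for a nullable Int64 column, and `Option<f64>` stands in for `Value::Float64` / `Value::Null`. Returning `None` for an empty or all-null input is an assumption made for this sketch.

```rust
/// Hypothetical helper (not part of the crate's API): median over a nullable
/// integer column. `None` plays the role of a null cell.
fn median_ignoring_nulls(values: &[Option<i64>]) -> Option<f64> {
    // Drop nulls and widen to f64 so the result type is always floating point.
    let mut present: Vec<f64> = values.iter().flatten().map(|&v| v as f64).collect();
    if present.is_empty() {
        // Assumption for this sketch: no non-null values means no median.
        return None;
    }
    // total_cmp is a total order on f64, so sorting is well defined.
    present.sort_by(|a, b| a.total_cmp(b));
    let mid = present.len() / 2;
    if present.len() % 2 == 1 {
        // Odd count: the single middle order statistic.
        Some(present[mid])
    } else {
        // Even count: average of the two middle order statistics.
        Some((present[mid - 1] + present[mid]) / 2.0)
    }
}
```

Under these assumptions, the four non-null values 4, 1, 2, 3 give 2.5: the result is a float even though every input is an integer, and any null cells mixed in do not change it.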

- feature_wise_mean_std: one scan over rows; all listed columns must be Int64 or Float64.
- arg_max_row / arg_min_row: first row index on ties.
- top_k_by_frequency: non-null value counts, sorted by count desc then value key (stable tie-break).

Copy-pastable Rust snippets (filter/map/reduce, mean/variance/std,
DataFrame::reduce, feature_wise_mean_std, arg max/min, top-k, group_by with Agg) live
in:
- API.md: section Processing pipelines (Epic 1 / Story 1.2)
- README.md: Processing pipelines, Cookbook → group-by, and the ML-oriented subsections under processing
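
The tie-breaking rules for arg_max_row and top_k_by_frequency described earlier can be illustrated with a small standalone sketch. The functions `arg_max_first` and `top_k_sketch` are hypothetical stand-ins written for this note, not the crate's API. Skipping null cells in the arg-max scan and ordering tied keys ascending are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an arg-max over a nullable column: returns the
/// index of the largest non-null value, keeping the first index on ties.
fn arg_max_first(values: &[Option<i64>]) -> Option<usize> {
    let mut best: Option<(usize, i64)> = None;
    for (idx, value) in values.iter().enumerate() {
        // Assumption for this sketch: null cells are skipped.
        if let Some(v) = *value {
            // Strictly greater only, so a later equal value never wins.
            let replace = match best {
                None => true,
                Some((_, current)) => v > current,
            };
            if replace {
                best = Some((idx, v));
            }
        }
    }
    best.map(|(idx, _)| idx)
}

/// Hypothetical stand-in for a top-k frequency count: counts non-null values,
/// then sorts by count descending and by key ascending on equal counts.
fn top_k_sketch(values: &[Option<&str>], k: usize) -> Vec<(String, usize)> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for v in values.iter().flatten() {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(key, n)| (key.to_string(), n))
        .collect();
    // The secondary key comparison makes the order deterministic for ties.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(k);
    ranked
}
```

With the column 2, 7, null, 7, `arg_max_first` returns index 1 rather than 3. With the values b, a, null, b, a, c and k = 2, `top_k_sketch` returns a and b (two occurrences each), in that order, and the null is not counted.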