This document explains where credentials live when
rust-data-processing or rdp_jvm_sys reads and
writes s3://, gs://, abfss://,
and related URIs. For per-connector URLs and copy-paste examples, see CONNECTORS.md.
Fake values below are placeholders only.
Open this guide as the file docs/CLOUD_AUTH.md; it is a single markdown file, not a folder. If the editor says “is a directory”, you clicked a broken #fragment link. Use the dedicated files instead: AMAZON_S3.md · AZURE_ADLS.md · SNOWFLAKE.md.
Rust performs cloud I/O. Python and Java are thin wrappers: they pass URIs and pipeline JSON across FFI; they do not pass access tokens, Azure AD secrets, or AWS keys in that JSON.
Credentials are resolved by the object_store
crate inside the process that loaded the native library
(rdp_jvm_sys .so / .dylib, Python
extension, or Rust binary). Those credentials come from the
operating-system environment of that process, not from
Java APIs. Values your application code sees through System.getenv
count only if your launcher actually exported them into the
process first.
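The inheritance rule can be observed with nothing but the Python standard library. The sketch below is an illustration only and is not part of rust-data-processing: it starts a child interpreter with an explicit environment and reports what the child actually sees. The helper name read_var_in_child and the placeholder values are hypothetical.

```python
import os
import subprocess
import sys


def read_var_in_child(name, extra_env):
    """Start a child Python process and return the value it sees for `name`.

    The child receives a copy of this process's environment plus `extra_env`,
    which mirrors what a shell `export` or `docker run -e` does for the
    process that later loads a native library.
    """
    child_env = dict(os.environ)
    child_env.update(extra_env)
    # The child prints the variable, or "<unset>" when it is missing.
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder value only; never hard-code real keys.
    print(read_var_in_child("AWS_REGION", {"AWS_REGION": "us-east-1"}))
    print(read_var_in_child("RDP_DEMO_SURELY_UNSET_12345", {}))
```

A variable that the launcher never exported is simply absent in the child, which is the situation native code faces when credentials exist only in application-level configuration.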
```text
Java / Python              Rust (same process)
─────────────              ─────────────────
pipeline JSON   ──FFI──►   parse URI → object_store::parse_url_opts
(location only)                    │
                                   ▼
                           credential chain from env / MSI / IAM / keys
                                   │
                                   ▼
                           GET/PUT to s3:// / abfss:// / gs://
```
Implementation entry points:

- src/ingestion/object_store.rs: ingest_from_object_store_uri, export_dataset_to_object_store_uri
- src/ingestion/delta_lake.rs: delta_table_uri, write_dataset_to_delta_table (Parquet under the table path)
- bindings/jvm-sys/src/pipeline_run.rs: sources.object_store_uris, sinks databricks, object_store, snowflake stage URIs

parse_url_opts is called with no credential map from callers, only the URL string.
Names like AWS_ACCESS_KEY_ID,
AZURE_CLIENT_SECRET, and
GOOGLE_APPLICATION_CREDENTIALS are standard OS /
process environment variables. They are not a
special “Java environment” or JVM system property namespace.
| What people sometimes assume | What actually happens |
|---|---|
| Set vars in Java code with System.setProperty | Does not work for object_store: Rust reads the process env block your OS provides at startup |
| Configure only in IDE “Environment” for a Java main | Works only if that IDE/runner exports those vars into the process before rdp_jvm_sys loads (same as any native library) |
| Put secrets in application.properties | Ignored by Rust I/O unless your launcher copies them into real env vars before calling FFI |
Who must see the variables: the single OS
process that loads rdp_jvm_sys
(e.g. java …, python …, or a Rust binary).
When Java calls native code, Rust runs in the same
process as the JVM — so Docker/K8s env injection for the
container (or pod) is what matters.
```bash
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
java -jar your-etl.jar   # JVM inherits the shell’s environment
```

Inject at container level, not inside Java source:

```dockerfile
# Prefer runtime injection (secrets manager, --env-file) over baking secrets into the image
ENV AWS_REGION=us-east-1
# Do NOT commit real keys in Dockerfile layers
```

```bash
docker run --env-file /secure/rdp.env your-image:tag
# or
docker run \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -e AZURE_TENANT_ID=... \
  -e AZURE_CLIENT_ID=... \
  -e AZURE_CLIENT_SECRET=... \
  your-image:tag
```

Use a .env file only on the host or in
CI to populate docker run --env-file; keep
.env out of git (.gitignore). For production,
prefer a secret store that mounts or injects env at deploy time.
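When a launcher script assembles the docker run command, one option is to forward variables by name only. Docker copies the value of -e NAME (with no =value) from the environment of the docker client process, so the secret never appears in the command line. The following is a minimal sketch under that assumption; docker_run_argv is a hypothetical helper, not something shipped with this repo.

```python
import shlex


def docker_run_argv(image, passthrough_vars, env_file=None):
    """Build a `docker run` argument list that forwards host variables by name.

    `-e NAME` without `=value` makes Docker read NAME from the environment of
    the `docker` client process, so secret values stay out of the command
    line and out of shell history.
    """
    argv = ["docker", "run"]
    if env_file is not None:
        argv += ["--env-file", env_file]
    for name in passthrough_vars:
        if "=" in name:
            # Guard against accidentally embedding a value.
            raise ValueError(f"pass variable names only, got {name!r}")
        argv += ["-e", name]
    argv.append(image)
    return argv


if __name__ == "__main__":
    argv = docker_run_argv(
        "your-image:tag",
        ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
        env_file="/secure/rdp.env",
    )
    # Print a shell-safe rendering; pass `argv` itself to subprocess.run.
    print(shlex.join(argv))
```

Returning a list rather than a string keeps the command safe to hand to subprocess.run without shell quoting concerns.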
Map secrets to pod environment variables (or use workload identity so fewer static keys are needed):
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: etl
      image: your-registry/rdp-etl:latest
      envFrom:
        - secretRef:
            name: rdp-cloud-credentials   # keys: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, …
      # Optional single vars:
      env:
        - name: AZURE_STORAGE_ACCOUNT_NAME
          value: "storacc01"
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/var/secrets/gcp/sa.json"
      volumeMounts:
        - name: gcp-sa
          mountPath: /var/secrets/gcp
          readOnly: true
  volumes:
    - name: gcp-sa
      secret:
        secretName: gcp-service-account
```

Azure / AWS on K8s: often you omit static keys and
bind a ServiceAccount to IAM (EKS) or use
Workload Identity (AKS) so the pod gets credentials
without AWS_* / AZURE_CLIENT_SECRET in a
Secret. That is still platform env/metadata, not Java
config.
Same rule: set env on the process
(export in shell, docker run -e, K8s
env, systemd Environment=, etc.).
maturin run / cargo run inherit the parent
shell unless you inject vars in the job definition.
| Store / protocol | URI examples | OS / process env (see sections below) | In pipeline JSON? |
|---|---|---|---|
| Amazon S3 | s3://bucket/key | AWS_* or IAM role | Location only |
| Google Cloud Storage | gs:// / gcs:// | GOOGLE_APPLICATION_CREDENTIALS or GCE/GKE identity | Location only |
| Azure ADLS | abfss://, azure:// | AZURE_* / MSI / account key | Location only |
| Snowflake | Often s3://… for stage I/O | Stage: AMAZON_S3.md; COPY optional: SNOWFLAKE_* | Account URL in sink JSON; not storage secrets |
| Databricks warehouse | abfss:// or s3:// under warehouse | Same as Azure or S3 for that URI | warehouse path only; PAT not used in-tree |
| SFTP | sftp://… | SFTP_PASSWORD, SFTP_PRIVATE_KEY_PATH, optional SFTP_USER | URI in file_transfer_uris only |
| FTP / FTPS | ftp:// / ftps:// | FTP_PASSWORD, optional FTP_USER | URI in file_transfer_uris only |
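A launcher can use the mapping in the table above as a pre-flight hint before it calls into the native library. The sketch below is an assumption-laden illustration: unset_env_hints and ENV_HINTS are hypothetical names, and the AWS_* and AZURE_* wildcards are represented by the specific variables this guide mentions elsewhere. An unset variable is only a hint, because an IAM role, MSI, or workload identity can supply credentials without any of these.

```python
from urllib.parse import urlsplit

# Scheme -> environment variables named in this guide (hints, not requirements).
ENV_HINTS = {
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "gs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "gcs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "abfss": ["AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"],
    "azure": ["AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"],
    "sftp": ["SFTP_PASSWORD", "SFTP_PRIVATE_KEY_PATH"],
    "ftp": ["FTP_PASSWORD"],
    "ftps": ["FTP_PASSWORD"],
}


def unset_env_hints(uri, env):
    """Return the hinted variable names for `uri` that are missing from `env`."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme not in ENV_HINTS:
        raise ValueError(f"no credential hints for scheme {scheme!r}")
    # Treat empty strings the same as unset.
    return [name for name in ENV_HINTS[scheme] if not env.get(name)]


if __name__ == "__main__":
    import os

    for uri in ("s3://bucket/key", "gs://bucket/path"):
        print(uri, "->", unset_env_hints(uri, os.environ))
```

Passing the environment as a parameter keeps the check easy to test and makes it explicit which process's variables are being inspected.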
| Layer | What it protects | Used by this repo for abfss:// I/O? | Where you configure it |
|---|---|---|---|
| Cloud storage (ADLS Gen2, S3, GCS) | Read/write blobs at abfss://…, s3://…, gs://… | Yes: all real bytes go through object_store | OS env / MSI / IAM on the container or host process |
| Databricks workspace (REST, notebooks, cluster OAuth, PAT dapi…) | Databricks APIs, SQL warehouses, cluster UI | No for in-tree sinks today | Databricks / Spark outside this FFI path |
Full guide (AWS env vars, IAM role, Docker, K8s, Java/Rust/Python): AMAZON_S3.md.
| Method | Environment / host setup |
|---|---|
| Service account JSON | GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json |
| GCE / GKE workload identity | Metadata server on the VM or pod — no path in JSON |
| User ADC (local dev) | gcloud auth application-default login on the machine running Rust |
URI:
gs://demo-gcs-project/rdp/incoming/part-00000.parquet
(validation also accepts gcs://).

Set GOOGLE_APPLICATION_CREDENTIALS in the shell, then call
ingest_from_object_store_uri /
export_dataset_to_object_store_uri. The same applies to the maturin process.
In pipelines, put the URI in object_store_uris or sink JSON and set
GOOGLE_APPLICATION_CREDENTIALS on the
pod/container/process (Docker --env-file,
K8s env, etc.), not via Java-only config files alone.

```json
"object_store_uris": ["gs://demo-gcs-project/rdp/incoming/part-00000.parquet"]
```

Full guide (env vars, Java/Rust/Python, Databricks
abfss:// warehouse): AZURE_ADLS.md.
A Databricks sink (kind: databricks) in Java (and Rust/Python via the same layout) often includes:

```json
{
  "kind": "databricks",
  "workspace_url": "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com",
  "catalog_uri": "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com/api/2.1/unity-catalog/iceberg",
  "warehouse": "abfss://datalake@storacc01.dfs.core.windows.net/unity/",
  "namespace": "main.curated",
  "table": "fact_scores"
}
```

| Field | Role in-tree today |
|---|---|
| warehouse | Required: abfss:// or s3:// root; Rust builds …/namespace/table/part-rdp-000.parquet and writes via object_store |
| namespace, table | Path layout (main.curated → main/curated/ under the warehouse) |
| workspace_url, catalog_uri | Metadata only: echoed in sink_results; no HTTP call or PAT/OAuth use in Rust |
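To see which object the storage credentials must be able to write, the path layout described in the table can be sketched in a few lines. This is an illustration of the documented layout, not the Rust implementation; databricks_parquet_uri is a hypothetical helper, and the exact joining rules (trailing slashes, part file naming beyond part-rdp-000.parquet) are assumptions.

```python
def databricks_parquet_uri(warehouse, namespace, table, part="part-rdp-000.parquet"):
    """Compose the Parquet object URI under a Databricks-style warehouse root.

    Assumed layout, following this guide: dots in the namespace become path
    separators, then the table name, then the part file.
    """
    if not warehouse.startswith(("abfss://", "s3://")):
        raise ValueError("warehouse must be an abfss:// or s3:// root")
    root = warehouse.rstrip("/")
    namespace_path = "/".join(namespace.split("."))
    return f"{root}/{namespace_path}/{table}/{part}"


if __name__ == "__main__":
    print(
        databricks_parquet_uri(
            "abfss://datalake@storacc01.dfs.core.windows.net/unity/",
            "main.curated",
            "fact_scores",
        )
    )
```

The resulting URI is an ordinary abfss:// or s3:// location, which is why only storage credentials matter for this write.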
Auth you must configure: for abfss://
warehouse → AZURE_ADLS.md; for
s3:// warehouse → AMAZON_S3.md.
A Databricks PAT (dapi…) or
workspace OAuth app does not
authenticate the in-tree write. Those are for Databricks REST, SQL
warehouses, and Spark drivers you run separately.
Full Delta transaction logs (ACID, time travel) are not committed
yet; see delta_lake.rs.
Full guide (stage AWS_*, optional
SNOWFLAKE_*, Docker/K8s, Java/Rust/Python): SNOWFLAKE.md.
Rust writes Parquet to handoff_uri (s3://,
abfss://, or file://).
| Concern | Where auth lives |
|---|---|
| Rust write to handoff_uri | AMAZON_S3.md or AZURE_ADLS.md env on the OS process |
| Spark read in your cluster | Your spark-submit / Databricks cluster (Kerberos, PAT, OAuth), not Rust FFI |
See CONNECTORS.md — Apache Spark.
Status: Implemented in
rust-data-processing and rdp_jvm_sys when
built with cloud_connectors (Cargo feature
file_transfer).
URL shape:
sftp://etl_user:FAKE_SFTP_PASS@sftp.example.com:22/rdp/incoming/data.parquet
| Auth | Notes |
|---|---|
| Password | User in URL; SFTP_PASSWORD env overrides URL password. Do not commit real passwords to git |
| SSH private key | SFTP_PRIVATE_KEY_PATH: path on the host running Rust / JVM / Python native code |
| Username only in env | SFTP_USER when the URL omits a user |
In pipeline JSON, declare the URI only (no secrets in JSON):

```json
"file_transfer_uris": ["sftp://etl_user@sftp.example.com:22/rdp/incoming/data.parquet"]
```

Rust downloads the remote file to a temp path, then uses the same
CSV/JSON/Parquet/XML readers as local ingest. Set
sources.options.format when the extension is ambiguous.
Fallback: land files on S3/ADLS/GCS/local with your
own SFTP client, then use object_store_uris or
sources.paths.
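The precedence rules in the SFTP auth table (environment password wins over the URL password, SFTP_USER fills in only when the URL has no user) can be modelled as follows. This is a sketch of the documented behaviour, not the Rust code; resolve_sftp_login is a hypothetical name, and defaulting to port 22 when the URL omits a port is an assumption based on the standard SSH port.

```python
from urllib.parse import unquote, urlsplit


def resolve_sftp_login(uri, env):
    """Combine URL userinfo with SFTP_* variables using the documented precedence."""
    parts = urlsplit(uri)
    if parts.scheme != "sftp":
        raise ValueError(f"expected an sftp:// URI, got scheme {parts.scheme!r}")
    url_user = unquote(parts.username) if parts.username else None
    url_password = unquote(parts.password) if parts.password else None
    return {
        "host": parts.hostname,
        # Assumption: fall back to the standard SSH port.
        "port": parts.port or 22,
        # SFTP_USER applies only when the URL omits a user.
        "user": url_user or env.get("SFTP_USER"),
        # SFTP_PASSWORD overrides any password embedded in the URL.
        "password": env.get("SFTP_PASSWORD") or url_password,
        "private_key_path": env.get("SFTP_PRIVATE_KEY_PATH"),
    }


if __name__ == "__main__":
    login = resolve_sftp_login(
        "sftp://etl_user@sftp.example.com:22/rdp/incoming/data.parquet",
        {"SFTP_PASSWORD": "FAKE_SFTP_PASS"},
    )
    # Avoid printing the password, even a placeholder one.
    print(login["user"], login["host"], login["port"])
```

Keeping the password out of the URI, as in the usage above, means the pipeline JSON can be committed while the secret stays in the process environment.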
Status: ftp:// and ftps://
via the same file_transfer module
(cloud_connectors feature).
URL:
ftp://etl_user:FAKE_FTP_PASS@ftp.example.com:21/rdp/incoming/data.parquet
| Auth | Notes |
|---|---|
| User / password | URL userinfo; FTP_PASSWORD env overrides URL password |
| Username | FTP_USER when the URL omits a user |
| FTPS | ftps:// with default port 990; TLS via rustls in-tree |

```json
"file_transfer_uris": ["ftp://etl_user@ftp.example.com:21/rdp/incoming/data.parquet"]
```

Fallback: same as SFTP, that is, object store or local paths after external sync.
| Do not put in JSON | Why |
|---|---|
| application.properties / Spring aws.* alone | Rust does not read Java config files; map to OS env at deploy time |
| System.setProperty("AWS…") without exporting to env | Native code uses the process environment block, not JVM system properties |
| AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | AWS chain reads OS env on the process |
| AZURE_CLIENT_SECRET, AZURE_TENANT_ID, bearer tokens | Azure client reads env / MSI on the Rust process |
| GOOGLE_APPLICATION_CREDENTIALS path | GCS client reads env on the Rust process |
| dapi… Databricks PAT | Not used by in-tree databricks sink (storage path only) |
| SFTP/FTP passwords for production | Use SFTP_PASSWORD / FTP_PASSWORD env on the native process, not pipeline JSON |
| jdbc:… URLs for DB read | Not supported; use ConnectorX oracle:// / mssql:// in sources.db_reads, or export to a local file and use sources.paths (see CONNECTORS.md) |
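A simple check in CI can flag pipeline JSON that carries any of the items listed in the table above. The sketch below is a heuristic illustration only: find_secret_like_entries and SECRET_KEY_NAMES are hypothetical, the dapi prefix test can produce false positives, and nothing here is part of the repo's validation.

```python
from urllib.parse import urlsplit

# Key names that this guide says belong in the process environment, not JSON.
SECRET_KEY_NAMES = {
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "SFTP_PASSWORD", "FTP_PASSWORD",
}


def _uri_has_password(text):
    """True when `text` parses as a URI with a password in its userinfo."""
    try:
        return "://" in text and bool(urlsplit(text).password)
    except ValueError:
        # Malformed netloc (for example unbalanced brackets): not a usable URI.
        return False


def find_secret_like_entries(node, path="$"):
    """Walk parsed JSON and return human-readable findings with their paths."""
    findings = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key.upper() in SECRET_KEY_NAMES:
                findings.append(f"{child}: credential-style key")
            findings += find_secret_like_entries(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            findings += find_secret_like_entries(value, f"{path}[{index}]")
    elif isinstance(node, str):
        if node.startswith("dapi"):
            findings.append(f"{path}: looks like a Databricks PAT")
        elif node.startswith("jdbc:"):
            findings.append(f"{path}: jdbc URLs are not supported")
        elif _uri_has_password(node):
            findings.append(f"{path}: URI embeds a password")
    return findings


if __name__ == "__main__":
    pipeline = {
        "sources": {
            "file_transfer_uris": [
                "sftp://etl_user:FAKE_SFTP_PASS@sftp.example.com:22/rdp/incoming/data.parquet"
            ]
        }
    }
    for finding in find_secret_like_entries(pipeline):
        print(finding)
```

Findings include the JSON path but never the offending value, so the report itself is safe to log.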
| | S3 | Azure ADLS (abfss://) | GCS |
|---|---|---|---|
| In JSON | s3://bucket/key | abfss://container@account.dfs…/path | gs://bucket/path |
| Credentials | AWS_* or IAM role | AZURE_* / MSI / account key | GOOGLE_APPLICATION_CREDENTIALS or workload identity |
| Java’s job | Pass URI + call FFI | Pass URI + call FFI | Pass URI + call FFI |
| Rust’s job | object_store + AWS chain | object_store + Azure builder | object_store + GCP |
Bottom line: Rust obtains storage tokens without Java or Python handing them over, as long as the native library’s process is configured correctly. Databricks workspace OAuth/PAT is a separate concern until REST/catalog integration is added.
| Component | Feature |
|---|---|
| Rust crate | cloud_connectors (includes object_store, Delta staging) |
| Python | cloud on python-wrapper |
| JVM | rdp_jvm_sys link-main (pulls cloud_connectors) |
DB read (sources.db_reads) is separate:
db_connectorx on JVM — see CONNECTORS.md.