Vahdettin Karataş
Data & ML engineering — batch pipelines, APIs, validation
  • Location: Prague, Czech Republic
Technical focus
  • Batch feature pipelines
  • Definition registries & transforms
  • FastAPI & OpenAPI
  • pytest & CI
  • Reproducible data contracts
Proof of work · batch ML data discipline

Feature Store Mini

Production-minded pattern: one registry for feature meaning, one build path, structured validation, and the same transforms whether you run a CLI job or POST a CSV to FastAPI.

Errors carry stable codes (for example, one for missing raw columns), and OpenAPI documents the live service.

Python · pandas
pytest · GitHub Actions
FastAPI · OpenAPI
Offline batch only

Limitations & scope

  • Not an online feature store: no point-in-time joins, no real-time serving layer, no Feast/Tecton-scale materialization.
  • Scope is a single-repo pattern for consistent offline features and reviewable validation—not training or model hosting.
  • Demo raw schema is Telco-style tabular data; production would swap the same pattern for your own column contract.

Overview

The repo implements a single build entrypoint that reads a raw CSV (synthetic sample included), applies registered column transforms, and writes a feature table. A compact validation pass runs on the result so duplicates, schema gaps, and coarse sanity issues surface as structured summaries—suitable for logs or JSON responses.
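A minimal sketch of the registry, build, and validation flow described above, written in plain Python so it runs without dependencies (the repo itself uses pandas). The `register` decorator, the `validate` summary keys, and the raw column names `internet_service` and `tech_support` are illustrative assumptions, not the repo's actual names.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]

# Hypothetical registry: feature name -> function of one raw row.
REGISTRY: Dict[str, Callable[[Row], object]] = {}


def register(name: str):
    """Decorator that records a transform under its feature name."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap


@register("is_fiber_user")
def is_fiber_user(row: Row) -> bool:
    # Assumed raw column and value; a real contract may differ.
    return row["internet_service"] == "Fiber optic"


@register("has_tech_support")
def has_tech_support(row: Row) -> bool:
    return row["tech_support"] == "Yes"


def build_feature_table(raw_csv: str) -> List[dict]:
    """Single build path: read raw rows, apply every registered transform."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    return [
        {"customer_id": r["customer_id"],
         **{name: fn(r) for name, fn in REGISTRY.items()}}
        for r in rows
    ]


def validate(table: List[dict]) -> dict:
    """Return a structured summary (log- or JSON-friendly) instead of raising."""
    ids = [r["customer_id"] for r in table]
    expected = {"customer_id", *REGISTRY}
    missing = sorted(expected - set(table[0])) if table else sorted(expected)
    return {
        "row_count": len(table),
        "duplicate_ids": sorted({i for i in ids if ids.count(i) > 1}),
        "missing_columns": missing,
        "ok": bool(table) and not missing and len(ids) == len(set(ids)),
    }


# Tiny synthetic sample with one duplicated customer_id.
RAW = """customer_id,internet_service,tech_support
C1,Fiber optic,Yes
C2,DSL,No
C2,DSL,No
"""
table = build_feature_table(RAW)
summary = validate(table)
```

Because `validate` returns a dictionary rather than raising, the same summary can be written to a log by a batch job or returned as a JSON body by a service.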

Workflow

  • Define in registry
  • Single build step
  • Validate output
  • Same logic via FastAPI

Transforms and validation code live under src/features and src/validation. The CLI entry is python -m src.pipeline.build_feature_table; HTTP callers use POST /demo/transform with the same engine.
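One way to keep the CLI and HTTP paths on the same engine is to make both entrypoints thin wrappers around a single function. The sketch below is framework-agnostic and illustrative only: `run_engine`, `cli_entry`, `http_entry`, the `MISSING_RAW_COLUMNS` code, the required column list, and the 422 status are assumptions, not the repo's actual names or values.

```python
import csv
import io
import json
from typing import List, Tuple

# Assumed raw column contract, for illustration only.
REQUIRED_RAW_COLUMNS = ("customer_id", "tenure", "monthly_charges")


class FeatureBuildError(Exception):
    """Error carrying a stable, machine-readable code plus details."""

    def __init__(self, code: str, detail: dict):
        super().__init__(code)
        self.code = code
        self.detail = detail


def run_engine(raw_csv: str) -> List[dict]:
    """The one code path that both entrypoints call."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    present = reader.fieldnames or []
    missing = [c for c in REQUIRED_RAW_COLUMNS if c not in present]
    if missing:
        raise FeatureBuildError("MISSING_RAW_COLUMNS", {"columns": missing})
    # Placeholder output; real transforms would be applied here.
    return [{"customer_id": r["customer_id"], "feature_version": "v1"}
            for r in reader]


def cli_entry(raw_csv: str) -> int:
    """CLI-style wrapper: prints JSON and returns a process exit code."""
    try:
        print(json.dumps(run_engine(raw_csv)))
        return 0
    except FeatureBuildError as err:
        print(json.dumps({"code": err.code, **err.detail}))
        return 1


def http_entry(body: bytes) -> Tuple[int, dict]:
    """HTTP-style wrapper: returns (status, JSON-serialisable payload)."""
    try:
        return 200, {"rows": run_engine(body.decode("utf-8"))}
    except FeatureBuildError as err:
        return 422, {"code": err.code, **err.detail}
```

In a FastAPI app, the route handler for the upload would play the role of `http_entry`, reading the uploaded bytes and mapping the error code to a response. Keeping the code string fixed lets callers branch on it without parsing messages.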

Engineered outputs

Eight derived columns plus customer_id (column names are defined in definitions.py):

  • num_active_services — count of active service flags
  • is_long_term_contract — contract term bucket
  • monthly_charge_band — Low / Medium / High from monthly charge
  • charge_per_tenure — ratio with safe handling when tenure is zero
  • has_tech_support, is_fiber_user, has_streaming_bundle — boolean signals
  • feature_version — constant version string for the build
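A few of the derivations listed above can be sketched as plain functions. The band thresholds, the 0.0 fallback for zero tenure, and the "Yes" flag value are assumptions chosen for illustration; the actual rules live in the repo's definitions.

```python
from typing import Iterable


def charge_per_tenure(monthly_charge: float, tenure_months: int) -> float:
    """Ratio of charge to tenure; returns 0.0 when tenure is zero
    (assumed fallback) instead of dividing by zero."""
    if tenure_months <= 0:
        return 0.0
    return monthly_charge / tenure_months


def monthly_charge_band(monthly_charge: float,
                        low_max: float = 35.0,
                        medium_max: float = 70.0) -> str:
    """Bucket a monthly charge into Low / Medium / High.
    The default thresholds are hypothetical."""
    if monthly_charge <= low_max:
        return "Low"
    if monthly_charge <= medium_max:
        return "Medium"
    return "High"


def num_active_services(flags: Iterable[str]) -> int:
    """Count service flags whose value is 'Yes' (assumed encoding)."""
    return sum(1 for value in flags if value == "Yes")
```

Passing the thresholds as parameters keeps the banding rule reviewable in one place and easy to pin down in a unit test.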

Tech stack

Python 3.11–3.12 (CI) · pandas · pytest · FastAPI + OpenAPI · GitHub Actions

No Streamlit dashboard in repo; interaction is CLI, HTTP upload to /demo/transform, or Swagger.

Deployment / live interface

Canonical service: features.vahdetkaratas.com. The app serves GET / when a static UI bundle is deployed under layout-shell/; the JSON endpoints and /docs do not depend on that UI. A typical VPS deployment installs the dependencies, runs Uvicorn against src.api.main:app, and terminates TLS at a reverse proxy. No Dockerfile is defined in the repo; a thin image wrapping the same command would be enough if one is added later.

Why this project

It demonstrates where features come from and how to keep batch scoring or retraining aligned with the same transforms—without claiming a full hosted feature platform. Good fit for code review: small surface, explicit contracts, tests, and an honest limitations block.

© Vahdettin Karataş. All rights reserved.