Role Summary
Build robust, observable data pipelines that power research and production AI. Success means high pipeline reliability (on-time SLAs), strong data quality (validation and lineage), and enabling fast experimentation. You will partner with AI/ML, analytics, and product teams to make data trustworthy and usable.
Responsibilities
- Architect and operate batch/stream pipelines (Airflow; Spark optional) for ETL/ELT.
- Model/manage schemas; enforce data quality and lineage/governance.
- Support ML workflows with DVC (data versioning) and MLflow or Weights & Biases.
- Build feature stores/data services; expose datasets via secure REST endpoints.
- Optimize performance/cost across storage/compute; implement monitoring/alerting.
- Maintain documentation and internal catalogs; enable self-service analytics.
Qualifications
- Skills: Programming in Python or Java; SQL & NoSQL; Pandas/NumPy; PySpark; Airflow; API development; Docker.
- MLOps: DVC; MLflow or W&B; model packaging/deployment fundamentals.
- Cloud: Experience with AWS SageMaker, Azure ML, or GCP AI.
- Nice to have: Unreal Engine exposure.
- Environment: Solid Linux background for development and deployment.
- Education/Experience: Proven experience building reliable data pipelines in production.