Build ingestion pipelines for structured/unstructured data using Python
Clean, normalize, and prepare data in formats suitable for LLM fine-tuning (e.g., JSONL, CSV)
Create high-quality, task-specific datasets for training and evaluation
Apply versioning to datasets using DVC or LakeFS for reproducibility
Generate embeddings using HuggingFace or Sentence Transformers
Manage vector indexes (FAISS, Weaviate) and optimize retrieval workflows
Tokenize and chunk long-form data for context window optimization
Requirements
10 years of experience in a Data Engineering role
2 years of experience in an AI-adjacent data role
Proficiency in Python, pandas, and text processing tools
Familiarity with tokenization libraries (HuggingFace Tokenizers, SentencePiece)
Experience managing datasets and object storage (MinIO, NFS)
Understanding of LLM data constraints (context windows, formatting, prompt injection)