Job Description:
Keywords
- Kafka, Confluent
- Real-Time Streaming Data Ingestion
- Producers, Consumers, Topics
- Data Lake Integration, Lakehouse Integration
- Azure Databricks, Python, Scala, Java
Databricks Engineer
Positions: 2

1. Overview
Duration: 3 Months
Location: Dubai
The bank is building a scalable data ingestion and streaming platform that ingests change data capture (CDC) events from diverse source systems (databases and applications), processes them in real time, and lands curated data into our analytics lake. The platform uses Confluent connectors (Debezium/Oracle CDC) to emit Parquet files into cloud storage and leverages Databricks Auto Loader to incrementally ingest, deduplicate, and write this data into Delta Lake Bronze tables. To ensure broad applicability, the following job description emphasizes generic streaming and data engineering skills while highlighting the core technologies used in our solution.

2. Responsibilities
- Design and develop streaming ingestion pipelines. Use Apache Spark (Structured Streaming) and Databricks Auto Loader to consume files from cloud storage or messages from Kafka/RabbitMQ/Confluent Cloud and ingest them into Delta Lake, ensuring schema evolution and exactly-once semantics (a sketch follows this list).
- Implement CDC and deduplication logic. Capture change events from source databases using Debezium, the built-in CDC features of SQL Server/Oracle, or other connectors. Apply watermarking and drop-duplicates strategies based on primary keys and event timestamps.
- Ensure data quality and fault tolerance. Configure checkpointing, error handling, and dead-letter queues (DLQs) so that malformed or late data can be quarantined and replayed. Optimise file sizes, partitioning, and clustering to maintain performance.
- Scale ingestion through configuration. Build a config-driven framework (e.g. using Airflow, DBX Jobs, or Delta Live Tables) that iterates over metadata tables to deploy/update ingestion pipelines for hundreds of tables/sources without code duplication (a sketch of this pattern appears after section 4 below).
- Collaborate on architecture and orchestration. Contribute to the overall data platform architecture, integrating data sources, message queues, processing engines, and storage, and define orchestration patterns for backfill, replay, and streaming jobs.
- Implement monitoring, observability, and security. Capture streaming query metrics and publish them to monitoring platforms (Prometheus, Grafana). Set up dashboards for lag, files processed, and processing duration. Enforce role-based access control, encryption, and data masking.
- Work with data consumers. Partner with analytics teams, data scientists, and downstream application developers to ensure that ingested data meets their requirements. Provide documentation, metadata, and lineage for all tables.
- Participate in DevOps processes. Use CI/CD pipelines (e.g. Jenkins, GitHub Actions) to automate deployment of jobs; manage infrastructure with Terraform or similar tools; follow best practices for version control and code reviews.
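For illustration, here is a minimal PySpark sketch of the ingestion pattern described above: Auto Loader ingestion with watermark-based deduplication into a Bronze Delta table. It assumes a Databricks notebook where spark is predefined; the storage paths, the customer_id and event_ts columns, and the bronze.customers_cdc table are hypothetical placeholders, not details of the bank's actual pipelines.

# Minimal sketch only (Databricks notebook context where spark is predefined).
# Reads CDC Parquet files with Auto Loader, de-duplicates on a primary key and
# event timestamp, and writes to a Bronze Delta table. All paths, the columns
# customer_id / event_ts, and the table bronze.customers_cdc are hypothetical.
from pyspark.sql import functions as F

source_path = "abfss://landing@<storage_account>.dfs.core.windows.net/cdc/customers/"
checkpoint_path = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/bronze/customers/"
bronze_table = "bronze.customers_cdc"

raw = (
    spark.readStream
        .format("cloudFiles")                                        # Databricks Auto Loader
        .option("cloudFiles.format", "parquet")                      # files emitted by the Confluent connector
        .option("cloudFiles.schemaLocation", checkpoint_path + "schema/")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")   # tolerate evolving source schemas
        .load(source_path)
)

deduped = (
    raw.withColumn("event_ts", F.col("event_ts").cast("timestamp"))  # ensure timestamp type for watermarking
       .withWatermark("event_ts", "2 hours")                         # bound state kept for late/duplicate events
       .dropDuplicates(["customer_id", "event_ts"])                  # drop replayed CDC events by key + timestamp
)

query = (
    deduped.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)               # fault tolerance and restartable progress
        .option("mergeSchema", "true")                               # let the Bronze table absorb new columns
        .trigger(availableNow=True)                                  # process available files, then stop
        .toTable(bronze_table)
)

Checkpointing plus Delta's transactional writes is what supports the fault-tolerance and exactly-once requirements above; dead-letter handling for malformed records (for example, routing Auto Loader's _rescued_data column to a quarantine table) is omitted from the sketch for brevity.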
3. Required Skills & Experience
- 5-8 years of experience designing and building data pipelines using Apache Spark, Databricks, or equivalent big data frameworks.
- Hands-on expertise with streaming and messaging systems such as Apache Kafka (publish/subscribe architecture), Confluent Cloud, RabbitMQ, or Azure Event Hubs. Experience creating producers, consumers, and topics and integrating them into downstream processing.
- Deep understanding of relational databases and CDC. Proficiency in SQL Server, Oracle, or other RDBMSs; experience capturing change events using Debezium or native CDC tools and transforming them for downstream consumption.
- Proficiency in programming languages such as Python, Scala, or Java, and solid knowledge of SQL for data manipulation and transformation.
- Cloud platform expertise. Experience with Azure or AWS services for data storage, compute, and orchestration (e.g. ADLS, S3, Azure Data Factory, AWS Glue, Airflow, DBX, DLT).
- Data modelling and warehousing. Knowledge of data lakehouse architectures, Delta Lake, partitioning strategies, and performance optimisation.
- Version control and DevOps. Familiarity with Git and CI/CD pipelines; ability to automate deployment and manage infrastructure as code.
- Strong problem-solving and communication skills. Ability to work with cross-functional teams and articulate complex technical concepts to non-technical stakeholders.

4. Preferred / Bonus Skills
- Experience with event-driven architectures and microservices integration.
- Exposure to NiFi, Flume, or other ingestion frameworks for connecting heterogeneous sources.
- Knowledge of graph processing or machine learning pipelines on Spark.
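As a further illustration of the config-driven framework mentioned under Responsibilities ("Scale ingestion through configuration"), the sketch below loops over a hypothetical metadata table, config.ingestion_sources, and starts one Auto Loader stream per configured source. The table name, its columns (source_path, target_table, primary_keys, event_ts_col), and the checkpoint root are assumptions for illustration only.

# Minimal sketch only (Databricks notebook context where spark is predefined).
# Iterates over a hypothetical metadata table and starts one Auto Loader stream
# per configured source; table/column names and the checkpoint root are placeholders.

def start_bronze_stream(source_path, target_table, primary_keys, event_ts_col, checkpoint_root):
    # Start one incremental ingestion stream for a single configured source.
    checkpoint = f"{checkpoint_root}/{target_table}"
    stream = (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .option("cloudFiles.schemaLocation", f"{checkpoint}/schema")
            .load(source_path)
            .withWatermark(event_ts_col, "2 hours")
            .dropDuplicates(primary_keys + [event_ts_col])
    )
    return (
        stream.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint)
            .trigger(availableNow=True)
            .toTable(target_table)
    )

# One streaming query per row in the metadata table; onboarding a new source
# only requires a new configuration row, not new pipeline code.
configs = spark.table("config.ingestion_sources").collect()
queries = [
    start_bronze_stream(
        source_path=row["source_path"],
        target_table=row["target_table"],
        primary_keys=row["primary_keys"].split(","),   # e.g. "customer_id" or "id,region"
        event_ts_col=row["event_ts_col"],
        checkpoint_root="abfss://checkpoints@<storage_account>.dfs.core.windows.net/bronze",
    )
    for row in configs
]

In practice the same loop could be driven by Airflow, Databricks Jobs, or Delta Live Tables rather than a notebook, which is the orchestration choice the role is expected to help define.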
Required Experience:
Senior IC