Data Engineer PySpark

Virtusa

Job Location: Dubai - UAE

Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

PySpark JD

Responsibilities:

  • Data Pipeline Development: Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity (a minimal pipeline sketch follows this list).
  • Data Ingestion: Implement and manage data ingestion processes from a variety of sources (relational databases, APIs, file systems) into the data lake or data warehouse on the Cloudera Data Platform.
  • Data Transformation and Processing: Use PySpark to process, cleanse, and transform large datasets into meaningful formats that support analytical and business needs.
  • Performance Optimization: Conduct performance tuning of PySpark code and Cloudera components, optimizing resource utilization and reducing the runtime of ETL jobs.
  • Data Quality and Validation: Implement data quality checks, monitoring, and validation routines to ensure data accuracy and reliability throughout the pipeline.
  • Automation and Orchestration: Automate data workflows using tools like Apache Oozie, Airflow, or similar orchestration tools within the Cloudera Data Platform (see the orchestration sketch below).
  • Monitoring and Maintenance: Monitor pipeline performance, troubleshoot issues, and perform routine maintenance on the Cloudera Data Platform and associated data pipelines.
  • Collaboration: Work closely with other data engineers, analysts, product managers, and other stakeholders to understand data requirements and support various data-driven initiatives.
  • Documentation: Maintain thorough documentation of data engineering processes, code, and pipelines.

Qualifications and Experience:

  • Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or a related field.
  • 3 years of experience as a Data Engineer with a strong focus on PySpark and the Cloudera Data Platform.

Technical Skills:

  • PySpark: Advanced proficiency in PySpark, including working with RDDs, DataFrames, and optimization techniques.
  • Cloudera Data Platform: Strong experience with Cloudera Data Platform (CDP) components, including Cloudera Manager, Hive, Impala, and HDFS.
  • Data Warehousing: Knowledge of data warehousing concepts, ETL best practices, and experience with SQL-based tools (Hive, Impala).
  • Big Data Technologies: Familiarity with Hadoop, Kafka, and other distributed computing technologies.
  • Orchestration and Scheduling: Experience with Apache Oozie, Airflow, or similar orchestration tools.
  • Scripting and Automation: Strong scripting skills.

Soft Skills:

  • Strong analytical and problem-solving skills.
  • Verbal and written communication skills.
  • Ability to work independently and collaboratively in a team.
  • Attention to detail and commitment to data quality.
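To make the pipeline, transformation, and data-quality responsibilities concrete, here is a minimal sketch of the kind of PySpark ETL job they describe: JDBC ingestion, DataFrame cleansing, a simple validation check, and a partitioned write into the Hive warehouse. The connection details, table names, and column names (orders, order_id, amount, and so on) are hypothetical placeholders, not specifics from this posting.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support so tables land in the CDP/Hive metastore; app name is illustrative.
spark = (
    SparkSession.builder
    .appName("orders_etl")
    .enableHiveSupport()
    .getOrCreate()
)

# Ingestion: pull raw rows from a relational source over JDBC (placeholder connection).
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Transformation: deduplicate, filter bad rows, and derive analytical columns.
clean = (
    raw.dropDuplicates(["order_id"])
       .filter(F.col("amount") > 0)
       .withColumn("order_date", F.to_date("order_ts"))
       .withColumn("amount_usd", F.round(F.col("amount") * F.col("fx_rate"), 2))
)

# Data quality: fail fast if a mandatory key is missing.
null_keys = clean.filter(F.col("order_id").isNull()).count()
if null_keys > 0:
    raise ValueError(f"{null_keys} rows have a null order_id")

# Load: write partitioned Parquet into a Hive-managed table on HDFS.
(
    clean.write.mode("overwrite")
         .partitionBy("order_date")
         .format("parquet")
         .saveAsTable("analytics.orders_clean")
)
```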
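For the automation and orchestration responsibility, a similarly hedged sketch shows how such a job might be scheduled with Apache Airflow, one of the tools the posting names. The DAG id, schedule, script path, and spark-submit flags are illustrative assumptions, not details taken from the role.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A daily DAG that submits the (hypothetical) PySpark ETL script to YARN.
with DAG(
    dag_id="orders_etl_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run every day at 02:00
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_orders_etl",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--num-executors 8 --executor-memory 4g "
            "/opt/jobs/orders_etl.py"
        ),
    )
```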

Key Skills

  • Apache Hive
  • S3
  • Hadoop
  • Redshift
  • Spark
  • AWS
  • Apache Pig
  • NoSQL
  • Big Data
  • Data Warehouse
  • Kafka
  • Scala

About Company

Inside every Virtusan is a spirit defined by the drive to explore new frontiers, an intellectual curiosity, a need to challenge the status quo, and the inspiration to mind the greater good–all while impacting the bottom line. It all adds up to a culture of innovation.
