Roles and responsibilities
The Data Engineer intern will participate in exciting projects covering the end-to-end data lifecycle - from raw data integrations with primary and third-party systems, through advanced data modelling, to state-of-the-art data visualisation and the development of innovative data products.
You will have the opportunity to learn how to build and work with both batch and real-time data processing pipelines. You will work in a modern cloud-based data warehousing environment alongside a team of diverse, intense and interesting co-workers. You will liaise with other departments - such as product & tech, the core business verticals, trust & safety, finance and others - to enable them to be successful.
Key Responsibilities Include:
● Raw data integrations with primary and third-party systems
● Data warehouse modelling for operational and application data layers
● Development in an Amazon Redshift cluster
● SQL development as part of an agile team workflow (see the illustrative sketch after this list)
● ETL design and implementation in Matillion ETL
● Design and implementation of data products enabling data-driven features or business solutions
● Data quality, system stability and security
● Coding standards in SQL, Python, ETL design
● Building data dashboards and advanced visualisations in Periscope Data with a focus on UX, simplicity and usability
● Working with other departments on data products - e.g. product & technology, marketing & growth, finance, core business, advertising and others
● Being part of, and contributing to, a strong team culture and the ambition to be on the cutting edge of big data
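To make the day-to-day work more concrete, below is a minimal sketch of the kind of batch load step described above: a COPY from S3 into a Redshift staging table followed by a SQL transform into a reporting layer. All cluster, bucket, IAM role, schema and table names are invented for illustration, and in practice this logic would typically be built as a Matillion ETL job rather than a standalone script.

```python
"""Minimal sketch of a batch load step into Amazon Redshift.

All connection details, schema and table names are hypothetical; in practice
this logic would usually live inside a Matillion ETL job.
"""
import psycopg2  # standard PostgreSQL/Redshift driver

COPY_STAGING = """
    COPY staging.events
    FROM 's3://example-bucket/events/2024-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
    FORMAT AS JSON 'auto';
"""

# Promote cleaned rows from the staging layer to the reporting layer.
LOAD_REPORTING = """
    INSERT INTO reporting.daily_events (event_date, event_type, event_count)
    SELECT DATE_TRUNC('day', event_ts)::date, event_type, COUNT(*)
    FROM staging.events
    WHERE event_ts IS NOT NULL
    GROUP BY 1, 2;
"""

def run_batch_load() -> None:
    conn = psycopg2.connect(
        host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="***",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(COPY_STAGING)    # bulk load raw files from S3
            cur.execute(LOAD_REPORTING)  # transform into the reporting model
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    run_batch_load()
```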
Requirements
- Bachelor's degree in computer science, engineering, math, physics or any related quantitative field
- Knowledge of relational and dimensional data models
- Knowledge of terminal operations and Linux workflows
- Ability to communicate insights and findings to a non-technical audience
- Good SQL skills across a variety of relational data warehousing technologies, especially cloud data warehouses such as Amazon Redshift, Google BigQuery, Snowflake or Vertica (illustrated by the short example after this list)
- Attention to detail and analytical thinking
- Entrepreneurial spirit and the ability to think creatively; highly driven and self-motivated; strong curiosity and a drive for continuous learning
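To illustrate the relational and dimensional modelling and SQL points above, here is a tiny, self-contained star-schema example: one fact table joined to one dimension and aggregated. SQLite stands in for a cloud warehouse, and all table and column names are made up for the example.

```python
"""Tiny star-schema example: one fact table joined to a dimension.
All names are illustrative; SQLite stands in for a cloud warehouse."""
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        country_id INTEGER REFERENCES dim_country(country_id),
        amount REAL
    );
    INSERT INTO dim_country VALUES (1, 'Greece'), (2, 'Cyprus');
    INSERT INTO fact_orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 10.0);
""")

# Typical dimensional query: aggregate the fact table, grouped by a dimension attribute.
query = """
    SELECT d.country_name, COUNT(*) AS orders, SUM(f.amount) AS revenue
    FROM fact_orders f
    JOIN dim_country d ON d.country_id = f.country_id
    GROUP BY d.country_name
    ORDER BY revenue DESC;
"""
for row in conn.execute(query):
    print(row)  # e.g. ('Greece', 2, 65.0)
```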
Desired candidate profile
1. Data Architecture and Infrastructure
- Designing Data Pipelines: Develop, construct, and maintain efficient data pipelines that enable the movement and transformation of large datasets between various systems and storage solutions.
- Building Data Warehouses: Create data storage solutions like data warehouses or data lakes that allow easy access, retrieval, and analysis of data from various sources (e.g., transactional databases, cloud platforms).
- Database Design and Optimization: Design databases, ensuring that they are scalable, secure, and optimized for both performance and storage.
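As a rough sketch of what designing a data pipeline can look like in code, under the assumption of a simple batch job with invented function names and sample records, each stage below is a small, testable function and the pipeline is just their composition:

```python
"""Minimal batch-pipeline skeleton: extract -> transform -> load.
Each stage is a plain function so it can be tested and reused; all
names and the in-memory 'sink' are illustrative."""
from typing import Iterable, Iterator

def extract() -> Iterator[dict]:
    # In a real pipeline this would read from an API, database or file.
    yield from [
        {"user_id": "1", "amount": "19.90"},
        {"user_id": "2", "amount": "5.00"},
    ]

def transform(rows: Iterable[dict]) -> Iterator[dict]:
    # Cast types and derive fields once, close to the source.
    for row in rows:
        yield {"user_id": int(row["user_id"]), "amount_eur": float(row["amount"])}

def load(rows: Iterable[dict], sink: list) -> None:
    # Stand-in for a warehouse insert (e.g. a COPY or INSERT into Redshift).
    sink.extend(rows)

if __name__ == "__main__":
    warehouse: list = []
    load(transform(extract()), warehouse)
    print(warehouse)
```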
2. Data Collection and Integration
- Data Integration: Integrate data from a variety of sources such as APIs, databases, flat files, cloud storage, and real-time data streams into centralized systems.
- ETL Processes (Extract, Transform, Load): Develop and maintain ETL processes to clean, transform, and load raw data into usable formats for analytics.
- Data Governance: Ensure that data is accurate, consistent, and secure by enforcing data governance practices and maintaining data quality standards (a minimal example of such a check appears in the sketch below).
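A minimal, self-contained sketch of the integration and governance ideas above: two sources (a CSV flat file and a JSON API payload, inlined here so the example runs as-is) are merged into one table, with a simple quality gate before loading. All names are illustrative.

```python
"""Sketch of integrating two sources into one table with a basic quality gate.
Sources are inlined so the example runs standalone; real inputs would come
from files, APIs, databases or streams."""
import csv, io, json, sqlite3

CSV_SOURCE = "user_id,country\n1,GR\n2,CY\n"
API_SOURCE = '[{"user_id": 1, "plan": "pro"}, {"user_id": 2, "plan": "free"}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT, plan TEXT)")

# Extract from both sources and merge on user_id.
countries = {int(r["user_id"]): r["country"] for r in csv.DictReader(io.StringIO(CSV_SOURCE))}
plans = {r["user_id"]: r["plan"] for r in json.loads(API_SOURCE)}
rows = [(uid, countries.get(uid), plans.get(uid)) for uid in sorted(set(countries) | set(plans))]

# Minimal data-quality gate: refuse to load rows missing mandatory fields.
assert all(country is not None for _, country, _ in rows), "missing country for some users"

conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
print(conn.execute("SELECT * FROM users").fetchall())
```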
3. Data Transformation and Processing
- Data Cleaning: Process and clean raw data to remove inconsistencies, errors, or duplicates, ensuring that the data used for analysis is reliable and of high quality.
- Data Transformation: Transform data into a structured format suitable for analysis, reporting, and further processing by data scientists and analysts.
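A short example of the cleaning and transformation steps described above, assuming pandas is available; the sample frame and column names are invented:

```python
"""Sketch of typical cleaning steps on a raw extract using pandas.
The sample frame and column names are invented for the example."""
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": ["10.5", "10.5", "n/a", "7.0"],
    "ordered_at": ["2024-01-01", "2024-01-01", "2024-01-02", "not a date"],
})

clean = (
    raw.drop_duplicates(subset="order_id")  # remove duplicate rows
       .assign(
           amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),        # bad values -> NaN
           ordered_at=lambda d: pd.to_datetime(d["ordered_at"], errors="coerce"),
       )
       .dropna(subset=["amount"])           # drop rows that cannot be repaired
)
print(clean)
```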
4. Performance and Scalability
- Optimization: Continuously monitor, optimize, and troubleshoot data pipelines and storage solutions to ensure they perform efficiently at scale, especially as data volumes grow.
- Automation: Automate repetitive tasks like data loading and monitoring to reduce manual effort and improve the efficiency of data processing.
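One way to picture the automation and monitoring points above is a recurring job wrapped with retries and logging, using only the Python standard library; the job body and retry settings below are placeholders:

```python
"""Sketch of automating a recurring load with basic retries and logging.
The job body is a placeholder; a scheduler (cron, Matillion, Airflow, etc.)
would trigger it in practice."""
import logging, time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("nightly_load")

def run_with_retries(job, attempts: int = 3, wait_seconds: int = 60) -> None:
    for attempt in range(1, attempts + 1):
        try:
            job()
            log.info("job succeeded on attempt %d", attempt)
            return
        except Exception:
            log.exception("attempt %d/%d failed", attempt, attempts)
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise RuntimeError("job failed after all retries")  # surfaces to alerting

def nightly_load() -> None:
    # Placeholder for the actual pipeline run (e.g. the Redshift load sketched earlier).
    log.info("loading yesterday's partitions...")

if __name__ == "__main__":
    run_with_retries(nightly_load)
```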
5. Collaboration with Data Scientists and Analysts
- Collaborate on Data Needs: Work closely with data scientists and analysts to understand their data requirements and provide them with clean, organized, and ready-to-use datasets.
- Provide Data Access: Ensure that analysts and other users can easily access and query the data they need by setting up efficient querying tools and user interfaces.
6. Cloud Platforms and Big Data
- Cloud Solutions: Leverage cloud-based platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure) for scalable data storage and computing resources.
- Big Data Technologies: Implement and manage big data technologies (e.g., Hadoop, Spark, Kafka) to process and analyze large datasets across distributed systems.
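As a sketch of the big data tooling mentioned above, here is a minimal PySpark batch job (read, filter, aggregate, write); it assumes pyspark is installed and a local or cluster runtime is available, and uses invented paths and column names:

```python
"""Sketch of a simple Spark batch job: read, filter, aggregate, write.
Paths and column names are hypothetical."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

events = spark.read.json("s3a://example-bucket/events/2024-01-01/")  # hypothetical path

daily_counts = (
    events.filter(F.col("event_type").isNotNull())
          .groupBy(F.to_date("event_ts").alias("event_date"), "event_type")
          .agg(F.count("*").alias("event_count"))
)

daily_counts.write.mode("overwrite").parquet("s3a://example-bucket/daily_event_counts/")
spark.stop()
```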