Roles and responsibilities
- Create automated testing approaches and infrastructure for validating reliability, performance, and resilience of cloud orchestration tools and applications
- Enable engineering teams across Canonical to develop software with confidence by making distributed system testing tooling available across the company
- Enhance continuous integration pipelines for deploying and testing Canonical's cloud native products such as Kubeflow
- Deploy, manage, and debug highly distributed systems
- Monitor and report on automated testing efforts
- Collaborate daily with a globally distributed team
What we are looking for in you
- Solid background in modern test processes and strategies
- Experience with Python or Go development
- Strong object-oriented development skills
- Ability to develop and ship production-grade modern web applications
- Working knowledge of continuous integration tools such as Jenkins, CircleCI, or GitHub Actions
- Knowledge of networking technologies and fundamentals
- Solid understanding of Linux system architecture
- A capacity for complex abstract thinking
- Capability for 2-4 weeks of international travel per year
Additional skills that you might also bring
- Collecting and analyzing large multidimensional datasets
- Operating data platforms: key-value stores, relational or document databases, event buses
- Working with cloud technologies such as OpenStack, Kubernetes, Terraform, and AWS
- Developing AI/ML pipelines
1. Strong Knowledge of Distributed Systems Concepts
- Understanding of key distributed systems principles such as the CAP theorem, eventual consistency, sharding, replication, and partition tolerance (a toy quorum example follows this list).
- Knowledge of distributed algorithms for consensus (e.g., Paxos, Raft) and leader election.
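For illustration, a toy check of the classic quorum condition R + W > N, which guarantees that read and write quorums overlap on at least one replica (the replica counts below are hypothetical):

```python
def quorums_overlap(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """With N replicas, W write acks, and R read acks, every read is
    guaranteed to see the latest write iff R + W > N (quorums intersect)."""
    return read_quorum + write_quorum > n_replicas

# Hypothetical 5-replica cluster:
print(quorums_overlap(5, 3, 3))  # True  -> reads always overlap the last write
print(quorums_overlap(5, 2, 2))  # False -> stale reads (eventual consistency) possible
```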
2. Proficiency with Python
- Advanced Python skills, including asynchronous programming, multi-threading, and concurrent processing.
- Familiarity with Python libraries commonly used in distributed systems (e.g., asyncio, Celery, Pyro5, requests, gRPC); a minimal asyncio example follows this list.
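As a minimal sketch of asynchronous fan-out using only the standard library (latencies are simulated, no real network calls):

```python
import asyncio
import random

async def fetch(node: str) -> str:
    """Simulate an I/O-bound request to a remote node."""
    await asyncio.sleep(random.uniform(0.1, 0.5))  # stand-in for network latency
    return f"response from {node}"

async def main() -> None:
    # Fan out to several nodes concurrently and gather all results.
    results = await asyncio.gather(*(fetch(n) for n in ("node-a", "node-b", "node-c")))
    for r in results:
        print(r)

asyncio.run(main())
```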
3. Experience with Distributed Computing Tools
- Experience with message brokers like RabbitMQ or Kafka, and task queues like Celery (a minimal Celery sketch follows this list).
- Familiarity with distributed storage solutions like Redis, Cassandra, or MongoDB.
- Knowledge of cloud platforms like AWS, GCP, or Azure, and how they support distributed computing.
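By way of example, a minimal Celery task definition; it assumes a RabbitMQ broker running on localhost, and the broker URL and task body are placeholders:

```python
from celery import Celery

# Hypothetical setup: assumes RabbitMQ is reachable on localhost.
app = Celery("tasks", broker="amqp://guest:guest@localhost:5672//")

@app.task
def add(x: int, y: int) -> int:
    """A trivial task executed by whichever worker picks it up."""
    return x + y

# Callers enqueue work without blocking:
#   result = add.delay(2, 3)
```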
Desired candidate profile
1. Designing and Implementing Distributed Systems
- Architecture Design: Designing robust, scalable, and efficient distributed system architectures, including handling challenges like load balancing, failover, and data consistency.
- Service Communication: Implementing efficient communication protocols (e.g., HTTP, gRPC) and messaging systems (e.g., Kafka, RabbitMQ) so that services within a distributed system can interact with each other.
- Fault Tolerance and Reliability: Ensuring the system can gracefully handle failures and recover from them without data loss or significant downtime, including mechanisms like retries, circuit breakers, and health checks (a simplified sketch follows this list).
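One possible shape for such mechanisms, as a minimal sketch (the names, thresholds, and delays are illustrative, not a production implementation):

```python
import random
import time
from functools import wraps

def retry(max_attempts: int = 3, base_delay: float = 0.2):
    """Retry a flaky call with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts - 1:
                        raise  # retries exhausted: surface the failure
                    time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
        return wrapper
    return decorator

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures."""
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result
```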
2. Performance Optimization
- Latency and Throughput: Ensuring the distributed system performs well under high load while keeping latency low; identifying and resolving bottlenecks in communication or computation.
- Scalability: Designing systems to scale horizontally (adding more machines) or vertically (adding resources to existing machines) as workloads grow.
- Concurrency Handling: Using Python's concurrency libraries, such as asyncio, threading, or multiprocessing, to efficiently manage many concurrent operations in a distributed environment (a bounded-concurrency sketch follows this list).
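One common pattern, sketched with only the standard library, is bounding concurrency with a semaphore so a burst of work cannot overwhelm a downstream service (the limit of 10 is an arbitrary example):

```python
import asyncio

async def call_service(i: int, limit: asyncio.Semaphore) -> int:
    # The semaphore caps in-flight calls, trading a little latency
    # for predictable load on the downstream service.
    async with limit:
        await asyncio.sleep(0.1)  # stand-in for a remote call
        return i

async def main() -> None:
    limit = asyncio.Semaphore(10)  # at most 10 concurrent calls
    results = await asyncio.gather(*(call_service(i, limit) for i in range(100)))
    print(len(results), "calls completed")

asyncio.run(main())
```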
3. Implementing Consistency Models
- CAP Theorem Consideration: Understanding the trade-offs between consistency, availability, and partition tolerance, and choosing the right model (e.g., eventual consistency, strong consistency) for different components.
- Distributed Databases and Caching: Working with distributed databases (e.g., Cassandra, MongoDB) or distributed caching systems (e.g., Redis, Memcached) to ensure efficient and consistent data storage across nodes (a cache-aside sketch follows this list).
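For example, a cache-aside read with redis-py; it assumes a local Redis instance, and the key scheme, TTL, and database stub are hypothetical:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis

def load_from_database(user_id: str) -> bytes:
    """Stand-in for the real source-of-truth lookup."""
    return f"profile-for-{user_id}".encode()

def get_user_profile(user_id: str) -> bytes:
    """Cache-aside: try the cache first, fall back to the database."""
    key = f"user:{user_id}"        # hypothetical key scheme
    cached = r.get(key)
    if cached is not None:
        return cached
    profile = load_from_database(user_id)
    r.setex(key, 300, profile)     # cache for 5 minutes
    return profile
```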
4. Handling Data Distribution and Synchronization
- Data Replication: Implementing data replication strategies to ensure that data is consistently and reliably available across all nodes.
- Eventual Consistency: Handling scenarios where systems must eventually reach consistency, even if not immediately; this can involve patterns such as event sourcing or message queues (e.g., Kafka).
- State Management: Managing distributed state using coordination tools (e.g., ZooKeeper, Consul) for leader election, configuration management, or distributed locks (a simplified locking sketch follows this list).
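As a simplified illustration of a distributed lock, here using Redis's atomic SET NX EX rather than ZooKeeper or Consul (this is not the full Redlock algorithm and is not safe against every failure mode):

```python
import uuid
import redis

r = redis.Redis()  # assumes a local Redis

def acquire_lock(name: str, ttl_seconds: int = 10) -> str | None:
    """Try to take the lock; the TTL ensures it expires if the holder dies."""
    token = str(uuid.uuid4())  # unique token so only the holder can release
    if r.set(f"lock:{name}", token, nx=True, ex=ttl_seconds):
        return token
    return None  # another process holds the lock

def release_lock(name: str, token: str) -> None:
    """Release only if we still hold the lock (check-and-delete)."""
    key = f"lock:{name}"
    if r.get(key) == token.encode():
        r.delete(key)  # not atomic; production code would use a Lua script
```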
5. Monitoring, Logging, and Troubleshooting
- Distributed Tracing: Implementing tracing (e.g., using OpenTelemetry) to monitor and troubleshoot the flow of requests across multiple services in a distributed environment.
- Logging: Centralizing logs across different distributed nodes using tools like ELK Stack (Elasticsearch, Logstash, and Kibana), Fluentd, or similar solutions.
- Metrics and Alerts: Setting up system monitoring with metrics (e.g., Prometheus, Grafana) and alerting to track the health and performance of the system; a minimal prometheus_client example follows.
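To make this concrete, a minimal metrics endpoint with the official prometheus_client library (metric names and the simulated workload are examples only):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Example metric names; a real service would follow a naming convention.
REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    REQUESTS.inc()
    with LATENCY.time():  # records how long the block takes
        time.sleep(random.uniform(0.01, 0.1))  # simulated work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```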