Overall objectives
Ensure proactive detection, diagnosis and resolution of service health issues across all IT environments
Establish a modern observability function that delivers full visibility into critical services, applications and infra layers
Own and lead the major incident management process, ensuring rapid containment, clear communication and structured resolution
Drive actionable insights through metrics, traces and logs (MTL) and ensure system health telemetry is used to improve availability, performance and user experience
Support operational risk reduction and continuous improvement through RCA, trend reporting and resilience engineering
Job scope
Role-specific responsibilities
Monitoring and observability engineering
Alerting, noise reduction and event correlation
Incident management
Post-incident review and RCA
Dashboarding and health visibility
Service reliability metrics
General functional responsibilities
Define the observability architecture strategy, ensuring scalability, data security and cost optimisation
Collaborate with app, infra and security teams to ensure instrumentation coverage and logging compliance
Maintain operational documentation, runbooks, escalation matrices and incident playbooks
Drive a blameless culture of improvement and incident learning
Align monitoring practices with regulatory and compliance obligations
Represent the observability and incident management function at governance forums
Engage with vendors, SaaS providers and cloud platforms to ensure integration with internal monitoring and incident workflows
Coach and mentor monitoring and incident managers to raise maturity across people, processes and tooling
Qualifications:
Core competencies required
- Deep expertise in monitoring platforms (e.g. ELK, AppDynamics, Grafana, Elastic, Datadog), APM, synthetic monitoring and log aggregation
- Solid understanding of distributed systems, microservices and hybrid cloud environments
- Strong command of SRE, telemetry pipelines, SLI/SLO and alerting strategies
- Experience running 24/7 incident command processes, leading war rooms, managing comms to executives and driving post-mortems
- Ability to align observability practices to business-critical services and customer impact, not just infra health
- Mastery of ITIL event management and incident management with ITSM platforms like ServiceNow
- Calm, decisive leadership in high-pressure scenarios; excellent cross-functional coordination and communication skills
- Overall 15 years of technology experience is desirable
Remote Work:
No
Employment Type:
Full-time