Senior Engineer - Alerting & Incident Management

Employer active

Job location

Abu Dhabi - UAE

Monthly salary

Not disclosed

Number of vacancies

1 vacancy

Job description

Overall objectives

To establish and maintain an effective, intelligent, and timely alerting framework across infrastructure, application, and business services.

To coordinate and continuously improve the incident management lifecycle, with a focus on early detection, rapid response, and root cause accountability.

To integrate observability data (logs, metrics, traces) into a unified alerting and incident response workflow.

To reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through automation, clear escalation paths, and operational discipline.

Role specific responsibilities

Manage and continuously improve the incident response process, including triage, escalation, status communications, and resolution tracking.

Act as the incident commander during major outages or high-severity issues, coordinating technical teams toward resolution.

Maintain and govern on-call schedules, escalation paths, and responder playbooks.

Integrate observability tools with incident management platforms to enable real-time contextual alerting.

Lead and document root cause analysis (RCA) and ensure completion of follow-up actions and preventive measures.

Report on incident metrics and trends, identifying areas for resilience and process improvement.

General functional responsibilities

Maintain detailed documentation on alert rules, incident workflows, contact rosters, and escalation trees.

Ensure compliance with regulatory, audit, and risk management requirements related to incident response and system availability.

Collaborate with monitoring, logging, and APM peers to align telemetry signals with operational response.

Work with development, infrastructure, and support teams to embed alert and incident management best practices in the SDLC and change management.

Participate in regular incident simulations and on-call readiness drills.

Drive continuous improvement through retrospective reviews, blameless post-mortems, and incident automation.


Qualifications:

Core competencies required

Strong experience with alert management platforms such as Opsgenie, Splunk On-Call, ServiceNow Event Management, or VictorOps.

Familiarity with routing rules, escalation policies, noise suppression, on-call schedules, and alert deduplication.

Deep understanding of the end-to-end incident management process: detection, triage, escalation, communication, and closure.

Proficient in running major incident bridges, documenting timelines, and leading post-incident reviews (PIRs/RCAs).

Calm and assertive in high-pressure incident scenarios.

Excellent communicator, able to coordinate with technical and business stakeholders during incidents.


Remote Work:

No


Employment Type:

Full-time
