About this role

Team Overview

The Service Management team provides industry‑standard Incident, Problem and Change Management, alongside infrastructure operational support for Aladdin. We operate using modern engineering practices and tooling, including ServiceNow and AI‑enabled workflows, and measure outcomes through clear operational metrics.

Incident Management is responsible for restoring service during production incidents and driving scalable stability improvements across BlackRock and its Aladdin clients.

BlackRock operates a 24/7 Major Incident Management function supporting global clients across Europe, the Americas, Asia Pacific and India. This role is based in Edinburgh and is required to cover core European hours between 09:00 and 18:00, Monday to Sunday, with rotational weekend working.

Role

We are seeking an experienced Incident & Problem Manager (5+ years) with a strong passion for technical troubleshooting and the ability to lead multiple simultaneous incidents.

This role exists to deliver rapid time to detect and time to resolve, and to eliminate repeat incidents at a system level by operating an AI‑first incident delivery model. The Major Incident & Problem Manager is accountable for turning incidents into measurable stability improvements, particularly those caused by change, and for building an incident operating rhythm where AI handles correlation, classification and narrative generation by default, allowing humans to focus on decision quality, trade‑offs and prevention.

In complex distributed platforms, incidents are often slowed by manual triage, fragmented ownership and time‑consuming coordination. This role addresses those challenges by creating a decision‑centric incident response model, powered by AI‑driven signal correlation and automation‑first execution, ensuring that:

The right responders are engaged faster

The most likely causes are identified sooner

Mitigation decisions are taken with clearer risk framing

Communications remain accurate and timely

Repeat failures are systematically removed rather than documented

The role partners closely with Engineering and SRE / DevOps teams, leveraging automation, observability tooling and emerging AI‑driven insights. The successful candidate will have a DevOps mindset, be able to actively troubleshoot, and utilise and enhance AI and automation.

The role also includes participation in continuous improvement initiatives aimed at improving the stability, performance and resilience of the Aladdin platform, and enhancing Service Management services.

Key Responsibilities

1. Lead major incidents as a decision authority (P1–P4)

Lead end‑to‑end management of production incidents, including investigation, recovery execution and closure

Run incidents as a decision system, driving clarity on what is known, what is suspected and what action is taken next

Manage multiple simultaneous incidents while maintaining consistent prioritisation and escalation

2. Operate an AI‑first incident workflow (human‑validated, human‑overridden when required)

Triage and categorise incidents using AI‑driven classification, with human validation and override where appropriate

Drive AI‑automated ticket routing and apply risk‑based escalation judgement when automation is insufficient

Ensure incident timelines and summaries are produced to a high standard using AI‑generated artefacts, correcting them where required

3. Supervise automated remediation and agentic responders

Supervise automated remediation and agentic responders, intervening to pause, override or redirect when risk requires

Ensure automated remediation is safe, auditable and aligned with service ownership and operational readiness

4. Manage a robust Problem Management process to prevent incident recurrence

Ensure root causes and preventative actions are clearly captured and translated into an effective Problem Management process

Identify incident trends and repeat patterns, driving scalable remediation to reduce recurrence

Partner with Engineering and SRE / DevOps to embed learnings into automation, observability, runbooks and readiness controls

Design, build and actively maintain a Known Error Database that functions as a real‑time operational asset

Work with product teams to design, build and deliver a meaningful process for addressing repeat incidents

5. Deliver executive‑grade communications (AI‑drafted, human‑approved)

Validate, approve and issue regular communications that are concise, informative and appropriate for stakeholders

Ensure communications accurately reflect impact, mitigation progress, key risks and confidence‑based ETAs

6. Drive continuous service improvement and regulatory alignment

Drive process and tooling changes that support operational resilience and regulatory requirements, including DORA and GDPR, where applicable

Provide input and ownership for continual service improvement initiatives, with a primary focus on Agentic AI and its application to Incident Management

Required Experience and Capabilities (Must Have)

5+ years’ experience in Incident and Problem Management within a production environment supporting business‑critical platforms

Strong technical troubleshooting capability, with the ability to engage credibly with engineers during complex failures

Proven ability to lead multiple simultaneous incidents and drive structured recovery under pressure

DevOps mindset, with comfort using observability tooling,

Major Incident and Problem Manager, Associate
