About this role
Team Overview
The Service Management team provides industry-standard Incident, Problem and Change Management, alongside infrastructure operational support for Aladdin. We operate using modern engineering practices and tooling, including ServiceNow and AI-enabled workflows, and measure outcomes through clear operational metrics.
Incident Management is responsible for restoring service during production incidents and driving scalable stability improvements across BlackRock and its Aladdin clients.
BlackRock operates a 24/7 Major Incident Management function supporting global clients across Europe, the Americas, Asia Pacific and India. This role is based in Edinburgh and is required to cover core European hours between 09:00 and 18:00, Monday to Sunday, with rotational weekend working.
Role
We are seeking an experienced Incident & Problem Manager (5+ years) with a strong passion for technical troubleshooting and the ability to lead multiple simultaneous incidents.
This role exists to deliver rapid time to detect and time to resolve, and to eliminate repeat incidents at a system level by operating an AI-first incident delivery model. The Major Incident & Problem Manager is accountable for turning incidents into measurable stability improvements, particularly those caused by change, and for building an incident operating rhythm where AI handles correlation, classification and narrative generation by default, allowing humans to focus on decision quality, trade-offs and prevention.
In complex distributed platforms, incidents are often slowed by manual triage, fragmented ownership and time-consuming coordination. This role addresses those challenges by creating a decision-centric incident response model, powered by AI-driven signal correlation and automation-first execution, ensuring that:
The right responders are engaged faster
The most likely causes are identified sooner
Mitigation decisions are taken with clearer risk framing
Communications remain accurate and timely
Repeat failures are systematically removed rather than documented
The role partners closely with Engineering and SRE / DevOps teams, leveraging automation, observability tooling and emerging AI-driven insights. The successful candidate will have a DevOps mindset, be able to actively troubleshoot, and utilise and enhance AI and automation.
The role also includes participation in continuous improvement initiatives aimed at improving the stability, performance and resilience of the Aladdin platform, and enhancing Service Management services.
Key Responsibilities
1. Lead major incidents as a decision authority (P1-P4)
Lead end-to-end management of production incidents, including investigation, recovery execution and closure
Run incidents as a decision system, driving clarity on what is known, what is suspected and what action is taken next
Manage multiple simultaneous incidents while maintaining consistent prioritisation and escalation
2. Operate an AI-first incident workflow (human-validated, human-overridden when required)
Triage and categorise incidents using AI-driven classification, with human validation and override where appropriate
Drive AI-automated ticket routing and apply risk-based escalation judgement when automation is insufficient
Ensure incident timelines and summaries are produced to a high standard using AI-generated artefacts, correcting them where required
3. Supervise automated remediation and agentic responders
Supervise automated remediation and agentic responders, intervening to pause, override or redirect when risk requires
Ensure automated remediation is safe, auditable and aligned with service ownership and operational readiness
4. Manage a robust Problem Management process to prevent incident recurrence
Ensure root causes and preventative actions are clearly captured and translated into an effective Problem Management process
Identify incident trends and repeat patterns, driving scalable remediation to reduce recurrence
Partner with Engineering and SRE / DevOps to embed learnings into automation, observability, runbooks and readiness controls
Design, build and actively maintain a Known Error Database that functions as a real-time operational asset
Work with product teams to design, build and deliver a meaningful process for addressing repeat incidents
5. Deliver executive-grade communications (AI-drafted, human-approved)
Validate, approve and issue regular communications that are concise, informative and appropriate for stakeholders
Ensure communications accurately reflect impact, mitigation progress, key risks and confidence-based ETAs
6. Drive continuous service improvement and regulatory alignment
Drive process and tooling changes that support operational resilience and regulatory requirements, including DORA and GDPR, where applicable
Provide input and ownership for continual service improvement initiatives, with a primary focus on Agentic AI and its application to Incident Management
Required Experience and Capabilities (Must Have)
5+ years' experience in Incident and Problem Management within a production environment supporting business-critical platforms
Strong technical troubleshooting capability, with the ability to engage credibly with engineers during complex failures
Proven ability to lead multiple simultaneous incidents and drive structured recovery under pressure
DevOps mindset, with comfort using observability tooling,