Leading the SRE team, setting objectives, and guiding the team towards high reliability while balancing cost against performance SLAs.
Collaborating with platform & product engineering teams to embed reliability and operational best practices into the software development lifecycle.
Developing and implementing SRE policies and practices, including service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Driving automation across operations to reduce toil, improve system performance, and ensure scalability, with a healthy allergy to repetitive manual work.
Overseeing incident management, post-mortem analyses, and root cause investigations to prevent future outages and enhance system reliability.
Facilitating capacity planning and scalability exercises to manage growth and ensure the efficient use of resources.
Facilitating disaster recovery planning & testing to ensure business continuity for our customers’ webstores.
Encouraging a culture of continuous improvement by mentoring team members and fostering innovation within the team.
Staying up to date with the latest trends and technologies in SRE and advocating for their adoption where appropriate.