We are looking for a talented Systems Engineer - AI/OPS to join our team and help build and operate our AI/OPs platform. The successful candidate will design, implement, and maintain the infrastructure that supports our AI/OPs applications, ensuring high availability, scalability, and performance. The ideal candidate has a strong background in systems engineering, AI/OPs, and cloud computing, with hands-on experience building large-scale distributed systems.
Key Responsibilities:
Infrastructure design: Design and implement scalable, highly available, and secure infrastructure to support AI/OPs applications.
Cloud computing: Apply expertise in cloud platforms such as AWS, Azure, or Google Cloud to design and implement cloud-based infrastructure.
Distributed systems: Design and implement large-scale distributed systems, including load balancing, caching, and message queuing.
Monitoring and observability: Implement and operate the monitoring, observability, and access-management tooling that keeps AI/OPs applications healthy and performant, including:
Active Directory for identity and access management
PRTG for network monitoring
Prometheus for metrics collection and alerting
Grafana for data visualization and dashboards
Automation: Implement automation and workflow tools to streamline deployment, scaling, and management of AI/OPs applications, including:
Jira for project management and issue tracking
Jira Service Management (JSM) for service and incident management
Collaboration: Collaborate with cross-functional teams, including engineering, product management, and operations, to ensure alignment and successful delivery of AI/OPs projects.
Troubleshooting: Troubleshoot complex issues related to AI/OPs applications and infrastructure.
Documentation: Maintain accurate and up-to-date documentation of AI/OPs infrastructure and applications.
Security: Ensure the security and compliance of AI/OPs infrastructure and applications.
Continuous learning: Stay current with industry trends and emerging technologies, and apply this knowledge to improve AI/OPs infrastructure and applications.