NVIDIA and Deutsche Telekom are jointly developing the world’s first industrial AI cloud for European manufacturers. This AI factory in Germany will host 10,000 GPUs across NVIDIA DGX B200 systems and RTX Pro Servers. Deutsche Telekom provides secure, sovereign and fast infrastructure, including data centers, operations, security, and AI solutions.
Role Overview:
We are seeking an Enterprise Architect for Network Infrastructure at the Industrial AI Cloud to design, build, and automate the network platform, covering the automation and operation of network components such as switches, firewalls, routers, and border gateways that form part of the core environment of the Industrial AI Cloud. In this role you will design, provision, and manage the above-mentioned stack, implement and fine-tune monitoring, and deploy additional components if necessary. You will work with and coordinate between multiple teams (such as Infrastructure and Platform) to deliver and continuously improve infrastructure services following ITIL processes.
The Enterprise Architect defines the design that enables automated configuration management, release management, build, test, and deployment activities. This is a customer-facing role that includes consultancy and the delivery of tailor-made solutions and implementations for the customer. Proprietary technologies used for managing the above scope: InfiniBand, Cumulus OS, RoCE, UFM, FortiGate firewalls, Cisco border gateways.
WHAT WILL YOU DO?
- Coordinate Operations together with Data Center, IaaS & PaaS layers: Coordinate and support network lifecycle activities (installs, upgrades, changes, firmware updates) and manage network interconnections and related documentation.
- Switch & Firewall Management: Provision and maintain InfiniBand switches according to ITIL Standards
- Automation: Develop and maintain automation scripts to orchestrate the overall scope; carry out fine-tuning and configuration changes throughout the project lifetime.
- OS & Firmware Management: Maintain network-based environments, apply patches, and manage firmware upgrades at scale.
- Monitoring & Observability: Implement and fine-tune monitoring for the network stack.
- ITIL Processes: Follow and improve incident, problem, and change management workflows; document runbooks and standard operating procedures. Adhere to ZERO Outage guidelines.
- Cross-Team Collaboration: Work closely with Platform Engineers and AI solution teams to ensure smooth deployments and operations.
- Manage High-Speed Fabric: A unified network fabric utilizing both InfiniBand and Ethernet / RoCE technologies.
- Management Network: A separate 1 Gbps Ethernet and serial console for out-of-band (OOB) network management.
- PE/CE datacenter connectivity: CE routers, firewalls
- Design, develop, test, implement and support ICT components and applications to deliver a product portfolio that meets quality standards on the AI Factory Cloud platform.
- Build and develop concepts, processes and methods for automation, optimization, and standardization to satisfy efficiency and automation requirements.
- Provide advice or information, on request or on your own initiative, to all relevant employees or customers regarding technical aspects of products.
- Provide project deliverables to fulfil the project scope.
- Consult on and implement new, innovative technologies in line with the innovation strategy.
- Provide overall solutions and principles in planning, developing, and implementing new products to satisfy business requirements.
- Design, develop, and implement architecture of services based on AI Factory Cloud platform requirements.
- Mentor and train co-workers to share knowledge and develop their skills.
- Act as key technical lead, resolving issues and coordinating activities across related technologies and outside your own team.
- Provide consulting services to project teams on areas of expertise.
- Conduct research and development in the assigned technology, determine business requirements, propose changes and develop implementation plans.
- The network architect must demonstrate expertise and experience with the following technologies, as our environment relies on their optimal configuration and management: InfiniBand, RoCE, UFM, FortiGate firewalls and Cisco border gateways.
We expect:
- Initial provisioning
- Daily operation (we expect 24/7 operations with on-call duty)
- Configuration, upgrades, patch management (including critical patch management), incident handling
- Fine-tuning and configuration changes throughout the project lifetime
- Hardware case handling
We are seeking an Enterprise Architect - Network Architect to provide 24/7/365, EU-based support (required by the product's sovereignty requirements) for our high-performance Industrial AI Cloud network fabric.
The scope is to provide ongoing operations, maintenance, troubleshooting, and configuration management for the network switches, gateways, and DPUs that comprise the core interconnect of all devices.
All hardware vendor support for the compute servers (NVIDIA DGX, RTX, management servers) is handled separately; this request is solely for the network infrastructure.
The project will be deployed in a 3rd-party datacenter at Polarise, Tucherpark, Munich. Initial connectivity will be via IP Transit AS3320 from DTAG with a private AS for in-band connection and a local internet breakout for out-of-band management. We will not have AdminLAN access initially.
Networking:
- High-Speed Fabric: A unified network fabric utilizing both InfiniBand and Ethernet / RoCE technologies.
- Management Network: A separate 1 Gbps Ethernet and serial console for out-of-band (OOB) network management.
- PE/CE datacenter connectivity: CE routers, firewalls