Roche is hiring a Workload Orchestration Engineer

Job Description

At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.

The Position

Job description

As a Workload Orchestration Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our workload orchestration tech stack across both our High-Performance Computing (HPC) and industry-leading AI Factory platforms. With the rapid expansion of our compute infrastructure, efficiently scheduling, managing, and maximizing the utilization of our CPU and GPU environments is paramount.

You will own the deployment, configuration, and fine-tuning of orchestration platforms that schedule massive, parallel computational workloads. By implementing robust scheduling policies for traditional scientific workflows and modern containerized AI workloads, you will bridge the gap between heavy compute capacity and efficient execution. Your work will directly ensure that Roche’s researchers, data scientists, and engineers can seamlessly run large-scale AI model training and computational science simulations.

Description of the area

Hosting and Infrastructure (HI) provides mission-critical on-premise infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.

The Value Streams - Accelerated Compute Engineering (ACE) Team is focused on driving both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI Infrastructure supporting AI and HPC use cases across Roche. This team facilitates seamless onboarding and adoption for business vertical customers that need accelerated compute. It supports infrastructure consumers whose needs center on high availability, seamless data transfer, flexibility, speed, and the rapidly changing demands of AI, helping them achieve rapid time-to-value.

Job Responsibilities

Orchestration Stack Deployment & Governance

  • Design, implement, and maintain the SLURM Workload Manager ecosystem across our HPC cluster architectures, ensuring high availability and optimal resource distribution.

  • Deploy and manage Run:ai as the core orchestration and virtualization layer for the AI Factory, enabling fractional GPU allocation and dynamic resource allocation.

  • Evaluate, architect, and implement SLURM Slinky integrations where required to seamlessly bridge Kubernetes-based AI orchestration with traditional HPC cluster resources.

Containerization & Workload Optimization

  • Define best practices and frameworks for containerized scientific execution, utilizing Singularity/Apptainer and/or Enroot to provide secure, reproducible performance environments for HPC.

  • Translate user and workload requirements into optimized scheduling parameters (e.g., topology-aware scheduling, multi-node scaling).

  • Actively profile and tune scheduling queues, quality-of-service (QoS) parameters, and fair-share policies to maximize multi-tenant efficiency.

Platform Reliability & Telemetry

  • Partner with Observability Engineers to implement continuous monitoring, telemetry, and reporting dashboards to track scheduler efficiency, queue wait times, and hardware utilization rates.

  • Troubleshoot complex workload failures, including distributed training synchronization issues, MPI communication bottlenecks, and driver incompatibilities.

  • Maintain configuration-as-code models for the scheduling tier, leveraging automation to deploy cluster policies uniformly.

Qualifications

Education / Experience

  • Bachelor’s or an advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or a similar technical discipline.  

  • 5+ years of systems engineering experience, with a heavy emphasis on workload scheduling, resource management, and cluster optimization for multi-tenant environments.

  • Deep technical familiarity with Enterprise Linux operating systems and distributed systems architecture.

  • HPC Scheduling & Tooling: Expert-level proficiency in administering SLURM, including complex partition designs, accounting, and plug-in management. Highly proficient with Singularity for container runtime execution.

  • AI Orchestration: Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling paradigms.

  • Infrastructure Literacy: Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL) as they relate to job placement.

  • Automation: Proficiency in automating scheduler configurations and telemetry gathering, or infrastructure automation tooling.

Leadership & Mindset

  • Lean & Agile Mindset: Highly focused on driving efficiency, reducing idle compute time, and creating frictionless pathways for user workload submissions.  

  • Collaboration & Advocacy: Outstanding capability to translate scientific and AI model workflow challenges into scalable scheduler configurations. 

  • Intellectual Curiosity: A strong passion for remaining ahead of industry trends regarding GPU slicing, fractionalization, and the convergence of AI workloads with traditional HPC schedulers. 

Who we are

A healthier future drives us to innovate. Together, more than 100,000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.


Let’s build a healthier future, together.

Roche is an Equal Opportunity Employer.
