MLOps Infrastructure Engineer

Join our team as an MLOps Infrastructure Engineer, where you'll design and deploy a high-performance platform for distributed machine learning. You'll work with cloud and Kubernetes architecture, develop internal tools for MLOps, and implement DevOps best practices. This role requires 3-4 years of experience in cloud infrastructure, DevOps, or MLOps, as well as proficiency in Kubernetes, cloud GPU management, Python, and CI/CD. Bonus skills include low-level optimization, backend/API experience, and designing partner-facing tools.

Permanent contract
Paris
Occasional remote work
Salary: Not specified
Experience: > 5 years
Education: Bachelor's / Master's degree
Key missions

Design and deploy a platform that makes GPUs, clusters, and distributed training transparent.

Develop and improve the internal orchestrator to simplify distributed training.

Implement Infrastructure-as-Code (Terraform/Pulumi) for reproducibility and scalability.

Sigma Nova

The role

Job description

The Challenge: Build the Platform That Powers Research and Beyond

Your mission: Design and deploy the platform that makes GPUs, clusters, and distributed training transparent, not just for internal research, but also as a foundation for monetizable capabilities (e.g., managed training services, optimised inference pipelines for partners).

What You’ll Do

  • Cloud & Kubernetes Architecture:

    • Build and maintain a high-performance, multi-tenant environment on Scaleway and GENCI, optimised for distributed ML.

    • Deploy and supervise a Slurm cluster for research workloads, ensuring seamless integration with Scaleway’s infrastructure.

    • Automate scaling, resource allocation, and cost management to avoid technical debt.

  • MLOps & Internal Tools:

    • Develop and enhance our internal orchestrator to simplify distributed training (FSDP, data pipelines) for both researchers and external users.

    • Create reusable frameworks for monitoring, logging, efficiency, and cost tracking.

    • Collaborate with research teams to industrialise workflows (e.g., model alignment, large-scale finetuning) and package them as deployable capabilities.

  • DevOps & Software Craftsmanship:

    • Implement Infrastructure-as-Code (Terraform/Pulumi) for reproducibility and scalability.

    • Write clean, typed, and documented Python code.

    • Troubleshoot at the intersection of hardware (GPUs, networking) and software (PyTorch, CUDA), ensuring robustness for both internal and external use cases.


Requirements

Key Skills

  • Experience: 3–4 years in cloud infrastructure, DevOps, or MLOps (research or industry).

  • Technologies:

    • Kubernetes/Docker: Advanced orchestration and containerization.

    • Cloud GPU Management: Scaleway, AWS/GCP (clusters, networking, storage).

    • Python: Proficiency in PEP standards, typing, and testing.

    • MLOps: Data pipelines, distributed training (PyTorch, FSDP), monitoring.

    • CI/CD: Pipeline setup and maintenance.

    • Fluent English (the team works in English day to day).

Bonus Skills

  • Low-level optimisation (Triton, CUDA), HPC, or large-scale training experience.

  • Backend/APIs (FastAPI, gRPC) for exposing models or services.

  • Experience designing partner-facing tools or managed services.

Beyond Technical Skills:

While technical excellence is critical, we place equal importance on how we work together. We believe the best teams are built on:

  • Integrity & Respect

    • We strive for honesty, kindness, and fairness. We value people who treat others with dignity and foster an environment where everyone feels heard.
  • Open Communication & Humility

    • Great ideas come from collaboration. We look for teammates who listen actively, communicate clearly, and approach challenges with self-awareness and humility.
  • Psychological Safety & Camaraderie

    • We strive to create a space where people feel safe to take risks, ask questions, and grow.

Recruitment process

  • Prescreen with Paul (Head of People)

  • Technical Screen with one Research Scientist or Research Engineer

  • On-site (take-home exercise and presentation of results OR on-site live interviews + behavioural interview)
