MLOps Infrastructure Engineer

Join our team as an MLOps Infrastructure Engineer, where you'll design and deploy a high-performance platform for distributed machine learning. You'll work with cloud and Kubernetes architecture, develop internal tools for MLOps, and implement DevOps best practices. This role requires 3-4 years of experience in cloud infrastructure, DevOps, or MLOps, as well as proficiency in Kubernetes, cloud GPU management, Python, and CI/CD. Bonus skills include low-level optimization, backend/API experience, and designing partner-facing tools.

Summary suggested by Welcome to the Jungle

Permanent contract (CDI)
Paris
Occasional remote work
Salary: Not specified
Experience: > 5 years
Education: Master's degree (Bac +5)
Key responsibilities

Design and deploy a platform that makes GPUs, clusters, and distributed training transparent.

Develop and improve the internal orchestrator to simplify distributed training.

Implement Infrastructure-as-Code (Terraform/Pulumi) for reproducibility and scalability.

Sigma Nova

The position

Job description

The Challenge: Build the Platform That Powers Research and Beyond

Your mission: Design and deploy the platform that makes GPUs, clusters, and distributed training transparent, not just for internal research, but also as a foundation for monetizable capabilities (e.g., managed training services, optimised inference pipelines for partners).

What You’ll Do

  • Cloud & Kubernetes Architecture:

    • Build and maintain a high-performance, multi-tenant environment on Scaleway and GENCI, optimised for distributed ML.

    • Deploy and supervise a Slurm cluster for research workloads, ensuring seamless integration with Scaleway’s infrastructure.

    • Automate scaling, resource allocation, and cost management to avoid technical debt.

  • MLOps & Internal Tools:

    • Develop and enhance our internal orchestrator to simplify distributed training (FSDP, data pipelines) for both researchers and external users.

    • Create reusable frameworks for monitoring, logging, efficiency, and cost tracking.

    • Collaborate with research teams to industrialise workflows (e.g., model alignment, large-scale finetuning) and package them as deployable capabilities.

  • DevOps & Software Craftsmanship:

    • Implement Infrastructure-as-Code (Terraform/Pulumi) for reproducibility and scalability.

    • Write clean, typed, and documented Python code.

    • Troubleshoot at the intersection of hardware (GPUs, networking) and software (PyTorch, CUDA), ensuring robustness for both internal and external use cases.


Desired profile

Key Skills

  • Experience: 3–4 years in cloud infrastructure, DevOps, or MLOps (research or industry).

  • Technologies:

    • Kubernetes/Docker: Advanced orchestration and containerization.

    • Cloud GPU Management: Scaleway, AWS/GCP (clusters, networking, storage).

    • Python: Proficiency in PEP standards, typing, and testing.

    • MLOps: Data pipelines, distributed training (PyTorch, FSDP), monitoring.

    • CI/CD: Pipeline setup and maintenance.

    • English: Fluent (the team speaks English day to day).

Bonus Skills

  • Low-level optimisation (Triton, CUDA), HPC, or large-scale training experience.

  • Backend/APIs (FastAPI, gRPC) for exposing models or services.

  • Experience designing partner-facing tools or managed services.

Beyond Technical Skills:

While technical excellence is critical, we place equal importance on how we work together. We believe the best teams are built on:

  • Integrity & Respect

    • We strive for honesty, kindness, and fairness. We value people who treat others with dignity and foster an environment where everyone feels heard.
  • Open Communication & Humility

    • Great ideas come from collaboration. We look for teammates who listen actively, communicate clearly, and approach challenges with self-awareness and humility.
  • Psychological Safety & Camaraderie

    • We strive to create a space where people feel safe to take risks, ask questions, and grow.

Interview process

  • Prescreen with Paul (Head of People)

  • Technical Screen with one Research Scientist or Research Engineer

  • On-site (take-home exercise and presentation of results OR on-site live interviews + behavioural interview)
