How we migrated from AWS Glue to Snowflake and dbt

Today, I’ll tell you about our ETL migration from AWS Glue to a modern data stack built on Snowflake and dbt. My team’s mission is to centralize data from multiple business domains and ensure its reliability. We build an analytical foundation to support the decision-making of product and strategic teams. As part of this effort, we migrated 28 gold models that power 13 data products, including dashboards, profiling tools, and analytical studies. These models serve as a cornerstone for enabling data-driven insights.

Initial situation: a degraded work environment

  1. The lack of documentation and governance made errors difficult to log and trace. There was also no proper development environment: each change posed a risk to production, leading to constant stress for developers.

  2. Data transformations ran on AWS Glue with Spark SQL: complex jobs that required a time-consuming environment to set up, which made it impossible to test data quality during development.

  3. All users had the same level of access to data, regardless of their role within the organization. This lack of access control made permission management unclear and made it impossible to ensure proper data security.

A Welcome Migration

We opted for a data stack combining dbt and Snowflake: dbt is a framework for data transformation, quality testing, and documentation, while Snowflake is a cloud-based data warehouse. This solution provides dedicated environments, data quality tests, and documentation as code via dbt, and it also enables stronger data governance.

The Impact on Data Engineers’ Daily Work

As Data Engineers, we undertook the migration of all transformations to automate data processes such as cleaning and transformation, while also rebuilding the dashboards and KPIs shared with our clients.

The Previous Solution

Each transformation was stored in a Git repository and duplicated on AWS as a Glue job. These Python jobs used a Spark environment and were orchestrated in AWS Workflows. All the data within the medallion architecture was registered in the AWS Glue Data Catalog.

To test each transformation, I ran the scripts directly in production, without CI/CD. Each developer had to install a Jupyter Notebook and a Glue PySpark kernel, which took about 40 seconds to initialize. This process was our only way to iterate quickly and validate the completeness and accuracy of data transformations. The lack of CI/CD prevented us from detecting errors before production deployment 😈

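The Glue notebook setup itself isn’t reproduced here. As a purely hypothetical sketch (database, table, and bucket names are illustrative, not our actual jobs), one of these Spark SQL Glue jobs could have looked roughly like this:

  import sys

  from awsglue.context import GlueContext
  from awsglue.job import Job
  from awsglue.utils import getResolvedOptions
  from pyspark.context import SparkContext

  # Standard Glue job boilerplate: resolve arguments and create the Spark session
  args = getResolvedOptions(sys.argv, ["JOB_NAME"])
  glue_context = GlueContext(SparkContext.getOrCreate())
  spark = glue_context.spark_session
  job = Job(glue_context)
  job.init(args["JOB_NAME"], args)

  # Read a bronze table registered in the Glue Data Catalog and expose it to Spark SQL
  bronze_orders = glue_context.create_dynamic_frame.from_catalog(
      database="bronze", table_name="orders"
  ).toDF()
  bronze_orders.createOrReplaceTempView("bronze_orders")

  # The transformation itself was expressed as Spark SQL
  silver_orders = spark.sql("""
      select order_id, customer_id, cast(amount as double) as amount
      from bronze_orders
      where order_id is not null
  """)

  # Write the silver output back to the data lake (illustrative bucket)
  silver_orders.write.mode("overwrite").parquet("s3://my-data-lake/silver/orders/")
  job.commit()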

The New Solution

With dbt, I was able to document our entire data architecture using its documentation-as-code approach. I was also able to implement quality tests and unit tests to validate our transformations and detect bugs. It became possible to preview the data, test model execution, and view data lineage through the dbt Power User extension. And all of this in a dedicated development environment for our entire medallion architecture!
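
For instance, the local development loop now boils down to a couple of dbt commands run against the dev target (my_model is a placeholder):

  # Build one model plus its upstream dependencies and run its tests, on the dev target
  dbt build --select +my_model --target dev

  # Generate and browse documentation and lineage locally
  dbt docs generate
  dbt docs serve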

Finally, Snowflake offered us more precise resource control through its Billing and Cost Management tooling, allowing us to optimize consumption and costs. Compared to AWS Glue, where auto-scaling can drive costs up, we expect a reduction in our infrastructure costs.


Preparation: Mapping and Planning the Migration

Our migration began with setting up the Snowflake environments and access. We built the dbt repository and granted users Snowflake access based on their roles.

The Iso-functional Migration of Data Products

Should we break everything and rebuild it? 🧐 How could we make sure we didn’t forget any data model? 😱 How could we reassure Justine that her dashboard was still reliable? How could we explain to Thomas that his key metric was now multiplied by 5?

To eliminate these risks, we froze and listed the entities to be migrated and described them according to the medallion architecture. Our solution architecture consists of data in Bronze, Silver, and Gold states (a classic ETL) plus dashboarding and KPI products consumed in Qlik and Metabase. Listing everything we needed to migrate felt daunting, but without it we would have had no control over the migration and no way to measure it.

Our strategy:

  • Migrate the Bronze → Silver transformations

  • Once the Silver layer is reliable, migrate the Silver → Gold transformations

  • Once the Gold layer is reliable, migrate dashboards

For each entity to migrate, we had to capture the existing behavior and validate the equivalence of the reproduced model using the Golden Master refactoring strategy. We modularized the transformations using data modeling: an excellent source of documentation that helped us understand the challenges of the analytical work and allowed us to move quickly and efficiently.

We translated the Python scripts written with Spark SQL into Snowflake SQL models in dbt. For each model, we used an in-house 🏠 migration monitoring script that let us compare the migrated data with the source data. The script below connects to both the AWS and Snowflake environments and compares the resulting dataframes!

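The in-house script itself isn’t reproduced in this article. Below is a minimal sketch of the idea, under a few assumptions of ours: the legacy Glue Catalog tables are queryable through Athena with awswrangler, Snowflake credentials are provided as environment variables, and dim_actors stands in for a real model name.

  import os

  import awswrangler as wr
  import pandas as pd
  import snowflake.connector

  MODEL = "dim_actors"  # hypothetical model name

  # Legacy side: read the Glue Data Catalog table through Athena
  legacy_df = wr.athena.read_sql_query(sql=f"select * from {MODEL}", database="gold")

  # Migrated side: read the dbt model materialized in Snowflake
  conn = snowflake.connector.connect(
      account=os.environ["SNOWFLAKE_ACCOUNT"],
      user=os.environ["SNOWFLAKE_USER"],
      password=os.environ["SNOWFLAKE_PASSWORD"],
      warehouse="ANALYTICS_WH",  # illustrative warehouse/database/schema
      database="GOLD",
      schema="PUBLIC",
  )
  migrated_df = conn.cursor().execute(f"select * from {MODEL}").fetch_pandas_all()

  # Align column names and row order before comparing
  legacy_df.columns = [c.lower() for c in legacy_df.columns]
  migrated_df.columns = [c.lower() for c in migrated_df.columns]
  common_cols = sorted(set(legacy_df.columns) & set(migrated_df.columns))
  legacy_df = legacy_df[common_cols].sort_values(common_cols).reset_index(drop=True)
  migrated_df = migrated_df[common_cols].sort_values(common_cols).reset_index(drop=True)

  # Fails loudly if the migrated model diverges from the legacy data
  pd.testing.assert_frame_equal(legacy_df, migrated_df, check_dtype=False)
  print(f"{MODEL}: migrated data matches the legacy data ✅")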

Deep dive into dbt challenges and solutions

In practical terms, how did it go for the Data Engineers? Migrating the transformations was pretty easy. The real challenge was understanding and justifying the differences found once the models were migrated.

Because the legacy code was poorly understood and lacked documentation, we had to interview the users of the legacy models. Some models were discarded, while others were rebuilt by redefining the consumers’ needs.

For example, for weekly or monthly aggregated KPIs, we built calendar tables. These temporal tables required a complete overhaul, but Snowflake’s advanced SQL features, such as recursive queries, made the work much easier.

Tips for a recursive calendar table with Snowflake SQL

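The original snippet isn’t reproduced here; a minimal recursive calendar built with a Snowflake recursive CTE could look like the following (the date boundaries are illustrative, and very long ranges may require raising Snowflake’s recursion limit):

  with recursive calendar as (
      select '2025-01-01'::date as calendar_date
      union all
      select dateadd(day, 1, calendar_date)
      from calendar
      where calendar_date < '2025-03-31'::date
  )
  select
      calendar_date,
      year(calendar_date)                as calendar_year,
      month(calendar_date)               as calendar_month,
      weekofyear(calendar_date)          as calendar_week,
      date_trunc('week', calendar_date)  as week_start_date,
      date_trunc('month', calendar_date) as month_start_date
  from calendar

Such a table can then be joined to fact models to build the weekly and monthly aggregations.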

Separating DEV and PROD environments

Given the lack of separate environments in the old architecture, we had to set up distinct environments for development and production to ensure that new features didn’t impact the production solution. To achieve this, we designed the Snowflake architecture by isolating environments and adjusting the impacted entities, such as schemas, databases, and warehouses. We created databases for each state of the medallion architecture and each environment: Silver, Silver DEV, Gold, Gold DEV.

With dbt, we defined targets in the profiles.yml configuration file, allowing changes to be built in the development environment without affecting production. The CI/CD pipeline manages the production environment by explicitly using the prod target when changes are merged.

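The actual profiles.yml isn’t shown in this article; a hedged sketch of such a configuration (profile, role, warehouse, and database names are illustrative) could look like this:

  theodo_analytics:        # hypothetical profile name
    target: dev            # default target used for local development
    outputs:
      dev:
        type: snowflake
        account: "<account_identifier>"
        user: "{{ env_var('SNOWFLAKE_USER') }}"
        password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
        role: DEVELOPER
        warehouse: DEV_WH
        database: GOLD_DEV
        schema: dbt_dev
        threads: 4
      prod:
        type: snowflake
        account: "<account_identifier>"
        user: "{{ env_var('SNOWFLAKE_USER') }}"
        password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
        role: TRANSFORMER
        warehouse: PROD_WH
        database: GOLD
        schema: analytics
        threads: 8

With such a setup, developers work on the dev target by default, while the CI/CD pipeline can run dbt build --target prod once a change is merged.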

To further automate the validation of impacts on data models, we introduced the create_volumetric_report macro. This macro generates a comprehensive report on key data metrics, including row count, distinct values, and null percentages, for each column in the model. This allows us to automatically check for any changes that could affect data quality across environments.

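The macro body isn’t reproduced in the article; as a rough sketch of the idea (all names are ours), such a macro could iterate over a model’s columns and union one metrics row per column:

  {% macro create_volumetric_report(model_name) %}
      {%- set relation = ref(model_name) -%}
      {%- set columns = adapter.get_columns_in_relation(relation) -%}
      {% for column in columns %}
      select
          '{{ model_name }}'                as model_name,
          '{{ column.name }}'               as column_name,
          count(*)                          as row_count,
          count(distinct {{ column.name }}) as distinct_count,
          100.0 * count_if({{ column.name }} is null) / nullif(count(*), 0) as null_pct
      from {{ relation }}
      {% if not loop.last %}union all{% endif %}
      {% endfor %}
  {% endmacro %}

Calling {{ create_volumetric_report('dim_actors') }} from an analysis or model then renders a single query whose output can be compared between the DEV and PROD environments.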

Quality tests and documentation

Migrating, testing, documenting: it seemed like a lot, yet we did it!

We started with a solution that had no tests, and we needed to detect and trace anomalies. Describing our models in dbt’s YAML configuration files helped uncover previously invisible bugs, significantly improved data quality, and gave us documentation at the same time!

To ensure data compliance at the silver stage of the medallion architecture, we had to implement data quality tests. These checks validated non-null values and uniqueness! To verify our cross-referenced and transformed data, we implemented business tests and unit tests.

Data quality tests and documentation

Each consumed data model is linked to a configuration file that declares its tests, its governance tags for PII (Personally Identifiable Information), and its documentation, as illustrated below. dbt lets us control data quality by running these tests on every evolution of the code base, with a single command: dbt test. Failed tests are assessed, resolved, or prioritized to maintain data quality.

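The exact configuration isn’t reproduced here; a hypothetical example of such a file (model and column names are illustrative) looks like this:

  version: 2

  models:
    - name: silver_customers
      description: "Cleaned customer records consumed by the Gold layer."
      columns:
        - name: customer_id
          description: "Unique identifier of the customer."
          tests:
            - not_null
            - unique
        - name: email
          description: "Customer contact email."
          tags: ["pii"]
          tests:
            - not_null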

Unit tests

Unit tests make it possible to test the behavior of SQL scripts in isolation by mocking input and output data and simulating the script's behavior.

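The example from the original post isn’t reproduced here; dbt’s native unit tests (available since dbt Core 1.8) follow the pattern below, where the model and rows are purely illustrative:

  unit_tests:
    - name: test_monthly_revenue_aggregation
      model: gold_monthly_revenue
      given:
        - input: ref('silver_orders')
          rows:
            - {order_id: 1, amount: 10, order_date: 2025-01-15}
            - {order_id: 2, amount: 20, order_date: 2025-01-20}
      expect:
        rows:
          - {month_start_date: 2025-01-01, total_amount: 30}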

Custom tests

Custom tests allow us to verify the business rules we implement. They are called just like data quality tests, in the configuration file of the tested model! The test passes if no rows are returned when the following code is executed.

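The test shown in the original post isn’t reproduced here. As a hypothetical sketch (the business rule and the reported_at column are invented for illustration), a generic custom test and its call on the dim_actors model could look like this:

  -- tests/generic/ingredient_is_valid_for_month.sql
  {% test ingredient_is_valid_for_month(model, column_name) %}
  -- illustrative rule: once a month is closed, the ingredient reference must be filled in
  select *
  from {{ model }}
  where date_trunc('month', reported_at) < date_trunc('month', current_date)
    and {{ column_name }} is null
  {% endtest %}

  # models/gold/dim_actors.yml (column name is illustrative)
  version: 2

  models:
    - name: dim_actors
      columns:
        - name: ingredient_id
          tests:
            - ingredient_is_valid_for_month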

All custom tests are called in the tested model’s configuration file: as shown above, ingredient_is_valid_for_month is applied to a column of the dim_actors model in its YAML configuration. For a deeper dive into monitoring dbt tests with Elementary, I recommend this article: Elevate Data Quality Checks with dbt and Elementary Integration Explored.

Our transformed data platform

With Snowflake and dbt, the results exceeded our expectations. We can now develop and test a model in 15 seconds instead of 5 minutes, and without impacting PROD! We’ve increased data bug identification fivefold 📈 with 475 data quality tests! 🍾

Transformations are orchestrated in a modular and reusable way on dbt, allowing developers to have precise documentation and a clear understanding of each data model. 🎉 Now, we know what’s left to do: prioritize, log, and resolve the bugs.

What we learned from this migration

Setting up a collaborative development environment required creating specific roles and dedicated databases for each developer. These adjustments were essential to ensure a smooth migration and avoid conflicts.

And if we had to do it all over again? In hindsight, we would have strengthened role management by creating an omniscient role for developers and assigning each developer their own Snowflake database. 😱

This migration taught us that a successful transformation goes beyond simply moving data. It’s an opportunity to rethink practices, improve team collaboration, and strengthen governance. If you’re considering a similar migration, start with a detailed inventory, document each step, and iterate with consumers to align your understanding with their needs! 👫

Want to learn more or benefit from our expertise for your migration projects? Contact us today!

Article written by Chloé Adnet