This position is no longer available.

Research Data Science Intern - Tabular Data Augmentation

Other
Paris
Salary: Not specified
No remote work

Dataiku

Job description

Headquartered in New York City, Dataiku was founded in Paris in 2013 and achieved unicorn status in 2019. Now, more than 1,000 employees work across the globe in our offices and remotely. Backed by a renowned set of investors and partners including CapitalG, Tiger Global, and ICONIQ Growth, we’ve set out to build the future of AI.

We are looking for a research intern to join Dataiku’s Lab in our Paris office for a 6-month internship on Tabular Data Augmentation (TDA).

Data augmentation has been used successfully to improve the predictive power and generalisation of deep neural networks in visual tasks. However, the difficulty of defining invariances for tabular data, as well as of handling categorical variables, has long limited the use of TDA. Nevertheless, TDA in the latent space, based on generative models such as Variational Auto-Encoders or Generative Adversarial Networks, seems to overcome these difficulties by providing realistic synthetic samples, which are especially useful for augmenting minority classes.

In imbalanced classification tasks, or in settings where some groups are poorly represented, TDA should help improve local performance on those classes or groups. With this internship, we aim to explore generative approaches to TDA and compare them with traditional oversampling techniques in imbalanced classification settings.
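As a rough illustration of the traditional oversampling baseline mentioned above, here is a minimal sketch of SMOTE-style interpolation for numeric features. The choice of SMOTE-style interpolation, the function name `smote_like_oversample`, its parameters, and the toy data are all assumptions made for illustration; the posting does not name a specific oversampling method.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours.

    Hypothetical helper: numeric features only; categorical columns
    would need separate handling.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (illustrative values): four points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(minority, n_new=6)
```

Every synthetic point lies on a segment between two real minority points, which is why this kind of baseline struggles with categorical variables and is a natural point of comparison for generative approaches.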

Several models have been proposed to generate synthetic data, such as TVAE [1], CTGAN [1], CopulaGAN, MixupGAN, and GReaT [2], along with various strategies for augmenting data in both the input space and the latent space (TailCalibration [3]). However, an exhaustive benchmark on imbalanced data is still missing.

To build a trustworthy TDA method, it is critical to provide practitioners with quality metrics that show how similar the synthetic distribution is to the real data. Because synthetic data is generated automatically, it may be based on perturbations that change the class of a sample and thus harm the training of ML models.
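One simple way to picture such a quality metric is a per-column two-sample Kolmogorov-Smirnov statistic between real and synthetic data. This is only a sketch under assumptions: the posting does not specify which metrics will be used, and the function names `ks_statistic` and `column_ks_report` and the toy rows are hypothetical.

```python
import bisect


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def column_ks_report(real_rows, synthetic_rows):
    """Return one KS statistic per numeric column (hypothetical helper)."""
    real_cols = list(zip(*real_rows))
    synth_cols = list(zip(*synthetic_rows))
    return [ks_statistic(r, s) for r, s in zip(real_cols, synth_cols)]


# Toy data (illustrative): the first column is similar, the second is not.
real_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
synthetic_rows = [[1.5, 100.0], [2.5, 200.0], [3.5, 300.0], [4.5, 400.0]]
report = column_ks_report(real_rows, synthetic_rows)
```

On this toy data the report is `[0.25, 1.0]`: the first column is close to the real distribution, while the second is completely disjoint. A marginal check like this does not detect class-changing perturbations or broken relationships between columns, so it would need to be complemented by joint or label-aware metrics.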

This internship focuses on designing optimal strategies to perform data augmentation in imbalanced tabular classification tasks. We will first study state-of-the-art TDA models and how TDA can improve the local accuracy of ML models trained on real and synthetic data. In a second step, the intern will design the best TDA strategy, together with quality metrics that ensure synthetic and real data have similar distributions. This study will be used to recommend a TDA tool to be included in Dataiku’s data science software.

Your mission will be to:

  • Get familiar with the domain
  • Run a thorough benchmark of TDA strategies on datasets with varying degrees of class imbalance
  • Define and validate quality metrics of synthetic/real data similarity
  • Identify/define a robust TDA strategy and check metrics

You are our ideal candidate if:

  • You are eager to get your hands dirty and dive into coding
  • You know that bagging and boosting trees is not about gardening

Ideal technical skills:

  • Good understanding of parametric machine learning algorithms and their optimisation
  • Good experience with Python development; alternatively experience with an object-oriented language such as Java or Scala
  • Some experience working with deep learning frameworks, especially Keras or PyTorch, for both supervised and unsupervised text/tabular learning

References:

[1]: Conditional GAN

[2]: Language models as tabular data generator

[3]: Long-Tail Calibration

About Dataiku:

Dataiku is the platform for Everyday AI, systemizing the use of data for exceptional business results. By making the use of data and AI an everyday behavior, Dataiku unlocks the creativity within individual employees to power collective success at companies of all sizes and across all industries. Don’t get us wrong: we are a tech company building software. Our culture is even pretty geeky! But our driving force is and will always remain people, starting with ours. We consider our employees to be our most precious asset, and we are committed to ensuring that each of them gets the most rewarding, enjoyable, and memorable work experience with us. Fly over to Instagram to learn more about our #dataikulife.

Our practices are rooted in the idea that everyone should be treated with dignity, decency and fairness. Dataiku also believes that a diverse identity is a source of strength and allows us to optimize across the many dimensions that are needed for our success. Therefore, we are proud to be an equal opportunity employer. All employment practices are based on business needs, without regard to race, ethnicity, gender identity or expression, sexual orientation, religion, age, neurodiversity, disability status, citizenship, veteran status or any other aspect which makes an individual unique or protected by laws and regulations in the locations where we operate. This applies to all policies and procedures related to recruitment and hiring, compensation, benefits, performance, promotion and termination and all other conditions and terms of employment. If you need assistance or an accommodation, please contact us at: reasonable-accommodations@dataiku.com
