
Research Internship: Neural Systems for Web Indexing / Search Engines

Internship (3 to 6 months)
Salary: Not specified
Remote work not permitted
Education: > Bac +5 / PhD

Qwant



The Position

Job Description

Context

Information Retrieval (IR) has seen a change of paradigm over the last few years with the advance of Neural IR, which mostly relies on transformer-based dense representations. Many approaches have been developed over a short period of time, with impressive improvements [Lin et al. 2020, Tonellotto 2022] over traditional bag-of-words ranking methods such as BM25.
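For reference, BM25 ranks documents with a term-frequency/inverse-document-frequency formula. The sketch below is a minimal, self-contained implementation over pre-tokenized text; the toy corpus and parameter values are illustrative only, not Qwant's actual ranking code.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Classic BM25 score of one tokenized document against a query.

    k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1.0 + (N - df + 0.5) / (df + 0.5))  # smoothed IDF
        f = tf[term]                                       # term frequency
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
    return score

# Toy corpus of tokenized documents (hypothetical example data).
corpus = [
    ["neural", "retrieval", "models"],
    ["bm25", "is", "a", "ranking", "function"],
    ["the", "cat", "sat", "on", "the", "mat"],
]
score = bm25_score(["bm25", "ranking"], corpus[1], corpus)
```

A document containing the query terms scores higher than one that does not, which is the behavior the neural models discussed below aim to improve upon for semantic matches.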

These improvements come at a cost: neural IR models are not always efficient enough to be used in a live web-scale search engine, where latency is critical and increasing computation time by a few dozen milliseconds can significantly impact revenue. Recent methods such as ColBERTv2 [Santhanam et al. 2021], which build on fast nearest-neighbor indices [Boytsov and Nyberg 2020], reduce the need to repeatedly encode documents for each query: they index dense vector representations directly and compute late interactions at query time, significantly reducing latency compared to the usual cross-encoder setting. Alternatives based on sparse neural IR models [Formal et al. 2021] also allow for fast retrieval, but their latency is much less studied than that of their bag-of-words counterparts.
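To make the late-interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring: each query token embedding is matched against its most similar document token embedding, and these maxima are summed. Random vectors stand in for learned token embeddings; this illustrates the scoring operator only, not ColBERTv2's compressed index or Qwant's stack.

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT-style late interaction: for each query token embedding,
    take the maximum cosine similarity over all document token
    embeddings, then sum over query tokens."""
    # Normalize rows so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                 # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()  # MaxSim, summed over query tokens

# Toy example: random vectors stand in for learned token embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 128))  # 4 query tokens, 128-dim
doc = rng.normal(size=(50, 128))   # 50 document tokens
score = maxsim_score(query, doc)
```

Because documents' token embeddings can be precomputed and stored in a nearest-neighbor index, only the (short) query needs to be encoded at search time, which is what makes this family of models attractive for a latency-critical engine.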

This internship would be conducted within Qwant, a privacy-preserving French search engine that serves over 200 million queries per month in part with its own index and retrieval stack.

Objectives

The first goal of this internship is to study, implement, and evaluate dense Neural IR architectures such as ColBERTv2 [Santhanam et al. 2021] or derived models [Hofstätter et al. 2022] within Vespa, the indexing and retrieval platform used at Qwant. The intern would also be encouraged to explore other types of ranking models, including sparse ones such as SPLADEv2 [Formal et al. 2021].

Provided the preliminary study and models perform well, the next step would be to integrate these approaches into the full Qwant index. The intern would have the unique opportunity to test their implementations on real users by running A/B tests.

More generally, the intern is encouraged to freely experiment with their ideas, to participate in evaluation campaigns such as the TREC Deep Learning track, and/or to write a research publication for an IR (e.g. ECIR, SIGIR) or machine learning (e.g. ICLR) venue.

Bibliography

[Boytsov and Nyberg 2020] Boytsov, Leonid, and Eric Nyberg. 2020. “Flexible Retrieval with NMSLIB and FlexNeuART.” arXiv [cs]. arXiv. http://arxiv.org/abs/2010.14848.

[Formal et al. 2021] Formal, Thibault, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. “SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.” arXiv [cs.IR]. arXiv. http://arxiv.org/abs/2109.10086.

[Hofstätter et al. 2022] Hofstätter, Sebastian, Omar Khattab, Sophia Althammer, Mete Sertkan, and Allan Hanbury. 2022. “Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions Using Enhanced Reduction.” arXiv [cs.IR]. arXiv. http://arxiv.org/abs/2203.13088.

[Lin et al. 2020] Lin, Jimmy, Rodrigo Nogueira, and Andrew Yates. 2020. “Pretrained Transformers for Text Ranking: BERT and Beyond.” arXiv [cs.IR]. arXiv. http://arxiv.org/abs/2010.06467.

[Santhanam et al. 2021] Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2021. “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” arXiv [cs.IR]. arXiv. http://arxiv.org/abs/2112.01488.

[Tonellotto 2022] Tonellotto, Nicola. 2022. “Lecture Notes on Neural Information Retrieval.” arXiv [cs.IR]. arXiv. http://arxiv.org/abs/2207.13443.

Organization

The internship will take place at the Qwant offices with visits to ISIR (remote work is also possible). The internship is supervised by Benjamin Piwowarski from ISIR, and Lara Perinetti and Romain Deveaud from Qwant.

The intern will potentially work with the following tools/technologies:

- Deep Learning libraries (PyTorch, TensorFlow, Jax/Flax, Huggingface ecosystem, etc.)
- Python
- Vespa indexing and retrieval platform (https://vespa.ai/)
- Search engine tools (https://github.com/vespa-engine/pyvespa)
- Git version control
- Jupyter environment

Qwant will provide the intern with a laptop and access to a remote compute server with GPU capabilities.

Team description

You will work in the Core Search team, in charge of the maintenance and development of Qwant’s own Web search engine.

The team is mainly composed of Data Scientists, Data Engineers and backend developers, working on Big Data and Machine Learning, Information Retrieval and NLP (Natural Language Processing) issues.

This year, Qwant offers two research-oriented internships in collaboration with Benjamin Piwowarski, whose work focuses on information retrieval.

This internship focuses on improving the ranking algorithm.


Interview Process

Interviews with the team.
