Qwant
Research Internship: HTML Structure and Document Representation
- Internship (3 to 10 months)
- Paris, Nice, Le Petit-Quevilly
- Education: > Bac +5 / PhD
- Experience: Not specified
The position
Who are they?
Launched in 2013, and designed and developed with passion in France, Qwant is the European search engine that respects its users' privacy. To guarantee the best user experience, Qwant relies on its own index of the Web, on bold teams, and on innovative Machine Learning and Natural Language Processing technologies.
Qwant is built on three fundamental pillars: offering a high-quality Internet search service, offering a responsible vision of the Web, and, at the heart of everything, respecting its users' privacy.
Accordingly, the company does not collect personal data and serves no targeted advertising. Its ranking algorithms guarantee, for every user query, relevant results that are not influenced by the collection of personal data.
Today, the Qwant search engine counts nearly 6 million users worldwide every month and answers more than 2 billion queries.
We offer several services that respect our users' privacy: the Qwant Search engine, also available for 6- to 12-year-olds with Qwant Junior; Qwant Maps for cartography; and the Qwant VIPrivacy extension, which blocks trackers and cookies.
Meet Raphaël, CEO
Meet Stéphanie, Head of Infrastructure
Meet Lara, Data Scientist
Job description
Context
This year, Qwant offers two research-oriented internships in collaboration with Benjamin Piwowarski whose work focuses on information retrieval.
Information Retrieval (IR) models aim to predict which documents, within a potentially huge collection, are relevant to a given user information need (usually expressed as a query). As in many other fields, current IR models are based on transformer architectures.
More precisely, two types of model are now prevalent: (1) representation-based techniques, where the document and query representations are computed separately (as dense or sparse vectors) before a matching function (e.g., the inner product) scores them; and (2) interaction-based techniques, where both the query and the document content are used jointly to compute a relevance score.
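The representation-based family can be sketched in a few lines. The "encoder" below is a hypothetical placeholder (a bag-of-characters projection, not part of any cited work); in a real system it would be a transformer such as BERT. The point is the architecture: documents are encoded once, offline and independently of the query, and relevance is scored with a simple inner product.

```python
import math

DIM = 8  # toy embedding dimension

def encode(text):
    # Hypothetical stand-in for a learned encoder: project characters
    # into a small vector, then L2-normalize it.
    vec = [0.0] * DIM
    for ch in text.lower():
        vec[ord(ch) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def score(query_vec, doc_vec):
    # Inner-product matching, as used in dense retrieval.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

docs = ["qwant search engine", "privacy policy", "cooking recipes"]
doc_vecs = [encode(d) for d in docs]   # computed once, offline
q_vec = encode("web search")           # computed at query time
ranking = sorted(range(len(docs)), key=lambda i: -score(q_vec, doc_vecs[i]))
```

An interaction-based model would instead feed the query and document tokens to the model together, which is typically more accurate but far more expensive, since nothing can be precomputed per document.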
Current research focuses on how to (pre)train these models and on how to better model the task, i.e., how to compute the representation of the document, the query, or both. Improving the quality of these representations is key to building successful transformer-based models for IR, as shown by the best-performing models to date [Gao and Callan, 2021].
Objectives
In the context of Web search, when dealing with web pages, the Document Object Model (DOM) tree represents the document’s structure [Gupta et al., 2003]. Recent work on transformer-based models shows that this structure can be encoded explicitly [Ainslie et al., 2020] or implicitly [Aghajanyan et al., 2021] in the model. One recent approach [Guo et al., 2022] proposes to separate the encoding of the text content from the node structure, before using both representations as a basis for dense ranking.
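To make the DOM structure concrete, here is a minimal sketch (using only Python's standard library, not code from the cited papers) that walks an HTML document and pairs each text fragment with its DOM path. These (path, text) pairs are exactly the kind of structural signal that approaches like Webformer encode separately from the raw text.

```python
from html.parser import HTMLParser

class DomPathExtractor(HTMLParser):
    """Collect (DOM path, text) pairs while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from root to current node
        self.pairs = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))

parser = DomPathExtractor()
parser.feed("<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
# parser.pairs → [('html/body/h1', 'Title'), ('html/body/p', 'Some'),
#                 ('html/body/p/b', 'bold'), ('html/body/p', 'text.')]
```

A structure-aware model can then embed the path (the node structure) and the text fragment with separate encoders before combining them, instead of flattening the page into plain text.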
The goals of this internship are to study how the HTML structure can be leveraged to (1) build better document representations, by exploiting the inner HTML structure and/or the hyperlinks between documents; and (2) enable better pre-training (i.e., without supervision from queries paired with relevant documents).
The intern is encouraged to develop their own ideas, and to publish in (inter)national venues and/or to participate in international evaluation campaigns (such as TREC).
Bibliography
[Gao and Callan, 2021] L. Gao and J. Callan, “Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval,” arXiv:2108.05540 [cs], Aug. 2021 [Online]. Available: http://arxiv.org/abs/2108.05540 .
[Gupta et al., 2003] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, “DOM-based content extraction of HTML documents,” in Proceedings of the twelfth international conference on World Wide Web - WWW ‘03, Budapest, Hungary, 2003, p. 207, doi: 10.1145/775152.775182 [Online]. Available: http://portal.acm.org/citation.cfm?doid=775152.775182 .
[Ainslie et al., 2020] J. Ainslie et al., “ETC: Encoding Long and Structured Inputs in Transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 2020, pp. 268–284, doi: 10.18653/v1/2020.emnlp-main.19 [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-main.19 .
[Aghajanyan et al., 2021] A. Aghajanyan, D. Okhonko, M. Lewis, M. Joshi, H. Xu, G. Ghosh, and L. Zettlemoyer, “HTLM: Hyper-Text Pre-Training and Prompting of Language Models,” arXiv:2107.06955 [cs], Jul. 2021 [Online]. Available: http://arxiv.org/abs/2107.06955 .
[Guo et al., 2022] Y. Guo, Z. Ma, J. Mao, H. Qian, X. Zhang, H. Jiang, Z. Cao, and Z. Dou, “Webformer: Pre-training with Web Pages for Information Retrieval,” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ‘22), 2022, pp. 1502–1512, doi: 10.1145/3477495.3532086 [Online]. Available: https://doi.org/10.1145/3477495.3532086 .
Organization
The internship will take place at the Qwant offices, with visits to ISIR (remote work is also possible). It will be supervised by Benjamin Piwowarski (ISIR) and by Lara Perinetti and Romain Deveaud (Qwant).
The intern will potentially work with the following tools/technologies:
Deep Learning libraries (PyTorch, TensorFlow, JAX/Flax, the Hugging Face ecosystem, etc.)
Python
Vespa indexing and retrieval platform (https://vespa.ai/)
pyvespa, the Python API for Vespa (https://github.com/vespa-engine/pyvespa)
Git version control
Jupyter Environment
Qwant will provide the intern with a laptop and access to a remote compute server with GPUs.
Team description
You will work in the Core Search team, in charge of the maintenance and development of Qwant’s own Web search engine.
The team mainly comprises Data Scientists, Data Engineers, and backend developers working on Big Data, Machine Learning, Information Retrieval, and Natural Language Processing problems.
Of the two research-oriented internships offered this year in collaboration with Benjamin Piwowarski, this one focuses on improving the representation of web documents.