Research Internship - Podcast Topic Modeling

Paris, Île-de-France, France - Intern

Company Description

We are music and tech fans hailing from all over the globe, working to make Deezer the most personal music streaming service. From data scientists to tech experts, artists & labels specialists to marketers, and even in-house music editors, our team is spreading the love for music to over 180 countries. Supporting local and international artists and bringing them closer to their fans is our mission - we believe music is about diversity, multiculturalism and togetherness. Ready to join the team? We're all ears.

Job Description

How about you?

Podcasts are a special type of audio content used for entertainment, information or advertising [1]. They are frequently considered the “spoken” version of blog posts [2]. Allowing users to retrieve speech files effectively from an increasingly large item set and providing automatic podcast recommendations are essential for streaming services [1,2]. In contrast to music recommendation and retrieval, users place much more emphasis on podcast topics than on podcast audio style [1]. Consequently, annotating podcasts with topics is a prerequisite for both search and content-based recommendation.

Previous work has exploited different types of data to automatically infer topic tags: metadata such as the podcast title [3] and the transcribed spoken content [1,2,3], each proving suitable in different user search scenarios [3]. Although transcribed speech is a richer source of data for topic extraction, using only metadata is considered a more economical alternative [3]. Additionally, given the advances in topic modeling on short texts, fostered by the latest NLP trends in language representation in low-dimensional embedding spaces [4,5,6], the question of whether relevant topic representations can be obtained even from short podcast titles is worth further investigation.

The objective of this internship is thus to propose a method for modeling topics on short texts, starting from the latest related literature [4,5,6], and to assess its suitability in the podcast scenario. The proposed method has to be compared with topic modeling that uses the transcribed speech only. Moreover, topic modeling approaches combining both the short metadata and the noisy transcribed content may be investigated. To this end, the intern is expected to review existing literature, propose a solution, design a suitable experimental protocol for evaluation and present the results in a scientific report or article.

The intern is supervised by research scientists and research engineers from the Deezer R&D team, who provide practical and scientific help with the task. The intern is nonetheless encouraged to propose solutions and work autonomously. For data experiments, Deezer provides cutting-edge technology and appropriate computing power.

References

[1] Longqi Yang, Yu Wang, Drew Dunne, Michael Sobolev, Mor Naaman, and Deborah Estrin. 2019. More Than Just Words: Modeling Non-Textual Characteristics of Podcasts. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19). ACM, New York, NY, USA, 276-284. DOI: https://doi.org/10.1145/3289600.3290993

[2] J. Mizuno, J. Ogata, and M. Goto. 2008. A similar content retrieval method for podcast episodes. In Proceedings of the 2008 IEEE Spoken Language Technology Workshop, Goa, 297-300. DOI: https://doi.org/10.1109/SLT.2008.4777899

[3] Besser, J., Larson, M. and Hofmann, K. 2010. Podcast search: user goals and retrieval technologies. Online Information Review, Vol. 34 No. 3, pp. 395-419. https://doi.org/10.1108/14684521011054053

[4] Chenliang Li, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2016. Topic Modeling for Short Texts with Auxiliary Word Embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR '16). ACM, New York, NY, USA, 165-174. DOI: https://doi.org/10.1145/2911451.2911499

[5] Tian Shi, Kyeongpil Kang, Jaegul Choo, and Chandan K. Reddy. 2018. Short-Text Topic Modeling via Non-negative Matrix Factorization Enriched with Local Word-Context Correlations. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1105-1114. DOI: https://doi.org/10.1145/3178876.3186009

[6] Felipe Viegas, Sérgio Canuto, Christian Gomes, Washington Luiz, Thierson Rosa, Sabir Ribas, Leonardo Rocha, and Marcos André Gonçalves. 2019. CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19). ACM, New York, NY, USA, 753-761. DOI: https://doi.org/10.1145/3289600.3291032 

Qualifications

What we are looking for:

  • Master's or PhD student with a background in Computer Science / Computational Linguistics / Applied Mathematics / Statistics
  • Strong knowledge of natural language processing, applied machine learning and data mining
  • Good programming skills for data processing and experimentation (Python preferred, but we are open to other technologies too)
  • Creativity and autonomy

Additional Information

Life @ Deezer HQ:

> Start-up environment with an at-home vibe and outdoor space
> Kitchen stocked with free drinks and snacks daily
> Friday drinks & seasonal parties
> Gym access, plus yoga, pilates and boxing classes
> English and French language courses
> Hackathons & meetups

We are an equal opportunity employer.