Transcribing YouTube Videos for LLM Training

May 7, 2024 | by

Pleias, a French startup that builds energy-efficient large language models (LLMs) for information-sensitive industries, has released YouTube-Commons, a dataset of more than two million copyright-free video transcripts. Each entry contains the full transcript of a YouTube video, and at nearly 30 billion words the dataset is one of the largest collections of conversational data. It gives LLM developers a large, freely available source of training data.

Get the data.

Image credit: Alexander Shatov

