Transcribing YouTube Videos for LLM Training

May 7, 2024

Pleias, a French startup that builds energy-efficient large language models (LLMs) for information-sensitive industries, has released YouTube-Commons, a dataset of more than two million copyright-free video transcripts. The dataset includes the full transcript of each YouTube video and, at nearly 30 billion words, is one of the largest collections of conversational data. It gives LLM developers a large amount of freely available training data.

Image credit: Alexander Shatov


