This article provides a short introduction to the pipeline used to create the data to train large language models (LLMs) such as LLaMA using Common Crawl (CC).