β-reduction is an independent lab. We build large-scale training datasets for AI researchers and companies.
Most LLMs, including GPT-3 and LLaMA, begin pretraining on the Common Crawl dataset. Sourced from the open internet, it forms the foundational layer of training and provides the model with its broad knowledge of the world.
However, Common Crawl has two limitations: 1) it is not updated frequently, with only five releases in 2023; and 2) it contains about 2 billion pages, roughly 1% of Google's estimated 200 billion indexed pages, which leaves a significant coverage gap.
Our initial project aims to replicate Common Crawl while addressing both issues: we have built a dataset that is larger than Common Crawl, and we plan to update it weekly.
We're releasing the initial dataset for free. You can download a sample of the data here. Please contact us for the full version.
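As one illustration: if the sample ships as gzipped JSONL (one crawled page per line, with fields such as url and text), a few lines of Python are enough to peek at the first records. The URL and field names below are placeholders for whatever the actual download provides, not the real locations:

    import gzip
    import json
    import urllib.request

    # Placeholder URL; substitute the real sample link from the site.
    SAMPLE_URL = "https://example.com/beta-reduction/sample.jsonl.gz"

    # Stream the gzipped JSONL over HTTP and print the first five records.
    with urllib.request.urlopen(SAMPLE_URL) as response:
        with gzip.open(response, mode="rt", encoding="utf-8") as lines:
            for i, line in enumerate(lines):
                page = json.loads(line)
                # Assumed schema: each record has "url" and "text" fields.
                print(page["url"], len(page["text"]))
                if i >= 4:
                    break

Streaming rather than downloading the whole file first keeps the memory footprint small, which matters once you move from the sample to a Common Crawl-scale dataset.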