Large language model data pipelines and Common Crawl (WARC/WAT/WET) | Christian S. Perone

blog.christianperone.com

lautan to Machine Learning - Learning/Language Models@lemmy.intai.tech · English · 10 months ago

cross-posted to:
[email protected]

This article gives a short introduction to the pipeline that turns Common Crawl (CC) data into training data for large language models (LLMs) such as LLaMA.
