Corpus Used by Large Language Models (LLMs) for Different Applications

Large language models (LLMs) are trained on massive datasets of text and code. The corpus that an LLM is trained on can have a significant impact on its performance on different applications.

For example, LLMs trained on a corpus of scientific text and data perform better on scientific tasks such as citation prediction, scientific question answering (QA), and molecular property prediction, while LLMs trained on a corpus of code are better suited to generating code and completing programming tasks.

Alpaca: Unlocking Text Generation and Classification:

Alpaca, a member of the LLaMA family, is a decoder-based model fine-tuned from the 7B LLaMA model and has been evaluated extensively across multiple text generation and classification tasks. Its fine-tuning corpus consists of 52K instruction-following examples generated with a self-instruct mechanism from 175 human-written instruction-output pairs. Developed at Stanford, Alpaca showcases the potential of fine-tuning language models for specific applications.
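
To make that corpus concrete, here is a minimal sketch of one Alpaca-style instruction record and the prompt it is rendered into before fine-tuning. The field names (instruction/input/output) and the prompt template follow the released 52K-example format; the record itself is illustrative.

```python
# Minimal sketch of an Alpaca-style instruction-following record and the
# prompt it is rendered into before fine-tuning (field names assume the
# released 52K-example JSON format: instruction / input / output).
example = {
    "instruction": "Classify the sentiment of the following review.",
    "input": "The battery dies within two hours of normal use.",
    "output": "Negative",
}

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def render_prompt(record: dict) -> str:
    """Turn one instruction record into the text the model is trained on."""
    return PROMPT_WITH_INPUT.format(**record)

if __name__ == "__main__":
    # The fine-tuning target is the prompt followed by the expected output.
    print(render_prompt(example) + example["output"])
```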

Anthropic Assistant: Optimizing Alignment through Fine-Tuning:

The Anthropic Assistant, a variant of the GPT model, focuses on improving alignment through fine-tuning and prompting. This model includes several optimized versions for different tasks, ranging from general dialogue systems to code assistants. It leverages a corpus of 400B tokens from filtered Common Crawl and Books datasets, supplemented by dialogue preference datasets for reinforcement learning from human feedback (RLHF).
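
As a rough illustration of how dialogue preference data is consumed, the sketch below computes the pairwise (Bradley-Terry style) loss commonly used to train a reward model on chosen/rejected response pairs. The `score` function is a hypothetical stand-in, not Anthropic's actual reward model.

```python
import math

# Sketch of the pairwise loss used to train a reward model on dialogue
# preference data: the model should score the human-preferred ("chosen")
# response above the rejected one. `score` is a hypothetical stand-in for
# a learned reward model.
def score(prompt: str, response: str) -> float:
    # Placeholder: a real reward model would be a fine-tuned transformer.
    return float(len(set(prompt.split()) & set(response.split())))

def preference_loss(prompt: str, chosen: str, rejected: str) -> float:
    """Negative log-sigmoid of the score margin (Bradley-Terry style)."""
    margin = score(prompt, chosen) - score(prompt, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

if __name__ == "__main__":
    print(preference_loss(
        "How do I reset my password?",
        "Go to Settings, choose Security, then follow the reset link.",
        "I don't know.",
    ))
```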

BERT: Advancing General Language Understanding:

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-based model that introduced the concept of masked language modeling (MLM) and next sentence prediction (NSP). Developed by Google, BERT aimed to advance general language understanding and has since become a fundamental model for a wide range of language applications. It was trained on the Toronto Book Corpus and Wikipedia, comprising 3.3B tokens.
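
The corruption step behind MLM can be sketched in a few lines. The 15% selection rate and the 80/10/10 mask/replace/keep split follow the BERT paper; the toy vocabulary and whitespace tokenization are purely illustrative.

```python
import random

# Sketch of BERT's masked language modeling (MLM) corruption step: 15% of
# tokens are selected; of those, 80% become [MASK], 10% are replaced by a
# random token, and 10% are left unchanged. Vocabulary here is a toy list.
VOCAB = ["the", "corpus", "shapes", "model", "behavior", "[MASK]"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # predict the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)       # keep the token unchanged
        else:
            labels.append(None)             # no loss computed at this position
            corrupted.append(tok)
    return corrupted, labels

if __name__ == "__main__":
    print(mask_tokens("the corpus shapes model behavior".split()))
```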

BLOOM: Full Attention for Enhanced Performance:

BLOOM, a decoder-based model from the GPT family, differentiates itself from GPT-3 by employing full attention instead of sparse attention mechanisms. Similar to GPT-3, BLOOM serves various NLP applications. It has 176B parameters and was trained on a multilingual dataset of roughly 366B tokens, the ROOTS corpus, which spans text in dozens of natural languages as well as source code in more than a dozen programming languages.
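
The difference between full and sparse attention is easiest to see as attention masks. The sketch below builds a full causal mask (the BLOOM-style pattern) next to a locally banded causal mask of the kind GPT-3 interleaves with dense layers; the window size is arbitrary.

```python
# Sketch contrasting the attention patterns mentioned above: BLOOM's decoder
# uses full (dense) causal attention, whereas GPT-3 interleaves dense layers
# with sparse, locally banded ones. Both masks below are causal; the sparse
# one additionally restricts each token to a fixed-size local window.
def full_causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def banded_causal_mask(n, window=2):
    return [[1 if i - window < j <= i else 0 for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    for row in full_causal_mask(5):
        print(row)
    print()
    for row in banded_causal_mask(5, window=2):
        print(row)
```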

ChatGPT: Empowering Conversational Agents:

ChatGPT, derived from GPT-3.5 (also known as GPT-3 davinci-003), leverages reinforcement learning from human feedback (RLHF) for fine-tuning. Developed by OpenAI, ChatGPT extends beyond being a plain language model and includes memory store and retrieval capabilities similar to BlenderBot3. It builds on the same datasets as GPT-3, supplemented with human preference data for RLHF, enabling stronger performance as a conversational agent.
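
A common way to describe the RLHF fine-tuning objective is a reward-model score minus a KL penalty that keeps the tuned policy close to the supervised starting point. The sketch below shows that combination with illustrative numbers; the `beta` coefficient and the log-probabilities are stand-ins, not OpenAI's actual values.

```python
# Sketch of the reward signal typically optimized during RLHF fine-tuning:
# the reward model's score minus a KL penalty that keeps the tuned policy
# close to the original supervised model. Probabilities and the beta
# coefficient below are illustrative stand-ins.
def rlhf_reward(reward_model_score: float,
                policy_logprob: float,
                reference_logprob: float,
                beta: float = 0.02) -> float:
    """R = r_RM - beta * (log pi(y|x) - log pi_ref(y|x))."""
    kl_term = policy_logprob - reference_logprob
    return reward_model_score - beta * kl_term

if __name__ == "__main__":
    # A response the reward model likes, but which drifts from the reference.
    print(rlhf_reward(reward_model_score=1.3,
                      policy_logprob=-12.0,
                      reference_logprob=-15.0))
```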

Chinchilla: Optimized Performance with Reduced Model Size:

Chinchilla, similar to Gopher and GPT-3, employs optimizations that reduce model size and training/inference cost without compromising performance. It has 70B parameters and was trained on roughly 1.4T tokens from MassiveText, including web pages, books, GitHub, news, C4, and Wikipedia, delivering equal or superior performance to larger models while reducing resource requirements.
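
The trade-off behind Chinchilla can be sketched with the commonly cited rule of thumb of roughly 20 training tokens per parameter and the standard 6 * N * D estimate of training FLOPs. Both figures are approximations, used here only to show how a 70B-parameter model pairs with a corpus of about 1.4T tokens.

```python
# Sketch of the compute-optimal trade-off behind Chinchilla: for a fixed
# compute budget, parameters and training tokens should grow together.
# The ~20 tokens-per-parameter figure is the commonly cited rule of thumb
# from the Chinchilla analysis, used here purely as an illustration.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6 * N * D estimate for dense transformer training FLOPs."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    n_params = 70e9                       # Chinchilla's parameter count
    n_tokens = compute_optimal_tokens(n_params)
    print(f"tokens: {n_tokens:.2e}")      # ~1.4e12, matching its corpus
    print(f"FLOPs:  {approx_training_flops(n_params, n_tokens):.2e}")
```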

DALL-E: From Text to Images:

DALL-E, a decoder-based model developed by OpenAI, specializes in the generation of images from text. Combining a discrete variational auto-encoder (dVAE) and a GPT-3 variation, DALL-E learns a visual codebook to transform textual descriptions into images. Trained on 250 million text-image pairs from the internet, DALL-E demonstrates the potential for bridging the gap between text and visual content.
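
The sketch below shows the core framing: text tokens and dVAE image-codebook tokens are concatenated into a single sequence that a decoder models autoregressively. The sizes (up to 256 text positions, a 32x32 grid of image tokens, an 8,192-entry codebook) follow the published description, but the token ids here are random toy values.

```python
import random

# Sketch of how DALL-E frames text-to-image generation as one autoregressive
# sequence: text tokens come first, followed by image tokens drawn from the
# dVAE's visual codebook. Sizes below (256 text positions, a 32x32 grid of
# image tokens, an 8192-entry codebook) follow the published description but
# are used here only to shape the toy example.
TEXT_VOCAB_SIZE = 16384
CODEBOOK_SIZE = 8192
MAX_TEXT_TOKENS = 256
IMAGE_GRID = 32 * 32

def build_sequence(text_token_ids, image_token_ids):
    """Concatenate text and image tokens; image ids are offset past the
    text vocabulary so the decoder sees a single unified token space."""
    assert len(text_token_ids) <= MAX_TEXT_TOKENS
    assert len(image_token_ids) == IMAGE_GRID
    offset_image = [TEXT_VOCAB_SIZE + i for i in image_token_ids]
    return text_token_ids + offset_image

if __name__ == "__main__":
    rng = random.Random(0)
    text = [rng.randrange(TEXT_VOCAB_SIZE) for _ in range(12)]
    image = [rng.randrange(CODEBOOK_SIZE) for _ in range(IMAGE_GRID)]
    print(len(build_sequence(text, image)))   # 12 + 1024 positions
```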

DALL-E-2: Combining Vision and Language:

DALL-E-2 merges the capabilities of the CLIP and GLIDE models by pairing the CLIP encoder with a diffusion decoder, similar to GLIDE. This encoder-decoder pairing excels at generating images from natural-language prompts by fusing visual and textual understanding. The training data for DALL-E-2 is a combination of the DALL-E and CLIP datasets, allowing it to produce accurate, detailed images from descriptive text.
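
The two-stage pipeline can be sketched as plain function composition: a prior maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder renders pixels from that embedding. Every function below is a named placeholder, not a real API.

```python
from typing import List

# Sketch of the two-stage pipeline behind DALL-E-2: a prior maps the CLIP
# text embedding to a CLIP image embedding, and a diffusion decoder turns
# that image embedding into pixels. All functions are placeholders.
def clip_text_encoder(prompt: str) -> List[float]:
    # Placeholder for CLIP's text encoder producing a joint-space embedding.
    return [float(ord(c) % 7) for c in prompt][:8]

def prior(text_embedding: List[float]) -> List[float]:
    # Placeholder for the prior that predicts a CLIP image embedding
    # conditioned on the text embedding.
    return [x * 0.5 for x in text_embedding]

def diffusion_decoder(image_embedding: List[float]) -> str:
    # Placeholder for the diffusion decoder that renders the final image.
    return f"<image rendered from embedding of length {len(image_embedding)}>"

def generate(prompt: str) -> str:
    return diffusion_decoder(prior(clip_text_encoder(prompt)))

if __name__ == "__main__":
    print(generate("an astronaut riding a horse in a photorealistic style"))
```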

Galactica: Empowering Scientific Domains:

Galactica, a transformer-based decoder-only model, is tailored for scientific tasks. It introduces modifications to the transformer architecture and leverages special tokens for working memory, citations, genetic data, and other biology-related tasks. With a vast training corpus of 120B tokens from open-access scientific text and data sources, Galactica enables various scientific applications, including citation prediction, mathematical reasoning, and entity extraction.
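
As an illustration of those special tokens, the sketch below wraps a citation and a worked derivation in [START_REF]/[END_REF] and <work>/</work> markers. The token names follow the Galactica paper's description; the wrapping helpers themselves are hypothetical.

```python
# Sketch of how Galactica-style special tokens might wrap scientific text so
# the model can learn citation and working-memory behavior. The token names
# ([START_REF]/[END_REF], <work>/</work>) follow the paper's description;
# the wrapping helpers themselves are illustrative.
def wrap_citation(reference_title: str) -> str:
    return f"[START_REF]{reference_title}[END_REF]"

def wrap_working_memory(reasoning_steps: str) -> str:
    return f"<work>{reasoning_steps}</work>"

if __name__ == "__main__":
    sentence = (
        "Scaling laws for neural language models were characterized in "
        + wrap_citation("Scaling Laws for Neural Language Models")
        + "."
    )
    derivation = wrap_working_memory("area = pi * r**2; r = 3; area = 28.27")
    print(sentence)
    print(derivation)
```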

This is just a small sample of the many LLMs that are available. The corpus that an LLM is trained on can vary depending on the specific application that it is intended for. By choosing the right corpus, LLMs can be used to achieve state-of-the-art results on a wide range of tasks.
