Corpus Creator

Datasets and Corpus Creator

📁 From random files to a Hugging Face dataset in a single step 📁

Corpus Creator is a tool designed to help you easily convert a collection of text files into a dataset suitable for various natural language processing (NLP) tasks.

In particular, the app focuses on splitting texts into chunks of a specified size and overlap. This is useful for preparing data for synthetic data generation pipelines or annotation tasks.
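To make the two parameters concrete, here is a minimal sketch of chunking with a fixed size and overlap. It is a simplified, word-based stand-in for illustration only: LlamaIndex's splitter is sentence-aware and counts tokens rather than words, and the function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more rows for the same input.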

See an example dataset created using this tool starting from a collection of plain text files.

The resulting text chunks are stored in a dataset that can be previewed and uploaded to the Hugging Face Hub for easy sharing and access by the community. The chunking is done using LlamaIndex's SentenceSplitter class.
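The chunking step could look roughly like the sketch below. It assumes the `llama-index-core` package, whose `SentenceSplitter` accepts `chunk_size` and `chunk_overlap` and exposes `split_text`. The helper names `make_splitter` and `build_rows`, the `file` and `chunk` column names, and the default values are illustrative assumptions, not the app's actual code.

```python
from typing import Protocol


class TextSplitter(Protocol):
    """Anything with a split_text method, such as LlamaIndex's SentenceSplitter."""

    def split_text(self, text: str) -> list[str]: ...


def make_splitter(chunk_size: int = 256, chunk_overlap: int = 32) -> TextSplitter:
    # Imported lazily so the rest of the sketch works without LlamaIndex installed.
    from llama_index.core.node_parser import SentenceSplitter

    return SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)


def build_rows(files: dict[str, str], splitter: TextSplitter) -> list[dict[str, str]]:
    """Turn a {filename: text} mapping into one row per chunk (hypothetical schema)."""
    rows = []
    for name, text in files.items():
        for chunk in splitter.split_text(text):
            rows.append({"file": name, "chunk": chunk})
    return rows
```

With LlamaIndex installed, `build_rows({"notes.txt": text}, make_splitter())` would return rows that keep track of which file each chunk came from.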

Usage:

  • Login: Start by logging in to your Hugging Face account using the provided login button.

  • Set Parameters: Customize the chunk size and overlap according to your requirements.

  • Upload Files: Use the upload button to load file(s) for processing.

  • Preview Dataset: View the created dataset in a dataframe format before uploading it to the Hugging Face Hub.

  • Upload to Hub: Optionally, specify the Hub ID and choose whether to make the dataset private before pushing it to the Hugging Face Hub.
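The last two steps can be approximated in code. This sketch assumes the `datasets` library and that you are already authenticated with the Hub (for example via `huggingface-cli login`). The names `to_columns` and `push_chunks` and the `file`/`chunk` columns are hypothetical, and the app may build its dataset differently.

```python
def to_columns(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Convert row dicts into the column-oriented mapping that Dataset.from_dict expects.

    Assumes every row has the same keys.
    """
    columns: dict[str, list[str]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def push_chunks(rows: list[dict[str, str]], repo_id: str, private: bool = False) -> None:
    # Imported lazily so to_columns can be used without `datasets` installed.
    from datasets import Dataset

    dataset = Dataset.from_dict(to_columns(rows))
    print(dataset.to_pandas().head())  # preview as a dataframe before uploading
    dataset.push_to_hub(repo_id, private=private)
```

Calling `push_chunks(rows, "your-username/your-dataset", private=True)` would upload the chunks to a private dataset repository, where the repo id shown is a placeholder.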

This Gradio app (https://lnkd.in/eKcHfyPs) takes you from your local files to a chunked Hugging Face dataset (via LlamaIndex) in one step. The goal of the tool is to make it quicker and easier to turn local files into a Hugging Face dataset that is ready for ML tasks. It is well suited to building datasets for:

  • Synthetic data pipelines

  • Annotation

  • RAG

  • Other ML tasks that start from a Hugging Face dataset
