Last updated
📁 From random files to a Hugging Face dataset in a single step 📁
Corpus Creator is a tool designed to help you easily convert a collection of text files into a dataset suitable for various natural language processing (NLP) tasks.
In particular, the app focuses on splitting texts into chunks of a specified size and overlap. This can be useful for preparing data for synthetic data generation pipelines or annotation tasks.
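To make "chunk size" and "overlap" concrete, here is a minimal sketch of a sliding-window splitter. It is only an illustration, not the app's actual code: the function name `chunk_words` is hypothetical, and counting in whitespace-separated words is an assumption made to keep the example dependency-free (the app itself may measure chunks differently).

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words. Word-based counting is an
    assumption made for this illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so the last chunk is never made up of overlap alone.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

With a chunk size of 4 and an overlap of 1, each chunk starts with the last word of the previous one. A larger overlap keeps more context across chunk boundaries at the cost of more duplicated text.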
Login: Start by logging in to your Hugging Face account using the provided login button.
Set Parameters: Customize the chunk size and overlap according to your requirements.
Upload Files: Use the upload button to load file(s) for processing.
Preview Dataset: View the created dataset in a dataframe format before uploading it to the Hugging Face Hub.
Upload to Hub: Optionally, specify the Hub ID and choose whether to make the dataset private before pushing it to the Hugging Face Hub.
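For readers who want to see roughly what these steps amount to outside the app, the following sketch reads text files, chunks them, builds rows, and optionally pushes them to the Hub. It is a rough approximation under stated assumptions, not the app's implementation: the helper names (`chunk_text`, `build_rows`, `push_rows`) and the column names (`file`, `chunk_id`, `text`) are hypothetical, and the word-based chunker is a simple stand-in for the splitter the app uses. The Hub step relies on the `datasets` library's `Dataset.from_list` and `push_to_hub`, and assumes you are already logged in to Hugging Face.

```python
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Hypothetical stand-in chunker: fixed-size word windows with overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already covers the end of the text
    return chunks


def build_rows(paths, chunk_size: int = 200, overlap: int = 20) -> list[dict]:
    """Read each file and return one row per chunk (column names are assumptions)."""
    rows = []
    for path in paths:
        path = Path(path)
        text = path.read_text(encoding="utf-8")
        for index, chunk in enumerate(chunk_text(text, chunk_size, overlap)):
            rows.append({"file": path.name, "chunk_id": index, "text": chunk})
    return rows


def push_rows(rows: list[dict], repo_id: str, private: bool = True) -> None:
    """Push rows to the Hugging Face Hub; requires `datasets` and a valid login."""
    from datasets import Dataset  # imported lazily so the rest runs without it

    dataset = Dataset.from_list(rows)
    # dataset.to_pandas() gives a dataframe view for a quick preview before uploading.
    dataset.push_to_hub(repo_id, private=private)
```

A typical call would be `rows = build_rows(["notes.txt", "report.txt"], chunk_size=200, overlap=20)` followed by `push_rows(rows, "your-username/your-dataset")`, where the file names and the Hub ID are placeholders.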
See an example dataset created using this tool, starting from a collection of plain text files.
The resulting text chunks are stored in a dataset that can be previewed and uploaded to the Hugging Face Hub for easy sharing and access by the community. The chunking is done using Llama-index's classes.
This Gradio app takes you from your local files to a chunked Dataset in one step! The goal of the tool is to make it quicker and easier to turn local files into a Hugging Face Dataset that is ready for ML tasks. Perfect for building datasets for:
- synthetic data pipelines
- annotation
- RAG
- other ML tasks that start from an HF dataset