Corpus Creator

Datasets and Corpus Creator

📁 From random files to a Hugging Face dataset in a single step 📁

Corpus Creator is a tool designed to help you easily convert a collection of text files into a dataset suitable for various natural language processing (NLP) tasks.

In particular, the app focuses on splitting texts into chunks of a specified size and overlap. This can be useful for preparing data for synthetic data generation pipelines or annotation tasks.

See an example dataset created using this tool starting from a collection of plain text files.

The resulting text chunks are stored in a dataset that can be previewed and uploaded to the Hugging Face Hub for easy sharing and access by the community. The chunking is done using LlamaIndex's SentenceSplitter class.
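To make "chunk size" and "overlap" concrete, here is a minimal sketch of a sliding-window splitter. It is a simplified stand-in and not what the app runs: LlamaIndex's SentenceSplitter counts tokens and tries to respect sentence boundaries, whereas this version counts whitespace-separated words. The function name `chunk_words` and its defaults are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Each window repeats the last `overlap` words of the previous one.
    Simplified illustration only: words stand in for tokens, and
    sentence boundaries are ignored.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=4` and `overlap=1`, the text "a b c d e f g h i j" yields three chunks, and each chunk starts with the last word of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of more duplicated text in the dataset.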

Usage:

  • Login: Start by logging in to your Hugging Face account using the provided login button.

  • Set Parameters: Customize the chunk size and overlap according to your requirements.

  • Upload Files: Use the upload button to load file(s) for processing.

  • Preview Dataset: View the created dataset in a dataframe format before uploading it to the Hugging Face Hub.

  • Upload to Hub: Optionally, specify the Hub ID and choose whether to make the dataset private before pushing it to the Hugging Face Hub.
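For readers who prefer to script the last two steps, they can be approximated as below. This is a sketch under assumptions, not the app's actual code: the column names (`file`, `chunk_id`, `text`) and the helper names `build_rows` and `push_rows` are hypothetical, and the dataset the app produces may be laid out differently. It assumes the `datasets` library is installed and that you are already authenticated with the Hugging Face Hub.

```python
def build_rows(chunks_by_file: dict[str, list[str]]) -> list[dict]:
    """Flatten {filename: [chunk, ...]} into one row per chunk.

    The column names are illustrative assumptions.
    """
    rows = []
    for filename, chunks in chunks_by_file.items():
        for index, chunk in enumerate(chunks):
            rows.append({"file": filename, "chunk_id": index, "text": chunk})
    return rows


def push_rows(rows: list[dict], hub_id: str, private: bool = True):
    """Build a Dataset from the rows and push it to the Hub.

    `hub_id` is the target repository, e.g. "your-username/your-dataset".
    Needs network access and a logged-in Hugging Face account.
    """
    from datasets import Dataset  # imported lazily so build_rows works without it

    dataset = Dataset.from_list(rows)
    dataset.push_to_hub(hub_id, private=private)
    return dataset
```

Calling `build_rows` first lets you inspect the rows locally (the equivalent of the preview step) before `push_rows` sends anything to the Hub.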

This Gradio app (https://lnkd.in/eKcHfyPs) takes you from your local files to a chunked Hugging Face Dataset (via LlamaIndex) in one step! The goal of the tool is to make it quicker and easier to turn local files into a Hugging Face Dataset that is ready for ML tasks. Perfect for building datasets for:

  • synthetic data pipelines

  • annotation

  • RAG

  • other ML tasks that start from an HF dataset
