# Corpus Creator

## Datasets and Corpus Creator

*📁 From random files to a Hugging Face dataset in a single step 📁*

{% embed url="<https://huggingface.co/spaces/davanstrien/corpus-creator>" %}

Corpus Creator is a tool that converts a collection of text files into a dataset suitable for a range of natural language processing (NLP) tasks.

In particular, the app focuses on splitting texts into chunks of a specified size and overlap. This can be useful for preparing data for synthetic data generation pipelines or annotation tasks.
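To make "chunk size" and "overlap" concrete, here is a minimal sketch that counts in whitespace-separated words. It is only an illustration of the two parameters: the app itself does not chunk this way, and the function name `chunk_words` and its defaults are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Each chunk repeats the last `chunk_overlap` words of the previous one.
    Illustrative only: words stand in for whatever unit a real splitter counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


example = chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(example)  # ['a b c d', 'd e f g', 'g h i j']
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of more, partly redundant, rows in the final dataset.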

See an [example dataset](https://huggingface.co/datasets/davanstrien/MOH-Bethnal-Green) created using this tool starting from a collection of plain text files.

The resulting text chunks are stored in a dataset that can be previewed and uploaded to the Hugging Face Hub for easy sharing and access by the community. The chunking is done using LlamaIndex's [`SentenceSplitter`](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/?h=sentencesplitter#sentencesplitter) class.

#### Usage:

* Login: Start by logging in to your Hugging Face account using the provided login button.
* Set Parameters: Customize the chunk size and overlap according to your requirements.
* Upload Files: Use the upload button to load file(s) for processing.
* Preview Dataset: View the created dataset in a dataframe format before uploading it to the Hugging Face Hub.
* Upload to Hub: Optionally, specify the Hub ID and choose whether to make the dataset private before pushing it to the Hugging Face Hub.
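For readers who prefer a script, the steps above can be approximated in Python. This is a rough sketch under stated assumptions, not the app's actual source: the helper names (`rows_from_files`, `sentence_splitter`, `push_rows`) and the column names are hypothetical. The two helpers that touch third-party libraries import them lazily, so they need `llama-index` and `datasets` installed, and pushing requires being logged in to the Hub.

```python
from pathlib import Path
from typing import Callable, Iterable


def rows_from_files(
    paths: Iterable[Path], split: Callable[[str], list[str]]
) -> list[dict]:
    """Read each text file, split it, and emit one row per chunk."""
    rows = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for i, chunk in enumerate(split(text)):
            # Column names are illustrative, not the app's schema.
            rows.append({"file": Path(path).name, "chunk_index": i, "text": chunk})
    return rows


def sentence_splitter(chunk_size: int, chunk_overlap: int) -> Callable[[str], list[str]]:
    """Return the split_text method of a LlamaIndex SentenceSplitter."""
    # Lazy import: only needed when this helper is called.
    from llama_index.core.node_parser import SentenceSplitter

    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_text


def push_rows(rows: list[dict], hub_id: str, private: bool = True) -> None:
    """Build a Hugging Face Dataset from the rows and push it to the Hub."""
    # Lazy import: only needed when this helper is called.
    from datasets import Dataset

    Dataset.from_list(rows).push_to_hub(hub_id, private=private)
```

A run might then look like `rows = rows_from_files(Path("my_texts").glob("*.txt"), sentence_splitter(512, 50))` followed by `push_rows(rows, "your-username/your-dataset")`, where the folder, the parameter values, and the Hub ID are placeholders.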

This Gradio app (<https://lnkd.in/eKcHfyPs>) takes you from your local files to a chunked [Hugging Face](https://www.linkedin.com/company/huggingface/) Dataset (via [LlamaIndex](https://www.linkedin.com/company/llamaindex/)) in one step!

The goal of the tool is to make it quicker and easier to turn local files into a Hugging Face Dataset that is ready for ML tasks. It is well suited to building datasets for:

* synthetic data pipelines
* annotation
* RAG
* other ML tasks that start from an HF dataset


