Text Annotation

Text annotation for NLP and document processing: a complete guide

Organisations widely use text annotation to prepare the training data that machine learning models need to solve NLP tasks. Learn more about text annotation in machine learning and how to employ it for better productivity!

What is Text Annotation in Machine Learning?

Simply put, text annotation in machine learning (ML) is the process of assigning labels to a digital file or document and its content. Labeling a text can consist of assigning tags to text elements such as keywords, sentences, and paragraphs, or simply classifying the text based on its content (i.e. text classification).

Annotated text is used to train various NLP technologies, such as neural machine translation (NMT) programs, automated Q&A (question and answer) platforms, smart chatbots, sentiment analysis, text-to-speech synthesizers, and automatic speech recognition (ASR) tools, among other related projects. These technologies can streamline the activities and transactions of many organizations across different industries.

If you want to learn more about the history of NLP in computer science, you can read our article here.

What are the different types of text annotations?

Text / Document classification

Text and document classification consists of assigning one or more labels to a single text or a full document.

Examples:

  1. Classifying emails as spam or regular emails

  2. Doing sentiment analysis on tweets

  3. Labeling legal documents based on their content (legal notices, agreements, bonds, …)

Example of sentiment analysis on an Amazon review

Bill of lading being classified
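
To make the idea concrete, here is a minimal sketch of what a classification label can look like as data. The `ClassificationAnnotation` class, its field names, the sample texts, and the label sets are all assumptions made for this illustration, not the schema of any particular annotation tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassificationAnnotation:
    """A text plus the label(s) assigned to it (hypothetical schema)."""
    text: str
    labels: List[str] = field(default_factory=list)


def is_valid(annotation, allowed_labels):
    """True if at least one label is set and every label is in the label set."""
    return bool(annotation.labels) and set(annotation.labels) <= set(allowed_labels)


# Single-label example: spam detection on an email (made-up text).
email = ClassificationAnnotation(
    text="You have won a free cruise! Click here to claim your prize.",
    labels=["spam"],
)

# Multi-label example: one document carrying several attributes at once.
legal_doc = ClassificationAnnotation(
    text="Notice: the agreement between the parties is terminated.",
    labels=["legal_notice", "agreement"],
)

print(is_valid(email, {"spam", "regular"}))
print(is_valid(legal_doc, {"legal_notice", "agreement", "bond"}))
```

Checking labels against a fixed label set is one simple way to catch typos or out-of-scope labels before the data reaches a model.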

Named entity recognition

At a high level, named entity recognition (NER) is the task of identifying named entities within a text and assigning each one a predefined category. Common categories for this type of text annotation include names of organizations, locations, and persons, numerical values, and temporal expressions such as months, times, and days of the week. Depending on the type of NER performed, categories such as paragraph, title, and content can also be used.

NER performed on a short article from Reuters
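
A common way to store NER labels is as character spans over the text. The sketch below assumes that representation; the `EntitySpan` class, the `annotate` helper, the category names, and the sample sentence are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int  # index of the first character (inclusive)
    end: int    # index one past the last character (exclusive)
    label: str  # predefined category


def annotate(text, mentions):
    """Build spans from (surface string, label) pairs, using the first occurrence of each."""
    spans = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        spans.append(EntitySpan(start, start + len(surface), label))
    # Return the spans in the order they appear in the text.
    return sorted(spans, key=lambda s: s.start)


# Invented sentence, used only for illustration.
text = "Acme Corp opened an office in Paris on Monday."
spans = annotate(
    text,
    [("Paris", "LOCATION"), ("Monday", "DAY_OF_WEEK"), ("Acme Corp", "ORGANIZATION")],
)

for span in spans:
    print(text[span.start:span.end], span.label)
```

Storing offsets rather than copies of the entity text keeps each label tied to an exact position, which matters when the same word appears more than once.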

Entity linking

A table being extracted from a scanned document

Layout analysis

Layout analysis consists of labeling document structures so that they can be transformed into another format (e.g. JSON).
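
As a rough sketch of that transformation, the example below serializes a few labeled regions to JSON. The region types, the `[x, y, width, height]` bounding-box convention, the sample values, and the `layout_to_json` helper are assumptions chosen for illustration; real tools define their own export schemas.

```python
import json

# Hypothetical labeled regions for one page; bbox is [x, y, width, height] in pixels.
regions = [
    {"type": "table", "bbox": [40, 260, 520, 180], "text": None},
    {"type": "title", "bbox": [40, 30, 520, 48], "text": "Bill of Lading"},
    {"type": "paragraph", "bbox": [40, 100, 520, 120], "text": "Shipper and consignee details"},
]


def layout_to_json(regions, page_number=1):
    """Serialize labeled regions to JSON in a simple reading order."""
    # Sort top to bottom, then left to right, as a basic reading-order heuristic.
    ordered = sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    return json.dumps({"page": page_number, "regions": ordered}, indent=2)


print(layout_to_json(regions))
```

The top-to-bottom sort is only a heuristic; multi-column pages would need a more careful ordering.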

How to label texts, PDFs, and Images?

In real life, textual data exists in a wide range of formats: TXT files, PDFs, or even text in images and scanned documents. In this part, we are going to dive deep into the specificities of those data formats and the features that are mandatory to label them efficiently.

Labeling text data

When labeling text data in simple txt format, the following features are important:

Multilingual Support

Annotated English Document

Annotated Arabic Document

Annotated Chinese Document
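
One practical aspect of multilingual support is that label offsets must stay valid whatever the script. The sketch below is an assumption about how such a check might look, not a description of any specific tool: it normalizes the text to Unicode NFC and computes offsets in code points, which is how Python strings are indexed. The `find_span` helper and the sample sentences are made up for illustration.

```python
import unicodedata


def find_span(text, surface):
    """Return (start, end) code-point offsets of `surface` in NFC-normalized `text`."""
    text = unicodedata.normalize("NFC", text)
    surface = unicodedata.normalize("NFC", surface)
    start = text.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found")
    return start, start + len(surface)


# The same invented sentence in English, Arabic, and Chinese, with the entity to label.
samples = [
    ("Marie lives in Paris.", "Paris"),
    ("تعيش ماري في باريس.", "باريس"),
    ("玛丽住在巴黎。", "巴黎"),
]

for text, surface in samples:
    start, end = find_span(text, surface)
    normalized = unicodedata.normalize("NFC", text)
    # Slicing with the stored offsets should give back the labeled entity.
    print(start, end, normalized[start:end])
```

Offsets counted this way are code-point offsets; environments that index strings by bytes or UTF-16 units would need a conversion step.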
