Text Information Extraction Scenario

Human communication increasingly relies on digital text exchange, such as websites, emails and chat systems. Although humans can easily interpret a text written in a natural language they have learned, the vast amount of data to read and scan for useful information is simply overwhelming. Therefore, over the last decades, many advances have been made in using machines for automatic classification, interpretation and translation of textual data.

The most prominent Natural Language Processing (NLP) systems are search engines, since many people use them on a daily basis. However, correctly interpreting a user’s search query and the content of websites in order to provide the most relevant search results is a major challenge. For example, a user looking for the final score of last night’s soccer game might formulate the search query “Bayern München score”. The search engine must then be able to interpret “Bayern München” as an organisation and not as the two locations “Bayern” and “München”. In this case, the word “score” is a good hint that helps the search engine tag the entity “Bayern München” correctly.

Named Entity Recognition

The search engine example describes a common NLP information extraction task called Named Entity Recognition (NER): a sentence or sequence of words is given to an algorithm as input, and the algorithm produces a sequence of NER tags as output. The NER problem can be approached from two angles. A computational linguist can program a complex set of handwritten rules (e.g. regular expressions) to tag words. Creating this kind of expert system requires a lot of human labor, because every sentence is different and each word can occur in an unlimited number of contexts. Therefore we focus on a more general approach based on machine learning techniques, where an algorithm learns to solve this problem from an existing set of training data. A well-trained machine learning system that generalizes well will then be able to produce a correct output for a given sentence, even if that sentence was not included in the training data. The requirement for this to succeed is a large text corpus of sentences in which each word has already been correctly annotated with its corresponding named entity tag.
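To make the rule-based angle concrete, the following is a minimal sketch of a handwritten tagger. The rule list, the function name `rule_based_tags` and the tag names ORG (organisation) and LOC (location) are illustrative assumptions, not part of any real system. It also assumes that tokens are separated by whitespace.

```python
import re

# Hypothetical handwritten rules, checked in order: (pattern, tag).
RULES = [
    (re.compile(r"\bBayern München\b"), "ORG"),
    (re.compile(r"\b(Bayern|München|Berlin)\b"), "LOC"),
]

def rule_based_tags(sentence):
    """Tag whitespace-separated tokens; the first matching rule wins."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)

    # Record the character offset at which every token starts,
    # so that regex matches can be mapped back to tokens.
    starts, pos = [], 0
    for token in tokens:
        pos = sentence.index(token, pos)
        starts.append(pos)
        pos += len(token)

    for pattern, tag in RULES:
        for match in pattern.finditer(sentence):
            for i, start in enumerate(starts):
                # Only tag tokens that no earlier rule has claimed.
                if match.start() <= start < match.end() and tags[i] == "O":
                    tags[i] = tag
    return list(zip(tokens, tags))

query_tags = rule_based_tags("Bayern München score")
city_tags = rule_based_tags("Ich wohne in München")
```

Even this toy version shows the maintenance problem: every new club, city or context needs another rule, and the fixed rule order decides ambiguous cases regardless of the surrounding words.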

Named Entity Tags

The following sentence is a simplified example from the GermEval 2014 corpus:

Term                     Tag
Bayern                   ORG
München                  ORG
ist                      O
wieder                   O
alleiniger               O
Top-                     O
Favorit                  O
auf                      O
den                      O
Gewinn                   O
der                      O
deutschen                LOC
Fußball-Meisterschaft    O
.                        O

In this example each term of the sentence is annotated with a named entity tag. Both terms “Bayern” and “München” are tagged with ORG (organisation) as expected in this context. The term “deutschen” is tagged as LOC (location). Every term not representing a named entity is tagged as O. This notation scheme is called NER-IO.

With NER-IO it is impossible to tell whether “Bayern München” is a single ORG entity or two separate, adjacent ORG entities. The following example shows the NER-IOB notation, which solves this problem.

Term                     Tag
Bayern                   B-ORG
München                  I-ORG
ist                      O
wieder                   O
alleiniger               O
Top-                     O
Favorit                  O
auf                      O
den                      O
Gewinn                   O
der                      O
deutschen                B-LOC
Fußball-Meisterschaft    O
.                        O

In IOB notation, the prefix B- marks the first term of an entity and the prefix I- marks each following term of the same entity.
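The way IOB tags group terms into entities can be sketched in a few lines. The function name `iob_to_spans` is an assumption made for illustration. Treating a stray I- tag without a preceding B- as the start of a new entity is a design choice of this sketch, not something the notation prescribes.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def close_span():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts here (a stray I- is treated like B-).
            close_span()
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside of any entity.
            close_span()
            current_type, current_tokens = None, []
    close_span()
    return spans

tokens = ["Bayern", "München", "ist", "wieder", "alleiniger", "Top-", "Favorit",
          "auf", "den", "Gewinn", "der", "deutschen", "Fußball-Meisterschaft", "."]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "O",
        "O", "O", "O", "O", "B-LOC", "O", "O"]
entities = iob_to_spans(tokens, tags)

# Two B- tags in a row denote two separate entities.
separate = iob_to_spans(["Bayern", "München"], ["B-LOC", "B-LOC"])
```

For the example sentence this yields one ORG entity “Bayern München” and one LOC entity “deutschen”, while the second call returns two separate LOC entities.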

Algorithms

Named Entity Recognition is a sequence-to-sequence learning problem, because each sentence can be seen as a sequence of terms, which should be automatically tagged with an equally long sequence of NER tags.

Algorithms tackling this problem include Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM) and Conditional Random Fields (CRF), which are probabilistic graphical models (an HMM, for instance, is estimated by counting occurrences of term and tag sequences), as well as Deep Learning approaches like Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN).
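The counting idea can be illustrated for the HMM case. The sketch below estimates transition and emission probabilities by relative frequency from a tiny, made-up tagged corpus. The function name `estimate_hmm`, the start symbol `<s>` and the toy sentences are assumptions. Smoothing and end-of-sentence handling are left out, and MEMMs and CRFs are trained differently.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences, start="<s>"):
    """Estimate HMM probabilities by relative frequency (no smoothing)."""
    transition_counts = defaultdict(Counter)  # previous tag -> next tag
    emission_counts = defaultdict(Counter)    # tag -> term

    for sentence in tagged_sentences:
        previous = start
        for term, tag in sentence:
            transition_counts[previous][tag] += 1
            emission_counts[tag][term] += 1
            previous = tag

    def normalize(counts):
        # Turn each row of counts into a probability distribution.
        return {
            context: {key: n / sum(row.values()) for key, n in row.items()}
            for context, row in counts.items()
        }

    return normalize(transition_counts), normalize(emission_counts)

# Made-up toy corpus, for illustration only.
toy_corpus = [
    [("Bayern", "B-ORG"), ("München", "I-ORG"), ("gewinnt", "O")],
    [("München", "B-LOC"), ("ist", "O"), ("schön", "O")],
]
transitions, emissions = estimate_hmm(toy_corpus)
```

Here `transitions["<s>"]` assigns probability 0.5 each to B-ORG and B-LOC, because one of the two toy sentences starts with each tag. A tagger would combine these tables with a decoding algorithm such as Viterbi to find the most probable tag sequence.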

The deep.TEACHING project provides educational material for students to gain basic knowledge about the problem domain, the programming, math and statistics requirements, as well as the mentioned algorithms and their evaluation. Students will also learn how to construct complex machine learning systems, which can incorporate several algorithms at once.

Learning about Named Entity Recognition and the applicable algorithms will provide students with knowledge that is transferable to similar problems. Similar NLP tasks include Part-of-Speech tagging (tagging terms as nouns, verbs, etc.) and sentence splitting (tagging the end of each sentence). Besides text, sequence learning is also applicable to signal processing (e.g. detecting sleep stages in biosignals recorded in sleep medicine), audio data (e.g. text-to-speech) and various other fields.

NER Corpora

Corpus                                           Language    Samples    Source    Info
GermEval 2014                                    German      31302      Link      Notebook
CoNLL 2002                                       Spanish     11755      Link      Available in NLTK (Python)
CoNLL 2002                                       Dutch       23896      Link      Available in NLTK (Python)
Annotated Corpus for Named Entity Recognition    English     47959      Link      Download requires Kaggle account
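Annotated NER corpora are often distributed as plain-text files with one term per line, its tag in a later column, and a blank line between sentences. The sketch below reads such a layout. The function name `read_column_corpus`, the column positions and the sample lines are assumptions, and the actual file layout of each corpus listed above should be checked before use.

```python
def read_column_corpus(lines, term_col=0, tag_col=-1):
    """Read whitespace-separated columns into sentences of (term, tag) pairs.

    Assumes one term per line and a blank line between sentences.
    """
    sentences, current = [], []
    for line in lines:
        columns = line.split()
        if not columns:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((columns[term_col], columns[tag_col]))
    if current:
        # The file may end without a trailing blank line.
        sentences.append(current)
    return sentences

# Made-up sample in the assumed layout, for illustration only.
sample = """Bayern B-ORG
München I-ORG
gewinnt O

Berlin B-LOC
""".splitlines()

corpus = read_column_corpus(sample)
```

With a real file, the same function can be called on an open file handle, adjusting `term_col` and `tag_col` to the columns that the corpus actually uses.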

Educational Materials

Work in Progress.

Machine Learning Fundamentals

  • Graphical Models

    • Markov Models (see Exercise: Bi-Gram Language Model)
    • Hidden Markov Models (HMM)
    • Maximum Entropy Markov Models (MEMM)
    • Linear-Chain Conditional Random Fields (CRF)
  • Linear Models

    • Linear Regression
    • Logistic Regression
  • Neural Networks

    • Feed Forward Artificial Neural Networks (ANN)
    • Backpropagation
    • Convolutional Neural Networks (CNN)
    • Recurrent Neural Networks (RNN)
    • Gated Recurrent Unit (GRU)
    • Long Short-Term Memory (LSTM)

Text Information Extraction

  • Sequences

    • Exercise: Bi-Gram Language Model
    • Exercise: RNN for Character Language Model
    • Prerequisites: Recurrent Neural Networks
    • Exercise: CNN for Sequences
    • Prerequisites: Convolutional Neural Networks
  • Word Vectors

    • Skip-Grams
    • Exercise: Continuous Bag of Words (CBOW)
    • Prerequisites: Feed Forward Artificial Neural Networks
  • Sequence Tagging

    • Exercise: HMM for Named Entity Recognition
    • Prerequisites: Hidden Markov Models
    • Exercise: Bi-LSTM and CRF for Named Entity Recognition
    • Prerequisites: Long Short-Term Memory, Conditional Random Fields