Introduction

The GermEval 2014 text corpus [BEN14, BEN14a] contains German sentences annotated with Named Entity Recognition (NER) tags. It can be used as training data for language models as well as for NER tagging algorithms. The corpus is split into three files: training, test, and validation data.

This notebook shows how to load the corpus using a data manager class from the deep-teaching-commons Python module and gives explanations of the data.

Required Knowledge

To learn more about Named Entity Recognition, read the introduction to the deep.TEACHING Text Information Extraction scenario.

Required Python Modules

# Python Standard Library
from pprint import pprint

# External Modules
from deep_teaching_commons.data.text.germ_eval_2014 import GermEval2014

Data Exploration

# create a data manager to download and read the text corpus
dm = GermEval2014()
auto download is active, attempting download
data directory already exists, no download required
# call for help to learn about the data manager
help(GermEval2014)
Help on class GermEval2014 in module deep_teaching_commons.data.text.germ_eval_2014:

class GermEval2014(builtins.object)
 |  Methods defined here:
 |  
 |  __init__(self, base_data_dir=None, data_url=None, auto_download=True, verbose=True)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  download(self)
 |  
 |  test_sequences(self)
 |  
 |  train_sequences(self)
 |  
 |  val_sequences(self)
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  sequences(data_file)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Taking a Look at the Training Data

# Counting all sequences
sequence_count = 0
for _ in dm.train_sequences():
    sequence_count += 1

'The GermEval 2014 training corpus contains {} sentences.'.format(sequence_count)

Calling dm.train_sequences() creates a Python generator. This generator iterates through the training data text corpus by lazily loading one line at a time. Therefore this generator has a low memory footprint. A generator can only be used once and needs to be reinstantiated to start iterating from the beginning.

Other generators provided by the data manager are dm.test_sequences() and dm.val_sequences() for test and validation data respectively.
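To make this one-pass behaviour concrete, here is a minimal, self-contained sketch using a toy generator as a stand-in for the actual data manager (the helper toy_sequences and its three example strings are illustrative, not part of deep-teaching-commons):

```python
def toy_sequences():
    # Stand-in for dm.train_sequences(): yields items lazily, one at a time
    for sentence in ['Satz eins', 'Satz zwei', 'Satz drei']:
        yield sentence

gen = toy_sequences()
print(sum(1 for _ in gen))              # 3 -- the first pass consumes all items
print(sum(1 for _ in gen))              # 0 -- the generator is now exhausted
print(sum(1 for _ in toy_sequences()))  # 3 -- a fresh generator starts over
```

This is why the counting loop above and the printing loop below each call dm.train_sequences() again instead of reusing one generator object.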

# Printing the first four sequences
for i, s in enumerate(dm.train_sequences()):
    if i == 4:
        break
    print('Sequence {}'.format(i+1))
    pprint(s)
    print()
Sequence 1
[('Schartau', 'B-PER', 'O'),
 ('sagte', 'O', 'O'),
 ('dem', 'O', 'O'),
 ('"', 'O', 'O'),
 ('Tagesspiegel', 'B-ORG', 'O'),
 ('"', 'O', 'O'),
 ('vom', 'O', 'O'),
 ('Freitag', 'O', 'O'),
 (',', 'O', 'O'),
 ('Fischer', 'B-PER', 'O'),
 ('sei', 'O', 'O'),
 ('"', 'O', 'O'),
 ('in', 'O', 'O'),
 ('einer', 'O', 'O'),
 ('Weise', 'O', 'O'),
 ('aufgetreten', 'O', 'O'),
 (',', 'O', 'O'),
 ('die', 'O', 'O'),
 ('alles', 'O', 'O'),
 ('andere', 'O', 'O'),
 ('als', 'O', 'O'),
 ('überzeugend', 'O', 'O'),
 ('war', 'O', 'O'),
 ('"', 'O', 'O'),
 ('.', 'O', 'O')]

Sequence 2
[('Firmengründer', 'O', 'O'),
 ('Wolf', 'B-PER', 'O'),
 ('Peter', 'I-PER', 'O'),
 ('Bree', 'I-PER', 'O'),
 ('arbeitete', 'O', 'O'),
 ('Anfang', 'O', 'O'),
 ('der', 'O', 'O'),
 ('siebziger', 'O', 'O'),
 ('Jahre', 'O', 'O'),
 ('als', 'O', 'O'),
 ('Möbelvertreter', 'O', 'O'),
 (',', 'O', 'O'),
 ('als', 'O', 'O'),
 ('er', 'O', 'O'),
 ('einen', 'O', 'O'),
 ('fliegenden', 'O', 'O'),
 ('Händler', 'O', 'O'),
 ('aus', 'O', 'O'),
 ('dem', 'O', 'O'),
 ('Libanon', 'B-LOC', 'O'),
 ('traf', 'O', 'O'),
 ('.', 'O', 'O')]

Sequence 3
[('Ob', 'O', 'O'),
 ('sie', 'O', 'O'),
 ('dabei', 'O', 'O'),
 ('nach', 'O', 'O'),
 ('dem', 'O', 'O'),
 ('Runden', 'O', 'O'),
 ('Tisch', 'O', 'O'),
 ('am', 'O', 'O'),
 ('23.', 'O', 'O'),
 ('April', 'O', 'O'),
 ('in', 'O', 'O'),
 ('Berlin', 'B-LOC', 'O'),
 ('durch', 'O', 'O'),
 ('ein', 'O', 'O'),
 ('pädagogisches', 'O', 'O'),
 ('Konzept', 'O', 'O'),
 ('unterstützt', 'O', 'O'),
 ('wird', 'O', 'O'),
 (',', 'O', 'O'),
 ('ist', 'O', 'O'),
 ('allerdings', 'O', 'O'),
 ('zu', 'O', 'O'),
 ('bezweifeln', 'O', 'O'),
 ('.', 'O', 'O')]

Sequence 4
[('Bayern', 'B-ORG', 'B-LOC'),
 ('München', 'I-ORG', 'B-LOC'),
 ('ist', 'O', 'O'),
 ('wieder', 'O', 'O'),
 ('alleiniger', 'O', 'O'),
 ('Top-', 'O', 'O'),
 ('Favorit', 'O', 'O'),
 ('auf', 'O', 'O'),
 ('den', 'O', 'O'),
 ('Gewinn', 'O', 'O'),
 ('der', 'O', 'O'),
 ('deutschen', 'B-LOCderiv', 'O'),
 ('Fußball-Meisterschaft', 'O', 'O'),
 ('.', 'O', 'O')]

Explaining the Data

As can be seen in the output above, each sequence (sentence) is parsed as a list of triples, where each triple consists of a word, an outer NER tag and an inner NER tag.

Having two types of NER tags is a speciality of the GermEval corpus and is best explained with an example: Sequence 4 starts with the terms Bayern and München. In the context of this sentence, they form a single named entity of type ORG (organisation). Therefore the outer NER tag of Bayern is B-ORG (B marks the beginning of an entity) and the outer NER tag of München is I-ORG (I marks a token inside an entity). Since both Bayern and München are also named entities by themselves, their inner NER tags are B-LOC (location).
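To make the B-/I- scheme concrete, the following sketch groups consecutive tagged tokens into entities. The helper extract_entities is illustrative, not part of the data manager, and for simplicity it ignores the corner case of an I- tag whose label differs from the preceding entity:

```python
# The first tokens of sequence 4 above: (word, outer tag, inner tag)
sequence = [
    ('Bayern', 'B-ORG', 'B-LOC'),
    ('München', 'I-ORG', 'B-LOC'),
    ('ist', 'O', 'O'),
]

def extract_entities(seq, tag_index=1):
    """Group consecutive B-/I- tokens into (entity_text, label) pairs."""
    entities = []
    for triple in seq:
        word, tag = triple[0], triple[tag_index]
        if tag.startswith('B-'):
            entities.append(([word], tag[2:]))   # start a new entity
        elif tag.startswith('I-') and entities:
            entities[-1][0].append(word)         # extend the current entity
    return [(' '.join(words), label) for words, label in entities]

print(extract_entities(sequence))               # outer: [('Bayern München', 'ORG')]
print(extract_entities(sequence, tag_index=2))  # inner: [('Bayern', 'LOC'), ('München', 'LOC')]
```

Note how the outer tags yield one two-word ORG entity, while the inner tags yield two separate LOC entities, exactly as described above.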

The concept of inner and outer NER tags is advanced; new learners should focus on the outer NER tags first.

# list all NER tags
ner_tags = set()
for s in dm.train_sequences():
    ner_tags.update([t[1] for t in s] + [t[2] for t in s])

sorted(list(ner_tags))

Glossary

Tag    Description                         Example
LOC    Location                            "Deutschland"
ORG    Organisation                        "ARD"
OTH    Other                               "Euro"
PER    Person                              "Ronaldo"
O      Not an entity                       "nicht"
part   Entity is part of a compound term   "ARD-Programmchef" is "ORGpart", because "ARD" is "ORG"
deriv  Entity is a derived term            "deutsche" is "LOCderiv", because "Deutschland" is "LOC"
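The tag strings combine these pieces, e.g. B-LOCderiv is the beginning of an entity derived from a location. A small helper can split a tag into its position prefix, base category, and optional suffix (parse_tag is a hypothetical helper for illustration, not part of the deep-teaching-commons module):

```python
def parse_tag(tag):
    """Split a GermEval tag like 'B-LOCderiv' into (position, category, subtype)."""
    if tag == 'O':
        return ('O', None, None)
    position, rest = tag.split('-', 1)   # e.g. 'B', 'LOCderiv'
    for suffix in ('deriv', 'part'):
        if rest.endswith(suffix):
            return (position, rest[:-len(suffix)], suffix)
    return (position, rest, None)

print(parse_tag('B-LOCderiv'))  # ('B', 'LOC', 'deriv')
print(parse_tag('I-ORG'))       # ('I', 'ORG', None)
print(parse_tag('O'))           # ('O', None, None)
```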

Summary and Outlook

You have learned how to read the GermEval 2014 text corpus using the deep-teaching-commons Python module. Take a look at the Exercise: Bi-Gram Language Model notebook for a usage example.

Literature

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g. images).

Text Information Extraction - Data Exploration - Germ Eval 2014
by Christoph Jansen (deep.TEACHING - HTW Berlin)
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christoph Jansen

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.