Title:SoNaR Corpus
Bibliographic Citation:http://hdl.handle.net/11372/LRT-1498
Creator:Radboud University, CLST
Tilburg University, ILK
University of Twente, HMI
University College Ghent, Faculty of Translation Studies
KU Leuven, CCL
Utrecht University, UiL OTS
Date (W3CDTF):2015-06-29T13:23:32Z
Date Available:2015-06-29T13:23:32Z
Description:The SoNaR corpus is a 500-million-word reference corpus of contemporary written Dutch. It consists of two parts, SoNaR500 and SoNaR1. SoNaR500 contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types. All texts were tokenized, POS-tagged and lemmatized, and named entities were labelled. All annotations in SoNaR500 were generated automatically. SoNaR1 is largely a subset of SoNaR500 and contains 1 million words. It was enriched with several types of semantic annotation: named entity labelling, coreference resolution, annotation of spatial and temporal expressions, and annotation of semantic roles. All annotations in SoNaR1 were manually verified. The new media texts (tweets, chats and SMS), which were also collected during the STEVIN project SoNaR, are not part of the SoNaR corpus; they are distributed separately as the SoNaR New Media Corpus.
Identifier (URI):http://hdl.handle.net/11372/LRT-1498
Language (ISO639):nld
Publisher:Dutch-Flemish HLT Agency
Subject:monolingual corpus
annotated corpus
written language
Type (DCMI):Text
Type (OLAC):primary_text


Citation: Radboud University, CLST; Tilburg University, ILK; University of Twente, HMI; University College Ghent, Faculty of Translation Studies; KU Leuven, CCL; Utrecht University, UiL OTS. 2015. SoNaR Corpus. Dutch-Flemish HLT Agency. http://hdl.handle.net/11372/LRT-1498
