OLAC Record
oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-984

Metadata
Title:TrAVaSI_GDLI-quotation corpus
Bibliographic Citation:http://hdl.handle.net/20.500.11752/ILC-984
Creator:Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco; Montemagni, Simonetta
Date (W3CDTF):2023-01-09T08:40:48Z
Date Available:2023-01-09T08:40:48Z
Description:The TrAVaSI_GDLI-quotation corpus (TrAVaSI_GDLI-QC) is a first nucleus of a diachronic corpus of Italian. It collects a sample of the quotations in a historical dictionary, the "Grande Dizionario della Lingua Italiana" (GDLI) by Salvatore Battaglia, which includes a huge collection of quotations covering the entire history of the Italian language, from the Middle Ages to the present day. Several criteria guided the composition of the corpus. Among the most cited authors, those who guaranteed coverage of the widest chronological span were selected. Representativeness of different text typologies (e.g. chronicle, literary prose, poetry, treatises) was also taken into account. The resulting TrAVaSI_GDLI-QC consists of two balanced sub-corpora, with quotations from works written between the 14th and 20th centuries: one collecting 1,500 prose quotes from 15 authors (100 each) for a total of about 35,000 tokens, and the other gathering 500 poetry quotes from 10 authors (50 each) for a total of about 10,000 tokens. TrAVaSI_GDLI-QC is morpho-syntactically annotated and lemmatized. The annotation, which conforms to the Universal Dependencies standard (UD, De Marneffe et al. 2021), was carried out semi-automatically. First, both sub-corpora were automatically annotated with the Stanza “combined” model for Italian. The automatic annotation was then manually revised. The resulting corpus has also been used to retrain Stanza to deal with historical varieties of the Italian language; the results achieved are encouraging.
Identifier (URI):http://hdl.handle.net/20.500.11752/ILC-984
Language:Italian
Language (ISO639):ita
Publisher:Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR); Accademia della Crusca
Rights:Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
http://creativecommons.org/licenses/by-nc-nd/4.0/
Subject:historical annotated corpora; linguistic annotation; Universal Dependencies
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, in Pisa
Description:  http://www.language-archives.org/archive/dspace-clarin-it.ilc.cnr.it
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-984
DateStamp:  2023-01-09
GetRecord:  OAI-PMH request for simple DC format
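
The GetRecord entries in this record refer to OAI-PMH requests. As a minimal sketch, such a request can be assembled from the OaiIdentifier given above. The base URL below is a placeholder (this record does not state the repository's OAI-PMH endpoint), the function name is hypothetical, and the metadata prefixes olac and oai_dc are assumed to be the ones the repository uses for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, not taken from this record.
BASE_URL = "https://example.org/oai/request"

# Identifier copied from the OaiIdentifier field of this record.
OAI_ID = "oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-984"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",            # OAI-PMH verb for fetching one record
        "identifier": identifier,       # percent-encoded by urlencode
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple DC.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")

if __name__ == "__main__":
    print(olac_url)
    print(dc_url)
```

Replacing BASE_URL with the repository's actual endpoint should yield requests equivalent to the two GetRecord links listed in this record.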

Search Info

Citation: Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco; Montemagni, Simonetta. 2023. Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR).
Terms: area_Europe country_IT dcmi_Text iso639_ita olac_primary_text


http://www.language-archives.org/item.php/oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-984
Up-to-date as of: Tue Sep 19 0:43:06 EDT 2023