OLAC Record oai:lindat.mff.cuni.cz:11372/LRT-1230 |
Metadata | ||
Title: | LX-Tokenizer | |
Bibliographic Citation: | http://hdl.handle.net/11372/LRT-1230 | |
Contributor: | Branco, António | |
Silva, João | ||
Date (W3CDTF): | 2014-07-30T21:28:17Z | |
Date Available: | 2014-07-30T21:28:17Z | |
Description: | Automatic segmenter of lexemes of Portuguese. Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly. um exemplo → |um|exemplo| Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol: do → |de_|o| Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively: um, dois e três → |um|,*/|dois|e|três| 5.3 → |5|.|3| 1. 2 → |1|.*/|2| 8 . 6 → |8|\*.*/|6| Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol: dá-se-lho → |dá|-se|-lhe|-o| afirmar-se-ia → |afirmar-CL-ia|-se| vê-las → |vê#|-las| This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance: deste → |deste| when occurring as a Verb; deste → |de|este| when occurring as a contraction (Preposition + Demonstrative). This tool achieves an f-score of 99.72%. | |
Identifier (URI): | http://hdl.handle.net/11372/LRT-1230 | |
Language: | Portuguese | |
Language (ISO639): | por | |
Publisher: | NLX-Natural Language and Speech Group, University of Lisbon | |
Type: | toolService | |
Type (DCMI): | Software | |
OLAC Info |
||
Archive: | LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University | |
Description: | http://www.language-archives.org/archive/lindat.mff.cuni.cz | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:lindat.mff.cuni.cz:11372/LRT-1230 | |
DateStamp: | 2016-04-06 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Branco, António; Silva, João. 2014. NLX-Natural Language and Speech Group, University of Lisbon. | |
Terms: | area_Europe country_PT dcmi_Software iso639_por |
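
The output conventions listed in the Description field can be illustrated with a minimal Python sketch. This is not the LX-Tokenizer implementation, only an approximation of three of the documented behaviours: splitting on whitespace, expanding a contraction, and marking spacing around punctuation. The names `tokenize`, `show`, `CONTRACTIONS`, and `TOKEN_RE` are hypothetical, the contraction table holds only the single entry given in the record, and the treatment of punctuation at the start or end of the input (no mark) is an assumption. Clitic detachment and ambiguous strings such as "deste" are not handled; hyphenated forms are left as one token.

```python
import re

# Illustrative lookup table (assumption): only "do" comes from the record.
CONTRACTIONS = {"do": ("de_", "o")}

# A word (hyphenated forms kept whole) or a single punctuation/symbol character.
TOKEN_RE = re.compile(r"(?P<word>\w+(?:-\w+)*)|(?P<punct>[^\w\s])")


def tokenize(text):
    """Return tokens using the marking conventions described in the record."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.group("word") is not None:
            word = m.group("word")
            # Expand known contractions; the first element carries "_".
            tokens.extend(CONTRACTIONS.get(word, (word,)))
            continue
        start, end = m.span()
        mark = m.group("punct")
        # "\*" marks a space to the left of the symbol.
        if start > 0 and text[start - 1].isspace():
            mark = "\\*" + mark
        # "*/" marks a space to the right of the symbol.
        if end < len(text) and text[end].isspace():
            mark = mark + "*/"
        tokens.append(mark)
    return tokens


def show(tokens):
    """Render tokens with vertical bars, as in the record's examples."""
    return "|" + "|".join(tokens) + "|"


if __name__ == "__main__":
    for sample in ["um exemplo", "do", "um, dois e três", "5.3", "1. 2", "8 . 6"]:
        print(sample, "→", show(tokenize(sample)))
```

Running the script prints each sample next to its bar-delimited tokenization, which can be compared against the examples in the Description field.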