Deltacorpus 1.1

  • Author(s): Mareček, David; Yu, Zhiwei; Zeman, Daniel; Žabokrtský, Zdeněk
  • Subject Terms:
  • Document Type:
    text
  • Language:
    Belarusian
    Bosnian
    Bulgarian
    Czech
    Croatian
    Sorbian languages
    Macedonian
    Polish
    Russian
    Slovak
    Slovenian
    Serbian
    Ukrainian
    Latvian
    Lithuanian
    Afrikaans
    Danish
    German
    English
    Faroese
    Western Frisian
    Swiss German; Alemannic; Alsatian
    Icelandic
    Low German; Low Saxon; German, Low; Saxon, Low
    Dutch; Flemish
    Norwegian Nynorsk; Nynorsk, Norwegian
    Norwegian
    Scots
    Swedish
    Yiddish
    Aragonese
    Asturian; Bable; Leonese; Asturleonese
    Catalan; Valencian
    French
    Galician
    Haitian; Haitian Creole
    Italian
    Latin
    Portuguese
    Romanian; Moldavian; Moldovan
    Spanish; Castilian
    Breton
    Welsh
    Gaelic; Scottish Gaelic
    Irish
    Greek, Modern (1453-)
    Armenian
    Albanian
    Persian
    Kurdish
    Tajik
    Bengali
    Gujarati
    Hindi
    Marathi
    Nepali
    Urdu
    Amharic
    Arabic
    Egyptian (Ancient)
    Hebrew
    Estonian
    Finnish
    Hungarian
    Basque
    Georgian
    Chuvash
    Azerbaijani
    Turkish
    Uzbek
    Tatar
    Yakut
    Korean
    Mongolian
    Telugu
    Kannada
    Malayalam
    Tamil
    Nepal Bhasa; Newari
    Vietnamese
    Indonesian
    Javanese
    Malagasy
    Maori
    Malay
    Tagalog
    Waray
    Swahili
    Esperanto
    Interlingua (International Auxiliary Language Association)
  • Additional Information
    • Publication Information:
      Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
    • Publication Date:
      2016
    • Collection:
      OLAC: Open Language Archives Community
    • Abstract:
      Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). Changes in version 1.1: 1. The Universal Dependencies tagset is used instead of the older and smaller Google Universal POS tagset. 2. The SVM classifier was trained on Universal Dependencies 1.2 instead of HamleDT 2.0. 3. Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained only on the respective group of languages; all other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
    • Relation:
      http://hdl.handle.net/11234/1-1662; http://hdl.handle.net/11234/1-1743
    • Online Access:
      http://hdl.handle.net/11234/1-1743
    • Rights:
      Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); http://creativecommons.org/licenses/by-sa/4.0/
    • Accession Number:
      edsbas.E919EEFC