Automatic inflection in Czech language ; Automatické skloňování v češtině

Author(s): Sourada, Tomáš
Subject Terms:
automatické skloňování|morfologie|generování přirozeného jazyka|čeština|skloňování|flexe|neslovníková slova; automatic inflection|morphology|natural language generation|Czech language|inflection|declension|morphological inflection|out-of-vocabulary words
Document Type:
bachelor thesis
Language:
English

Additional Information
- Contributors:
  Rosa, Rudolf; Vidra, Jonáš; Straková, Jana
- Publication Information:
  Univerzita Karlova, Matematicko-fyzikální fakulta
- Publication Date:
  2023
- Collection:
  Charles University: CU Digital repository / Univerzita Karlova: Digitální repozitář UK
- Abstract:
  This thesis focuses on the task of automatic morphological inflection of Czech nouns, specifically in out-of-vocabulary (OOV) conditions (inflecting previously unseen words). We automatically extracted a large dataset suitable for training and evaluation in the OOV conditions. We also manually built a real-world OOV dataset of neologisms. We developed three different systems: a retrograde model performing a variation of kNN algorithm, and two sequence-to-sequence (seq2seq) models based on LSTM and Transformer. Compared to an available rule-based inflection system sklonuj.cz and standard SIGMORPHON shared task baselines, our seq2seq model reaches the best results in the standard OOV conditions. Moreover, it achieves state-of-the-art results for 6 out of 16 development languages from SIGMORPHON 2022 shared task data in the OOV evaluation (feature overlap) on large data condition. On the real-world OOV dataset, the retrograde model outperforms all neural models and is competitive with a non-neural SIGMORPHON baseline. We release the inflection system with seq2seq model as a ready-to-use Python library. It could serve as a complement to the state-of-the-art dictionary-based inflection system MorphoDiTa as a back-off for OOV words, especially once extended to other parts of speech. ; Tato bakalářská práce se zaměřuje na automatické skloňování českých podstatných jmen, zejména slov, která nejsou zahrnuta ve slovníku (tzv. out-of-vocabulary, OOV) - skloňování předem neviděných slov. Automaticky jsme extrahovali rozsáhlý dataset vhodný pro trénování a vyhodnocení za OOV podmínek. Dále jsme manuálně vytvořili dataset vyskloňovaných reálných OOV slov - neologismů. Vyvinuli jsme tři různé systémy: retrográdní model založený na algoritmu k-nejbližších sousedů (kNN) a dva modely sequence-to-sequence (seq2seq) založené na LSTM a Transformeru. V porovnání se stávajícím skloňovacím systémem sklonuj.cz a standardními baseline systémy ze SIGMORPHON shared tasks jsme za OOV podmínek s naším seq2seq modelem dosáhli ...
- File Description:
  application/pdf; application/zip
- Relation:
  http://hdl.handle.net/20.500.11956/184286; 253748
- Online Access:
  https://doi.org/20.500.11956/184286
  https://hdl.handle.net/20.500.11956/184286
- Accession Number:
  edsbas.F8B24AB3
