Integración de embeddings de nueva generación y recursos lingüísticos actuales para identificar palabras complejas en español con machine learning ; Integration of new generation embeddings and current linguistic resources to identify complex words in Spanish with machine learning

Author(s): Mera Dávila, Luis Iván
Source:
Revista Peruana de Computación y Sistemas; Vol. 6 No. 2 (2024); 55-64; 2617-2003
Subject Terms:
Complex word identification; Embeddings; Lexical Simplification; Spanish; Identificación de palabras complejas; Simplificación Léxica; Español
Document Type:
article in journal/newspaper
Language:
Spanish; Castilian

Additional Information
- Publication Information:
  Universidad Nacional Mayor de San Marcos, Facultad de Ingeniería de Sistemas e Informática
- Publication Date:
  2024
- Collection:
  Universidad Nacional Mayor de San Marcos: Revistas de investigación UNMSM
- Abstract:
  Complex words can limit access to information, which could affect millions of Spanish speakers. This study aims to develop a machine learning model for the binary task of identifying complex words in Spanish, using next-generation embeddings, current linguistic resources, and lexical properties. The Spanish dataset from the CWI Shared Task 2018 was used, with embeddings generated by the text-embedding-3-large model and word frequencies extracted from resources such as the Corpus del Español del Siglo XXI, the Corpus de Referencia del Español Actual, the Spanish Billion Word Corpus and Embeddings, and Wordfreq. Features were selected, and their best combination found, through 5-fold cross-validation with XGBClassifier. After several machine learning algorithms were compared, the final model, based on LGBMClassifier, achieved a macro F1 score of 0.7993, surpassing the best team from that competition, more recent studies that used neural networks, and some large language models. This demonstrates the potential of these constantly updated resources to improve accuracy on this task.
- File Description:
  application/pdf
- Relation:
  https://revistasinvestigacion.unmsm.edu.pe/index.php/index/article/view/29211/21732; https://revistasinvestigacion.unmsm.edu.pe/index.php/index/article/view/29211
- DOI:
  10.15381/rpcs.v6i2.29211
- Online Access:
  https://revistasinvestigacion.unmsm.edu.pe/index.php/index/article/view/29211
  https://doi.org/10.15381/rpcs.v6i2.29211
- Rights:
  Copyright 2024 Luis Iván Mera Dávila; https://creativecommons.org/licenses/by/4.0
- Accession Number:
  edsbas.66687024
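
The abstract reports its result as a macro F1 score on a binary task (complex vs. not complex). As a minimal sketch of how that metric is computed, the following pure-Python example averages the per-class F1 scores without weighting. The label values and the toy predictions are hypothetical and are not taken from the article or from the CWI Shared Task 2018 data.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 score obtained when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the class never appears in either list.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical toy labels: 1 = complex word, 0 = not complex.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

Because both classes contribute equally to the average, macro F1 penalizes a model that does well on the majority class but poorly on the minority one, which is why it is a common choice for imbalanced binary tasks.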
