
    Data from: Slovenian datasets for contextual synonym and antonym detection

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Oct 25, 2022
    Cite
    (2022). Slovenian datasets for contextual synonym and antonym detection [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/20526
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Slovenian datasets for contextual synonym and antonym detection can be used to train machine learning classifiers, as described in Jasmina Pegan's MSc thesis "Semantic detection of synonyms and antonyms with contextual embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=141456). The datasets contain pairs of synonyms and antonyms in context, together with additional information on each sense pair. Candidates for synonyms and antonyms were retrieved from the dataset created in Jasmina Pegan's BSc thesis "Antonym detection with word embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=110533). Example sentences were retrieved from the comprehensive Slovenian-Hungarian dictionary (VSMS) (https://www.clarin.si/repository/xmlui/handle/11356/1453). Each dataset is class-balanced: it contains an equal number of examples and counterexamples. An example is a pair of example sentences in which the two words are synonyms/antonyms. A counterexample is a pair of example sentences in which the two words are not synonyms/antonyms. Note that the words in a counterexample may still be synonymous or antonymous in some other sense, just not in the given context.

    The datasets are divided into two categories: datasets for synonyms and datasets for antonyms. Each category is further divided into base and updated datasets, each consisting of three files: train, validation, and test. Base datasets include only manually reviewed sense pairs; they are generated from all pairs of VSMS sense examples for all confirmed pairs of antonym and synonym senses. Updated datasets include automatically generated sense pairs, with a cap on the maximum number of examples per word. As a result, the updated datasets are more balanced word-wise, but they are not fully manually reviewed and contain less accurate data.
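
The per-word cap used for the updated datasets can be illustrated with a minimal sketch. This is not the thesis's actual procedure: the function name `cap_examples_per_word`, the choice to count both words of a pair against the cap, the cap value, and the toy word pairs are all assumptions made for illustration.

```python
from collections import defaultdict


def cap_examples_per_word(pairs, max_per_word):
    """Keep pairs in order, skipping a pair once either word has hit the cap.

    Assumption: both the base word and the candidate count toward the cap.
    """
    counts = defaultdict(int)
    kept = []
    for base, candidate in pairs:
        if counts[base] >= max_per_word or counts[candidate] >= max_per_word:
            continue  # one of the two words is already represented often enough
        counts[base] += 1
        counts[candidate] += 1
        kept.append((base, candidate))
    return kept


# Toy word pairs, invented for this sketch (not taken from the dataset).
pairs = [
    ("velik", "majhen"),
    ("velik", "droben"),
    ("velik", "mali"),
    ("topel", "hladen"),
]
capped = cap_examples_per_word(pairs, max_per_word=2)
print(capped)
```

With a cap of 2, the third pair is dropped because "velik" already appears twice, which is the kind of word-wise balancing the description refers to.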

    A single dataset entry contains information on the base word, followed by data on the synonym/antonym candidate. The last column indicates whether the sense pair is a pair of synonyms/antonyms. More details can be found in the included README file.
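
A minimal sketch of reading such entries and checking the class balance follows. The actual file layout is documented only in the README, so everything here is an assumption: the tab-separated format, the absence of a header row, the column order, the placeholder context strings, and the `"1"`/`"0"` label values are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical layout (NOT confirmed by the source): tab-separated rows with
# base word, base context, candidate, candidate context, and the label last.
SAMPLE = (
    "velik\tcontext sentence A\tmajhen\tcontext sentence B\t1\n"
    "velik\tcontext sentence C\tmajhen\tcontext sentence D\t0\n"
)


def read_entries(handle):
    """Split each row into (fields, label), taking the label from the last column."""
    entries = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        *fields, label = row
        entries.append((fields, label))
    return entries


def label_counts(entries):
    """Count how many entries carry each label value."""
    return Counter(label for _, label in entries)


entries = read_entries(io.StringIO(SAMPLE))
counts = label_counts(entries)
# Balanced means both classes are present with the same number of entries.
is_balanced = len(counts) == 2 and len(set(counts.values())) == 1
print(counts, is_balanced)
```

To run this on a real file, replace the `io.StringIO(SAMPLE)` handle with an opened dataset file and adjust the delimiter and column handling to match the README.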

