37 datasets found
  1. stsb_multi_mt

    • huggingface.co
    Updated Apr 2, 2024
    Cite
    Philip May (2024). stsb_multi_mt [Dataset]. https://huggingface.co/datasets/PhilipMay/stsb_multi_mt
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 2, 2024
    Authors
    Philip May
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for STSb Multi MT

      Dataset Summary
    

    STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets includes text from image captions, news headlines and user forums. (source)

    These are different multilingual translations and the English original of the STSbenchmark dataset. Translation was done with deepl.com. It can be used to train sentence embeddings… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/stsb_multi_mt.

  2. Data from: GiCCS: A German in-Context Conversational Similarity Benchmark

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 31, 2022
    Cite
    Asaadi, Shima (2022). GiCCS: A German in-Context Conversational Similarity Benchmark [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7266220
    Explore at:
    Dataset updated
    Oct 31, 2022
    Dataset provided by
    Kolagar, Zahra
    Asaadi, Shima
    Zarcone, Alessandra
    Liebel, Alina
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Description

    We introduce GiCCS, a first conversational STS evaluation benchmark for German. We collected the similarity annotations for GiCCS using best-worst scaling and presenting the target items in context, in order to obtain highly-reliable context-dependent similarity scores. In our paper, we present benchmarking experiments for evaluating LMs on capturing the similarity of utterances. Results suggest that pretraining LMs on conversational data and providing conversational context can be useful for capturing similarity of utterances in dialogues. GiCCS will be publicly available to encourage benchmarking of conversational LMs.
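    Best-worst scaling, mentioned above, turns repeated "pick the most similar / least similar" judgments over small tuples of items into a scalar score per item; a common counting formula is score = (#times best − #times worst) / #appearances. Below is a minimal pure-Python sketch of that counting scheme; the function name and toy judgment data are illustrative, not taken from GiCCS:

```python
from collections import Counter

def bws_scores(judgments):
    """Best-worst scaling: for each item,
    score = (times chosen best - times chosen worst) / times shown."""
    best, worst, appearances = Counter(), Counter(), Counter()
    for tuple_items, best_item, worst_item in judgments:
        for item in tuple_items:
            appearances[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / appearances[item]
            for item in appearances}

# Hypothetical annotation: each record is (items shown, chosen best, chosen worst).
judgments = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
    (("a", "b", "c", "d"), "b", "d"),
]
scores = bws_scores(judgments)
# "a" was best 2 of 3 times -> 2/3; "d" was worst 2 of 3 times -> -2/3
```

    Presenting items in context, as GiCCS does, changes what annotators see but not this aggregation step.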

  3. STS Benchmark

    • opendatalab.com
    • huggingface.co
    zip
    Updated Sep 11, 2022
    Cite
    Google Research (2022). STS Benchmark [Dataset]. https://opendatalab.com/OpenDataLab/STS_Benchmark
    Explore at:
    Available download formats: zip (2640708 bytes)
    Dataset updated
    Sep 11, 2022
    Dataset provided by
    George Washington University
    University of the Basque Country
    Google Research
    University of Sheffield
    Description

    STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets includes text from image captions, news headlines and user forums.
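    Systems are conventionally evaluated on STS Benchmark by correlating predicted similarity scores with the gold 0–5 annotations, typically via Pearson or Spearman correlation. A minimal pure-Python Spearman sketch over hypothetical model outputs (helper names and the sample scores are illustrative):

```python
def _ranks(values):
    # Average ranks (1-based), so tied values share the same rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, gold):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rp, rg = _ranks(pred), _ranks(gold)
    n = len(rp)
    mp, mg = sum(rp) / n, sum(rg) / n
    cov = sum((a - mp) * (b - mg) for a, b in zip(rp, rg))
    vp = sum((a - mp) ** 2 for a in rp) ** 0.5
    vg = sum((b - mg) ** 2 for b in rg) ** 0.5
    return cov / (vp * vg)

# Hypothetical model scores vs. gold 0-5 annotations (same ordering -> rho = 1.0):
pred = [0.9, 0.1, 0.5, 0.7]
gold = [4.8, 0.5, 2.0, 3.9]
rho = spearman(pred, gold)
```

    In practice, libraries such as scipy provide equivalent Pearson/Spearman implementations; the sketch only shows what is being measured.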

  4. ro_sts

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Dumitrescu Stefan (2025). ro_sts [Dataset]. https://huggingface.co/datasets/dumitrescustefan/ro_sts
    Explore at:
    Dataset updated
    Mar 26, 2025
    Authors
    Dumitrescu Stefan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The RO-STS (Romanian Semantic Textual Similarity) dataset contains 8628 pairs of sentences with their similarity score. It is a high-quality translation of the STS benchmark dataset.

  5. stsb

    • huggingface.co
    Updated Apr 25, 2024
    Cite
    Sentence Transformers (2024). stsb [Dataset]. https://huggingface.co/datasets/sentence-transformers/stsb
    Explore at:
    Dataset updated
    Apr 25, 2024
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for STSB

    The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5. However, for this variant, the similarity scores are normalized to between 0 and 1.

      Dataset Details
    

    Columns: "sentence1", "sentence2", "score"
    Column types: str, str, float
    Examples: { 'sentence1': 'A… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/stsb.
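    The normalization described above amounts to a min-max rescaling of the raw annotation into [0, 1]. A small sketch in the (sentence1, sentence2, score) layout; the exact bounds used by this variant are an assumption here (the raw STS Benchmark scale is usually given as 0–5), and the sample row is illustrative:

```python
def normalize_score(raw, lo=0.0, hi=5.0):
    """Min-max rescale a raw similarity annotation into [0, 1]."""
    return (raw - lo) / (hi - lo)

# Hypothetical row in the column layout described above:
row = {
    "sentence1": "A plane is taking off.",
    "sentence2": "An air plane is taking off.",
    "score": normalize_score(5.0),  # raw 5.0 -> 1.0
}
```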

  6. sts-ca

    • huggingface.co
    Cite
    Projecte Aina, sts-ca [Dataset]. https://huggingface.co/datasets/projecte-aina/sts-ca
    Explore at:
    Dataset authored and provided by
    Projecte Aina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for STS-ca

      Dataset Summary
    

    The STS-ca corpus is a benchmark for evaluating Semantic Textual Similarity in Catalan. The dataset was developed by BSC TeMU as part of Projecte AINA to enrich the Catalan Language Understanding Benchmark (CLUB). This work is licensed under an Attribution-ShareAlike 4.0 International License.

      Supported Tasks and Leaderboards
    

    This dataset can be used to build and score semantic similarity models in Catalan.… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/sts-ca.

  7. Data from: Semantic Textual Similarity in Catalan

    • live.european-language-grid.eu
    • observatorio-cientifico.ua.es
    • +2 more
    tsv
    Updated Oct 3, 2022
    Cite
    (2022). Semantic Textual Similarity in Catalan [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7869
    Explore at:
    Available download formats: tsv
    Dataset updated
    Oct 3, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The STS corpus is a benchmark for evaluating Semantic Textual Similarity in Catalan. It consists of 3079 sentence pairs annotated with the semantic similarity between them, on a scale from 0 (no similarity at all) to 5 (semantic equivalence). Annotation was done manually by 4 different people following guidelines based on previous work from the SemEval challenges (https://www.aclweb.org/anthology/S13-1004.pdf). This dataset was developed by BSC TeMU as part of the AINA project.

  8. sickr-sts

    • huggingface.co
    Updated Apr 27, 2022
    Cite
    Massive Text Embedding Benchmark (2022). sickr-sts [Dataset]. https://huggingface.co/datasets/mteb/sickr-sts
    Explore at:
    Dataset updated
    Apr 27, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    SICK-R, an MTEB dataset (Massive Text Embedding Benchmark)

    Semantic Textual Similarity SICK-R dataset

    Task category t2t

    Domains Web, Written

    Reference https://aclanthology.org/L14-1314/

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code:

    import mteb

    task = mteb.get_tasks(["SICK-R"])
    evaluator = mteb.MTEB(task)
    model = mteb.get_model(YOUR_MODEL)
    evaluator.run(model)

    To learn more about how to run models… See the full description on the dataset page: https://huggingface.co/datasets/mteb/sickr-sts.

  9. sts17-crosslingual-sts

    • huggingface.co
    Updated Jun 29, 2022
    Cite
    Massive Text Embedding Benchmark (2022). sts17-crosslingual-sts [Dataset]. https://huggingface.co/datasets/mteb/sts17-crosslingual-sts
    Explore at:
    Dataset updated
    Jun 29, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/unknown/

    Description

    STS17, an MTEB dataset (Massive Text Embedding Benchmark)

    Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation

    Task category t2t

    Domains News, Web, Written

    Reference https://alt.qcri.org/semeval2017/task1/

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code:

    import mteb

    task = mteb.get_tasks(["STS17"])
    evaluator = mteb.MTEB(task)
    model =…

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/sts17-crosslingual-sts.

  10. Sentence Similarity Nepali Dataset

    • kaggle.com
    Updated Jun 7, 2024
    Cite
    yubraj11 (2024). Sentence Similarity Nepali Dataset [Dataset]. https://www.kaggle.com/datasets/yubraj11/sentence-similarity-nepali-dataset
    Explore at:
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    yubraj11
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is a Nepali version of the Sentence Textual Similarity Benchmark (STS-B) derived from the STS-B Multi-MT corpus. It consists of sentence pairs annotated with similarity scores, indicating how semantically similar the two sentences are. The dataset serves as a valuable resource for developing and evaluating natural language processing (NLP) models focused on understanding and measuring sentence similarity in Nepali. Each sentence pair is assigned a similarity score ranging from 0 to 5, where 0 indicates no similarity and 5 indicates complete semantic equivalence. This dataset is crucial for various NLP applications, including machine translation, paraphrase detection, and semantic search, enabling the advancement of language technologies in the Nepali language.

  11. glue

    • tensorflow.org
    • tensorflow.google.cn
    • +1 more
    Updated Dec 6, 2022
    Cite
    (2022). glue [Dataset]. https://www.tensorflow.org/datasets/catalog/glue
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('glue', split='train')
    for ex in ds.take(4):
        print(ex)


    See the guide for more information on tensorflow_datasets.

  12. ST-Align-Benchmark

    • huggingface.co
    Updated Jul 5, 2025
    + more versions
    Cite
    Hongyu Li (2025). ST-Align-Benchmark [Dataset]. https://huggingface.co/datasets/appletea2333/ST-Align-Benchmark
    Explore at:
    Dataset updated
    Jul 5, 2025
    Authors
    Hongyu Li
    Description

    The appletea2333/ST-Align-Benchmark dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  13. PyKEEN Benchmarking Experiment Model Files

    • zenodo.org
    bin, zip
    Updated Aug 24, 2022
    Cite
    Mehdi Ali; Max Berrendorf; Charles Tapley Hoyt; Laurent Vermue; Galkin Mikhail (2022). PyKEEN Benchmarking Experiment Model Files [Dataset]. http://doi.org/10.5281/zenodo.7018979
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Aug 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mehdi Ali; Max Berrendorf; Charles Tapley Hoyt; Laurent Vermue; Galkin Mikhail
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Model Weights

    This repository provides weights of the models from the benchmarking study conducted in "Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework" (https://arxiv.org/abs/2006.13365), which have been upgraded to be compatible with PyKEEN 1.9.

    The weights are organized as zipfiles, named by the dataset-interaction function configuration. For each combination, we chose the best model according to validation Hits@10 to include in this repository. For each model, there are three files:

    1. configuration.json contains the (pipeline) configuration used to train the model. It can be loaded as
    import pathlib
    import json
    configuration = json.loads(pathlib.Path("configuration.json").read_text())

    Since the configuration is intended for the pipeline, we need some custom code to re-create the model without re-training it.

    from pykeen.datasets import get_dataset
    from pykeen.models import ERModel, model_resolver
    
    configuration = configuration["pipeline"]
    # load the triples factory
    dataset = get_dataset(
      dataset=configuration["dataset"], dataset_kwargs=configuration.get("dataset_kwargs", None)
    )
    model: ERModel = model_resolver.make(
      configuration["model"], configuration["model_kwargs"], triples_factory=dataset.training
    )

    Note that this only creates the model instance; it does not load the weights yet.

    2. state_dict.pt contains the weights, stored via torch.save. They can be loaded via
    import torch
    state_dict = torch.load("state_dict.pt")

    We can load these weights into the model by using Module.load_state_dict

    model.load_state_dict(state_dict, strict=False)

    Note that we set strict=False, since the exported weights do not contain regularizers' state, while the re-instantiated models may have regularizers.

    3. results.json contains the results obtained by the original runs. It can be read by
    import pathlib
    import json
    results = json.loads(pathlib.Path("results.json").read_text())

    Note that some of the recently added metrics are not available in those results.

  14. SweParaphrase

    • researchdata.se
    Updated Jan 1, 2024
    Cite
    Språkbanken Text (2024). SweParaphrase [Dataset]. http://doi.org/10.23695/6T6H-SS96
    Explore at:
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    University of Gothenburg
    Authors
    Språkbanken Text
    Description

    A subset of the Semantic Textual Similarity reference data (STS Benchmark).

  15. Survey Vertical Benchmark

    • data-floridaswater.opendata.arcgis.com
    • mapdirect-fdep.opendata.arcgis.com
    • +1 more
    Updated Mar 24, 2025
    Cite
    SJRWMDOpenData (2025). Survey Vertical Benchmark [Dataset]. https://data-floridaswater.opendata.arcgis.com/items/fc7738702e0b4fdba93a2f01a141d1af
    Explore at:
    Dataset updated
    Mar 24, 2025
    Dataset authored and provided by
    SJRWMDOpenData
    Description

    This is a dataset of vertical benchmarks collected by the St. Johns River Water Management District (SJRWMD). It is maintained by Survey staff, and this layer should only be updated with their direct permission.

  16. ro_sts_parallel

    • huggingface.co
    Updated Mar 6, 2021
    Cite
    Dumitrescu Stefan (2021). ro_sts_parallel [Dataset]. https://huggingface.co/datasets/dumitrescustefan/ro_sts_parallel
    Explore at:
    Dataset updated
    Mar 6, 2021
    Authors
    Dumitrescu Stefan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The RO-STS-Parallel dataset (a parallel Romanian-English corpus, a translation of the Semantic Textual Similarity benchmark) contains 17256 sentences in Romanian and English. It is a high-quality translation of the English STS benchmark dataset into Romanian.

  17. SweParaphrase 2.0

    • researchdata.se
    Updated Jan 1, 2024
    Cite
    Dannélls, Dana (2024). SweParaphrase 2.0 [Dataset]. http://doi.org/10.23695/HXHX-1167
    Explore at:
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    University of Gothenburg
    Authors
    Dannélls, Dana
    Description

    Semantic Textual Similarity reference data (STS Benchmark).

  18. biosses-sts

    • huggingface.co
    Updated Apr 29, 2022
    Cite
    Massive Text Embedding Benchmark (2022). biosses-sts [Dataset]. https://huggingface.co/datasets/mteb/biosses-sts
    Explore at:
    Dataset updated
    Apr 29, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BIOSSES, an MTEB dataset (Massive Text Embedding Benchmark)

    Biomedical Semantic Similarity Estimation.

    Task category t2t

    Domains Medical

    Reference https://tabilab.cmpe.boun.edu.tr/BIOSSES/DataSet.html

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code:

    import mteb

    task = mteb.get_tasks(["BIOSSES"])
    evaluator = mteb.MTEB(task)
    model = mteb.get_model(YOUR_MODEL)
    evaluator.run(model)

    To learn more… See the full description on the dataset page: https://huggingface.co/datasets/mteb/biosses-sts.

  19. farsick-sts

    • huggingface.co
    Updated Oct 31, 2024
    Cite
    MCINext (2024). farsick-sts [Dataset]. https://huggingface.co/datasets/MCINext/farsick-sts
    Explore at:
    Dataset updated
    Oct 31, 2024
    Dataset authored and provided by
    MCINext
    Description

    Dataset Summary

    FarSick STS is a Persian (Farsi) dataset designed for the Semantic Textual Similarity (STS) task. It is a part of the FaMTEB (Farsi Massive Text Embedding Benchmark). The dataset was developed by translating and adapting the English SICK (Sentences Involving Compositional Knowledge) dataset, and it features Persian sentence pairs annotated for their degree of semantic relatedness.

    Language(s): Persian (Farsi)
    Task(s): Semantic Textual Similarity (STS)
    Source:… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/farsick-sts.

  20. Control results for the four benchmark systems.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Johannes Günther; Elias Reichensdörfer; Patrick M. Pilarski; Klaus Diepold (2023). Control results for the four benchmark systems. [Dataset]. http://doi.org/10.1371/journal.pone.0243320.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Johannes Günther; Elias Reichensdörfer; Patrick M. Pilarski; Klaus Diepold
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Control results for the four benchmark systems.
