30 datasets found
  1. wikipedia-22-12-hi-embeddings

    • huggingface.co
    Updated Apr 20, 2023
    Cite
    Cohere (2023). wikipedia-22-12-hi-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings
    Dataset updated
    Apr 20, 2023
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Wikipedia (hi) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (hi) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, see Cohere/wikipedia-22-12.

      Embeddings
    

    We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings.
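    Search over such precomputed embeddings can be sketched as follows. Assuming the stored vectors are L2-normalized (the toy 4-dimensional vectors below are stand-ins, not the real, much higher-dimensional Cohere embeddings), cosine-similarity ranking reduces to a dot product:

```python
import numpy as np

def semantic_search(query_emb, doc_embs, top_k=2):
    """Rank documents by cosine similarity to the query.
    Assumes all vectors are L2-normalized, so the dot
    product equals cosine similarity."""
    scores = doc_embs @ query_emb
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]

# Toy stand-ins for real precomputed document embeddings.
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])
query = np.array([0.8, 0.6, 0.0, 0.0])

print(semantic_search(query, docs))  # document 2 ranks first
```

    The same ranking applies unchanged to the dataset's real embedding column once it is loaded into a numpy array.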

  2. bfhnd

    • huggingface.co
    Updated Feb 27, 2013
    Cite
    Nixiesearch (2013). bfhnd [Dataset]. https://huggingface.co/datasets/nixiesearch/bfhnd
    Dataset updated
    Feb 27, 2013
    Dataset authored and provided by
    Nixiesearch
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)

    Description

    Big Hard Negatives Dataset

    A dataset for training embedding models for semantic search. TODO: add desc A dataset in a nixietune compatible format: { "query": ")what was the immediate impact of the success of the manhattan project?", "pos": [ "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and… See the full description on the dataset page: https://huggingface.co/datasets/nixiesearch/bfhnd.
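    For illustration, a record in the nixietune-compatible format above can be expanded into (query, positive, negative) triples for contrastive training. The "neg" field and all passage contents below are hypothetical, since the card's example is truncated:

```python
import json

# A hypothetical record in the nixietune-compatible format:
# one query plus lists of positive and hard-negative passages.
record = json.loads("""
{
  "query": "what was the immediate impact of the success of the manhattan project?",
  "pos": ["The presence of communication amid scientific minds was equally important ..."],
  "neg": ["The Manhattan is a cocktail made with whiskey and sweet vermouth."]
}
""")

# Expand into (query, positive, negative) triples.
triples = [
    (record["query"], p, n)
    for p in record["pos"]
    for n in record["neg"]
]
print(len(triples))  # 1
```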

  3. opinions-synthetic-query-8192

    • huggingface.co
    Updated Mar 5, 2025
    Cite
    Free Law Project (2025). opinions-synthetic-query-8192 [Dataset]. https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-8192
    Dataset updated
    Mar 5, 2025
    Dataset authored and provided by
    Free Law Project
    Description

    Dataset Card for Dataset Name

    This dataset is similar to Free-Law-Project/opinions-synthetic-query-512; the only difference is that the opinions are chunked to at most 7,800 tokens instead of 480, tokenized with the bert-base-cased tokenizer and a 2-sentence overlap. The token count stays just shy of the 8,192 context-window limit to allow for tokenization differences between the encoder models used in experiments. The dataset is used to finetune the semantic search model… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-8192.
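    The chunking step can be sketched as a greedy sentence packer with overlap. This is a simplification: whitespace token counts stand in for the bert-base-cased tokenizer, and max_tokens is tiny here for demonstration (the dataset uses 7,800 with a 2-sentence overlap):

```python
def chunk_sentences(sentences, max_tokens, overlap=2):
    """Greedily pack sentences into chunks of at most max_tokens tokens,
    carrying the last `overlap` sentences into the next chunk.
    Whitespace token counts stand in for a real tokenizer."""
    chunks, current = [], []
    for sent in sentences:
        candidate = current + [sent]
        if current and sum(len(s.split()) for s in candidate) > max_tokens:
            chunks.append(current)
            current = current[-overlap:] + [sent]
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sentences = ["a b c", "d e", "f g h i", "j"]
print(chunk_sentences(sentences, max_tokens=6, overlap=1))
```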

  4. PAQ_pairs

    • huggingface.co
    Updated Sep 19, 2022
    Cite
    Embedding Training Data (2022). PAQ_pairs [Dataset]. https://huggingface.co/datasets/embedding-data/PAQ_pairs
    Dataset updated
    Sep 19, 2022
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    Dataset Card for "PAQ_pairs"

      Dataset Summary
    

    Pairs of questions and answers obtained from Wikipedia. Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/PAQ_pairs.

  5. silma-arabic-triplets-dataset-v1.0

    • huggingface.co
    Updated Oct 29, 2024
    Cite
    SILMA AI (2024). silma-arabic-triplets-dataset-v1.0 [Dataset]. https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    SILMA.AI LLC
    Authors
    SILMA AI
    Description

    SILMA Arabic Triplets Dataset - v1.0

      Overview
    

    The SILMA Arabic Triplets Dataset - v1.0 is a high-quality, diverse dataset curated specifically for training embedding models for semantic search tasks in Arabic. It contains more than 2.25M records (2,280,319 records) of triplets in the form of anchor, positive, and negative samples, designed to help models learn semantic similarity and dissimilarity. The… See the full description on the dataset page: https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0.
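    A minimal sketch of how anchor/positive/negative triplets drive training, using a cosine-based triplet margin loss in numpy. The loss function and margin value are illustrative assumptions; the excerpt does not specify the actual training objective:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Cosine triplet margin loss for one (anchor, positive, negative)
    sample: pushes the positive closer to the anchor than the negative."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([0.0, 1.0])
print(triplet_margin_loss(a, p, n))  # 0.0 for this well-separated triplet
```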

  6. opinions-synthetic-query-512

    • huggingface.co
    Updated Mar 5, 2025
    Cite
    Free Law Project (2025). opinions-synthetic-query-512 [Dataset]. https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512
    Dataset updated
    Mar 5, 2025
    Dataset authored and provided by
    Free Law Project
    Description

    Dataset Card for Free-Law-Project/opinions-synthetic-query-512

    This dataset is created from the opinions-metadata, and used for training the Free Law Project Semantic Search models, including Free-Law-Project/modernbert-embed-base_finetune_512.

      Dataset Details
    

    The dataset is curated by Free Law Project by selecting the train split from the opinions-metadata dataset. It is created for finetuning encoder models for semantic search with a 512-token context window. The… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512.

  7. arxiver

    • huggingface.co
    Cite
    neuralwork, arxiver [Dataset]. https://huggingface.co/datasets/neuralwork/arxiver
    Dataset authored and provided by
    neuralwork
    License

    Attribution-NonCommercial-ShareAlike 4.0, CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/)

    Description

    Arxiver Dataset

    Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files published between January 2023 and October 2023. We hope our dataset will be useful for various applications such as semantic search, domain specific language modeling, question answering and summarization.

      Curation
    

    The Arxiver dataset is… See the full description on the dataset page: https://huggingface.co/datasets/neuralwork/arxiver.

  8. Amazon-QA

    • huggingface.co
    Updated Jul 16, 2023
    Cite
    Embedding Training Data (2023). Amazon-QA [Dataset]. https://huggingface.co/datasets/embedding-data/Amazon-QA
    Dataset updated
    Jul 16, 2023
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    Dataset Card for "Amazon-QA"

      Dataset Summary
    

    This dataset contains Question and Answer data from Amazon. Disclaimer: The team releasing Amazon-QA did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/Amazon-QA.

  9. sfia-9-chunks

    • huggingface.co
    Updated Jun 23, 2025
    Cite
    Programmer-RD-AI (2025). sfia-9-chunks [Dataset]. http://doi.org/10.57967/hf/5747
    Dataset updated
    Jun 23, 2025
    Authors
    Programmer-RD-AI
    License

    Attribution-NonCommercial 4.0, CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)

    Description

    sfia-9-chunks Dataset

      Overview
    

    The sfia-9-chunks dataset is a derived dataset from sfia-9-scraped. It uses sentence embeddings and hierarchical clustering to split each SFIA-9 document into coherent semantic chunks. This chunking facilitates more efficient downstream tasks like semantic search, question answering, and topic modeling.

      Chunking Methodology
    

    We employ the following procedure to generate chunks: from sentence_transformers import… See the full description on the dataset page: https://huggingface.co/datasets/Programmer-RD-AI/sfia-9-chunks.
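    One simple way to approximate embedding-based chunking is to start a new chunk whenever the similarity between consecutive sentence embeddings drops. This is a simplification of the hierarchical clustering the card describes, and the 2-dimensional vectors are toy stand-ins for real sentence embeddings:

```python
import numpy as np

def split_on_similarity_drop(sent_embs, threshold=0.5):
    """Group sentence indices into chunks, starting a new chunk when
    cosine similarity between consecutive embeddings falls below
    `threshold`. A stand-in for hierarchical clustering."""
    norms = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    chunks, current = [], [0]
    for i in range(1, len(norms)):
        if float(norms[i - 1] @ norms[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(i)
    chunks.append(current)
    return chunks

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(split_on_similarity_drop(embs))  # [[0, 1], [2, 3]]
```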

  10. simple-wiki

    • huggingface.co
    Cite
    Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    Dataset Card for "simple-wiki"

      Dataset Summary
    

    This dataset contains pairs of equivalent sentences obtained from Wikipedia.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains a group of equivalent sentences, formatted as a dictionary with the key "set" whose value is the list of sentences. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
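    Under this format, each "set" expands into positive sentence pairs for contrastive training. A minimal sketch with a hypothetical example record:

```python
from itertools import combinations

# A hypothetical record in the {"set": [...]} format described above.
example = {"set": [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
]}

# Every unordered pair within a set is a positive pair for
# Sentence Transformers style training.
pairs = list(combinations(example["set"], 2))
print(len(pairs))  # 1
```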

  11. quran_embeddings

    • huggingface.co
    Updated Apr 7, 2025
    Cite
    Mehedi Hasan (2025). quran_embeddings [Dataset]. https://huggingface.co/datasets/promehedi/quran_embeddings
    Dataset updated
    Apr 7, 2025
    Authors
    Mehedi Hasan
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    Quran Embeddings Dataset

    This repository contains vector embeddings for the Holy Quran, generated using OpenAI's embedding model. These embeddings can be used for semantic search, question answering, and other natural language processing tasks related to Quranic text.

      Dataset Information
    

    The dataset consists of a single JSON file:

    quran_embeddings.json: Contains embeddings for each verse (ayah) of the Quran with associated metadata

      Metadata Structure
    

    Each… See the full description on the dataset page: https://huggingface.co/datasets/promehedi/quran_embeddings.

  12. starwarsunlimited

    • huggingface.co
    Updated May 8, 2025
    Cite
    Christian Glass (2025). starwarsunlimited [Dataset]. https://huggingface.co/datasets/chanfriendly/starwarsunlimited
    Dataset updated
    May 8, 2025
    Authors
    Christian Glass
    License

    Attribution 4.0, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

    Description

    🌌 Star Wars Unlimited Card Database 🎴

      ✨ Dataset Overview
    

    This repository contains card data from the Star Wars Unlimited (SWU) trading card game in two formats:

    📊 Structured Card Database: A comprehensive SQLite database containing detailed card information including names, types, costs, abilities, aspects, and more.

    🧠 Vector Embeddings: Card data encoded as vector embeddings for semantic search and AI applications using the all-MiniLM-L6-v2 model.

    These datasets… See the full description on the dataset page: https://huggingface.co/datasets/chanfriendly/starwarsunlimited.

  13. query-expansion

    • huggingface.co
    Updated Dec 31, 2024
    Cite
    Simeon Emanuilov (2024). query-expansion [Dataset]. http://doi.org/10.57967/hf/3881
    Dataset updated
    Dec 31, 2024
    Authors
    Simeon Emanuilov
    License

    Attribution 4.0, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

    Description

    Query Expansion Dataset

    This dataset is designed to train search query expansion models that can generate multiple semantic expansions for a given query.

      Purpose
    

    The goal of this dataset is to serve as input for training small language models (0.5B to 3B parameters) to act as query expander models in various search systems, including but not limited to Retrieval-Augmented Generation (RAG) systems. Query expansion is a technique used to enhance search results by generating… See the full description on the dataset page: https://huggingface.co/datasets/s-emanuilov/query-expansion.

  14. openrelay-dataset

    • huggingface.co
    Updated Jun 8, 2025
    Cite
    openrelay (2025). openrelay-dataset [Dataset]. https://huggingface.co/datasets/openrelay/openrelay-dataset
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    openrelay
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)

    Description

    OpenRelay Dataset

    The OpenRelay Dataset is a collection of curated articles, tool reviews, user comments, and productivity-related content sourced from the OpenRelay platform. It’s designed to support training and evaluation of machine learning models for tasks such as text classification, summarization, semantic search, and question answering in the context of tech and productivity tools.

      Dataset Structure
    

    Each entry in the dataset may include fields like:

    title:… See the full description on the dataset page: https://huggingface.co/datasets/openrelay/openrelay-dataset.

  15. tech_product_search_intent

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    Zarif Muhtasim Showgat (2025). tech_product_search_intent [Dataset]. https://huggingface.co/datasets/roundspecs/tech_product_search_intent
    Dataset updated
    Apr 18, 2025
    Authors
    Zarif Muhtasim Showgat
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    This dataset contains semantic search queries and keyword-based search queries tailored for a tech e-commerce application. It is designed to help train models for search intent classification, semantic search, or query understanding.

      🧠 Intent Types
    

    Semantic Queries: Natural language queries that express user intent, e.g.,

    "best laptop for online classes"
    "camera with good night mode under 30000 Taka"

    These were generated using DeepSeek with the following prompt:

    Generate a… See the full description on the dataset page: https://huggingface.co/datasets/roundspecs/tech_product_search_intent.

  16. RedPajama-Data-Instruct

    • huggingface.co
    Updated Oct 15, 2004
    Cite
    Together (2004). RedPajama-Data-Instruct [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct
    Dataset updated
    Oct 15, 2004
    Dataset authored and provided by
    Together
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)

    Description

    Dataset Summary

    RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM in two steps: (1) we first conduct semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct dataset, and flag tasks where any returned instance overlaps (by 10-gram) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
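    The 10-gram overlap check in step (1) can be sketched directly; n=3 is used in the demo only so the toy strings are long enough to overlap:

```python
def ngrams(tokens, n=10):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(candidate, validation_example, n=10):
    """True if the training candidate shares any n-gram with the
    held-out validation example -- the overlap test described above."""
    return bool(ngrams(candidate.split(), n) & ngrams(validation_example.split(), n))

validation = "the quick brown fox jumps over the lazy dog"
leaked = "we note the quick brown fox appears here"
clean = "an entirely different sentence about embeddings"
print(overlaps(leaked, validation, n=3), overlaps(clean, validation, n=3))  # True False
```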

  17. simple_english_wikipedia

    • huggingface.co
    Updated Feb 9, 2024
    Cite
    Bowen Li (2024). simple_english_wikipedia [Dataset]. https://huggingface.co/datasets/aisuko/simple_english_wikipedia
    Dataset updated
    Feb 9, 2024
    Authors
    Bowen Li
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    For research use only. The original data comes from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We used the nq_distilbert-base-v1 model to encode all the data into PyTorch tensors, then normalized the embeddings with sentence_transformers.util.normalize_embeddings.

      How to use
    

    See notebook Wikipedia Q&A Retrieval-Semantic Search

      Installing the package
    

    !pip install sentence-transformers==2.3.1

      The converting process
    

    the whole process takes… See the full description on the dataset page: https://huggingface.co/datasets/aisuko/simple_english_wikipedia.
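    The normalization step mirrors what sentence_transformers.util.normalize_embeddings does for tensors; a numpy sketch with toy vectors:

```python
import numpy as np

def normalize_embeddings(embs):
    """L2-normalize each row, so dot products between rows
    become cosine similarities."""
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
normed = normalize_embeddings(vecs)
print(normed[0])  # [0.6 0.8]
```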

  18. coco-clip-vit-l-14

    • huggingface.co
    Updated Nov 30, 2023
    Cite
    Simeon Emanuilov (2023). coco-clip-vit-l-14 [Dataset]. http://doi.org/10.57967/hf/3225
    Dataset updated
    Nov 30, 2023
    Authors
    Simeon Emanuilov
    License

    Attribution 4.0, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

    Description

    COCO Dataset Processed with CLIP ViT-L/14

      Overview
    

    This dataset is a processed version of the '2017 Unlabeled images' subset of the COCO dataset, generated with OpenAI's CLIP ViT-L/14 model. The original subset comprises 123K images, approximately 19GB in size, each processed into a 768-dimensional vector. These vectors can be used for applications such as semantic search systems, image similarity assessments, and more.… See the full description on the dataset page: https://huggingface.co/datasets/s-emanuilov/coco-clip-vit-l-14.

  19. LLM-generated-emoji-descriptions

    • huggingface.co
    Updated Jun 30, 2024
    Cite
    Badr Alabsi (2024). LLM-generated-emoji-descriptions [Dataset]. https://huggingface.co/datasets/badrex/LLM-generated-emoji-descriptions
    Dataset updated
    Jun 30, 2024
    Authors
    Badr Alabsi
    License

    https://choosealicense.com/licenses/cc/

    Description

    Emoji Metadata Dataset

      Overview
    

    The LLM Emoji Dataset is a comprehensive collection of enriched semantic descriptions for emojis, generated using Meta AI's Llama-3-8B model. It aims to provide semantic context for each emoji, enhancing their usability in NLP applications, especially those requiring semantic search. The dataset was used to build a multilingual search engine for emojis, which you can interact with using this online Streamlit… See the full description on the dataset page: https://huggingface.co/datasets/badrex/LLM-generated-emoji-descriptions.

  20. ori_dataset

    • huggingface.co
    Cite
    Prudhvi, ori_dataset [Dataset]. https://huggingface.co/datasets/prudhvi-oxyz/ori_dataset
    Authors
    Prudhvi
    Description

    ORI Dataset

    Description: This dataset contains prompts and their embeddings from multiple benchmarks:

    MMLU-Pro
    GPQA
    HLE
    LiveCodeBench
    SciCode
    HumanEval
    Math-500
    AIME

    Embedding Model: SentenceTransformer bge-small-en-v1.5
    Embedding Dimension: 512
    Creation Date: 2025-05-30
    Other Notes: Useful for tasks like semantic search, retrieval-augmented generation, and similarity.
