Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (hi) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (hi) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings.
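A minimal sketch of semantic search over these precomputed embeddings, assuming the dataset exposes an "emb" field and using the legacy cohere Python client (pre-5.0) for co.embed with model="multilingual-22-12"; the field name, sample size, and query are illustrative.

import cohere
import torch
from datasets import load_dataset

co = cohere.Client("<YOUR_API_KEY>")  # replace with your Cohere API key

# Stream a small sample of documents together with their precomputed embeddings.
docs = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train", streaming=True)
texts, embeddings = [], []
for doc in docs:
    texts.append(doc["title"] + " " + doc["text"])
    embeddings.append(doc["emb"])  # embedding field name is an assumption
    if len(texts) >= 1000:
        break
doc_emb = torch.tensor(embeddings)

# Embed the query with the same model and rank documents by dot product.
query_emb = torch.tensor(
    co.embed(texts=["भारत की राजधानी"], model="multilingual-22-12").embeddings
)
scores = (query_emb @ doc_emb.T).squeeze(0)
for idx in scores.topk(3).indices:
    print(texts[idx][:120])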
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Big Hard Negatives Dataset
A dataset for training embedding models for semantic search, provided in a nixietune-compatible format: { "query": ")what was the immediate impact of the success of the manhattan project?", "pos": [ "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and… See the full description on the dataset page: https://huggingface.co/datasets/nixiesearch/bfhnd.
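For illustration only, a record in this format might look like the sketch below; the "neg" key is assumed by analogy with "pos" (the excerpt above is cut off before any negatives appear), and the passage texts are placeholders.

record = {
    "query": ")what was the immediate impact of the success of the manhattan project?",
    "pos": ["<a passage that answers the query>"],
    "neg": ["<a hard-negative passage that looks relevant but does not answer it>"],
}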
Dataset Card for Dataset Name
This dataset is similar to Free-Law-Project/opinions-synthetic-query-512; the only difference is that the opinions are chunked to at most 7800 tokens instead of 480, tokenized using the bert-base-cased tokenizer with a 2-sentence overlap. The token count is kept just shy of the 8192 context-window limit to account for tokenization variation between the different encoder models used in the experiments. The dataset is used to finetune the semantic search model… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-8192.
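A rough sketch of this kind of token-budget chunking with a two-sentence overlap follows; it is an approximation under stated assumptions (NLTK sentence splitting, greedy packing), not the Free Law Project's actual pipeline.

from nltk.tokenize import sent_tokenize  # assumes the NLTK "punkt" data is installed
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
MAX_TOKENS = 7800
OVERLAP_SENTENCES = 2

def chunk_opinion(text: str) -> list[str]:
    """Greedily pack sentences into chunks of at most MAX_TOKENS tokens,
    carrying the last two sentences into the next chunk as overlap."""
    sentences = sent_tokenize(text)
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(tokenizer.encode(candidate)) > MAX_TOKENS:
            chunks.append(" ".join(current))
            current = current[-OVERLAP_SENTENCES:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks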
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "PAQ_pairs"
Dataset Summary
Pairs of questions and answers obtained from Wikipedia. Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/PAQ_pairs.
SILMA Arabic Triplets Dataset - v1.0
Overview
The SILMA Arabic Triplets Dataset - v1.0 is a high-quality, diverse dataset specifically curated for training embedding models for semantic search tasks in the Arabic language. The dataset contains more than 2.25M records (2,280,319 records). It includes triplets in the form of anchor, positive, and negative samples, designed to help models learn semantic similarity and dissimilarity. The… See the full description on the dataset page: https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0.
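A minimal sketch of finetuning a sentence-transformers model on such triplets; the base model and the assumption that the columns are ordered anchor/positive/negative are illustrative and not taken from the dataset card.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model choice is an assumption; any multilingual encoder could be used.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
train_dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")

# With (anchor, positive, negative) columns, MultipleNegativesRankingLoss uses
# in-batch negatives in addition to the provided hard negative.
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()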
Dataset Card for Free-Law-Project/opinions-synthetic-query-512
This dataset is created from the opinions-metadata dataset and is used for training the Free Law Project semantic search models, including Free-Law-Project/modernbert-embed-base_finetune_512.
Dataset Details
The dataset is curated by Free Law Project by selecting the train split from the opinions-metadata dataset. It is created for finetuning encoder models for semantic search with a 512-token context window. The… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Arxiver Dataset
Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs, and the corresponding markdown files for papers published between January 2023 and October 2023. We hope our dataset will be useful for various applications such as semantic search, domain-specific language modeling, question answering, and summarization.
Curation
The Arxiver dataset is… See the full description on the dataset page: https://huggingface.co/datasets/neuralwork/arxiver.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "Amazon-QA"
Dataset Summary
This dataset contains Question and Answer data from Amazon. Disclaimer: The team releasing Amazon-QA did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/Amazon-QA.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
sfia-9-chunks Dataset
Overview
The sfia-9-chunks dataset is a derived dataset from sfia-9-scraped. It uses sentence embeddings and hierarchical clustering to split each SFIA-9 document into coherent semantic chunks. This chunking facilitates more efficient downstream tasks like semantic search, question answering, and topic modeling.
Chunking Methodology
We employ the following procedure to generate chunks: from sentence_transformers import… See the full description on the dataset page: https://huggingface.co/datasets/Programmer-RD-AI/sfia-9-chunks.
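Since the card's code excerpt is cut off above, here is a hedged reconstruction of the general approach it describes (embed sentences, then group them with hierarchical clustering); the model choice and distance threshold are assumptions, and this is not the repository's exact script.

from nltk.tokenize import sent_tokenize  # assumes the NLTK "punkt" data is installed
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

def semantic_chunks(document: str, distance_threshold: float = 1.0) -> list[str]:
    sentences = sent_tokenize(document)
    if len(sentences) < 2:
        return sentences
    embeddings = model.encode(sentences)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(embeddings)
    # Group sentences by cluster label, preserving their document order.
    groups: dict[int, list[str]] = {}
    for sentence, label in zip(sentences, labels):
        groups.setdefault(label, []).append(sentence)
    return [" ".join(group) for group in groups.values()]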
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" whose value is the list of sentences. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Quran Embeddings Dataset
This repository contains vector embeddings for the Holy Quran, generated using OpenAI's embedding model. These embeddings can be used for semantic search, question answering, and other natural language processing tasks related to Quranic text.
Dataset Information
The dataset consists of a single JSON file:
quran_embeddings.json: Contains embeddings for each verse (ayah) of the Quran with associated metadata
Metadata Structure
Each… See the full description on the dataset page: https://huggingface.co/datasets/promehedi/quran_embeddings.
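A minimal sketch of cosine-similarity search over this file; the field names ("embedding", "text") are hypothetical, since the metadata structure is only summarized above, and the query embedding must come from the same OpenAI embedding model.

import json
import numpy as np

with open("quran_embeddings.json", encoding="utf-8") as f:
    verses = json.load(f)  # assumed to be a list of per-ayah records

# Field name "embedding" is hypothetical; adjust to the actual key.
matrix = np.array([v["embedding"] for v in verses], dtype=np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

def search(query_embedding: np.ndarray, top_k: int = 5):
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = matrix @ q
    top = np.argsort(-scores)[:top_k]
    return [(verses[i].get("text"), float(scores[i])) for i in top]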
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🌌 Star Wars Unlimited Card Database 🎴
✨ Dataset Overview
This repository contains card data from the Star Wars Unlimited (SWU) trading card game in two formats:
📊 Structured Card Database: A comprehensive SQLite database containing detailed card information including names, types, costs, abilities, aspects, and more.
🧠 Vector Embeddings: Card data encoded as vector embeddings for semantic search and AI applications using the all-MiniLM-L6-v2 model.
These datasets… See the full description on the dataset page: https://huggingface.co/datasets/chanfriendly/starwarsunlimited.
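As a usage illustration for the embeddings described above, the sketch below queries card text with the same all-MiniLM-L6-v2 encoder; the card texts are placeholders rather than the repository's actual schema.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder card texts; in practice these would come from the SQLite database.
cards = ["<card 1 name and rules text>", "<card 2 name and rules text>"]
card_emb = model.encode(cards, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(
    "cheap unit that deals damage when played",
    convert_to_tensor=True,
    normalize_embeddings=True,
)
for hit in util.semantic_search(query_emb, card_emb, top_k=2)[0]:
    print(cards[hit["corpus_id"]], round(hit["score"], 3))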
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Query Expansion Dataset
This dataset is designed to train search query expansion models that can generate multiple semantic expansions for a given query.
Purpose
The goal of this dataset is to serve as input for training small language models (0.5B to 3B parameters) to act as query expander models in various search systems, including but not limited to Retrieval-Augmented Generation (RAG) systems. Query expansion is a technique used to enhance search results by generating… See the full description on the dataset page: https://huggingface.co/datasets/s-emanuilov/query-expansion.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenRelay Dataset
The OpenRelay Dataset is a collection of curated articles, tool reviews, user comments, and productivity-related content sourced from the OpenRelay platform. It’s designed to support training and evaluation of machine learning models for tasks such as text classification, summarization, semantic search, and question answering in the context of tech and productivity tools.
Dataset Structure
Each entry in the dataset may include fields like:
title:… See the full description on the dataset page: https://huggingface.co/datasets/openrelay/openrelay-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains semantic search queries and keyword-based search queries tailored for a tech e-commerce application. It is designed to help train models for search intent classification, semantic search, or query understanding.
🧠 Intent Types
Semantic Queries: Natural language queries that express user intent, e.g.,
"best laptop for online classes"
"camera with good night mode under 30000 Taka"
These were generated using DeepSeek with the following prompt:
Generate a… See the full description on the dataset page: https://huggingface.co/datasets/roundspecs/tech_product_search_intent.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM carried out in two steps: (1) We first conduct a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct dataset, and flag tasks where any returned instance overlaps (using 10-grams) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
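A minimal sketch of the 10-gram overlap test mentioned in step (1); the semantic-search retrieval itself is omitted, and the tokenization (whitespace split, lowercased) is a simplifying assumption rather than the exact RedPajama pipeline.

def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    # Whitespace tokenization and lowercasing are simplifying assumptions.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(helm_validation_example: str, instruct_instance: str, n: int = 10) -> bool:
    # Flag the pair if the two texts share any 10-gram.
    return bool(ngrams(helm_validation_example, n) & ngrams(instruct_instance, n))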
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
For research use only. The original data comes from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We use the nq_distilbert-base-v1 model to encode all the data into PyTorch tensors and normalize the embeddings using sentence_transformers.util.normalize_embeddings.
How to use
See the notebook Wikipedia Q&A Retrieval - Semantic Search
Installing the package
!pip install sentence-transformers==2.3.1
The conversion process
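Since the description of the conversion is cut off here, the sketch below shows one way it could look (encode [title, paragraph] pairs with nq-distilbert-base-v1, then normalize); the JSONL field names ("title", "paragraphs") follow the sbert.net semantic-search example and are assumptions.

import gzip
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nq-distilbert-base-v1")

passages = []
with gzip.open("simplewiki-2020-11-01.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        for paragraph in article["paragraphs"]:  # field name assumed
            passages.append([article["title"], paragraph])

embeddings = model.encode(passages, convert_to_tensor=True, show_progress_bar=True)
embeddings = util.normalize_embeddings(embeddings)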
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COCO Dataset Processed with CLIP ViT-L/14
Overview
This dataset represents a processed version of the '2017 Unlabeled images' subset of the COCO dataset (COCO Dataset), utilizing the CLIP ViT-L/14 model from OpenAI. The original dataset comprises 123K images, approximately 19GB in size, which have been processed to generate 768-dimensional vectors. These vectors can be utilized for various applications like semantic search systems, image similarity assessments, and more.… See the full description on the dataset page: https://huggingface.co/datasets/s-emanuilov/coco-clip-vit-l-14.
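A sketch of text-to-image retrieval against such vectors with the same model via Hugging Face transformers; the random matrix stands in for the dataset's precomputed image embeddings, whose column names are not shown above.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder for the dataset's precomputed 768-dimensional image vectors.
image_embeddings = torch.nn.functional.normalize(torch.randn(1000, 768), dim=-1)

inputs = processor(text=["a dog playing in the snow"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = (text_emb @ image_embeddings.T).squeeze(0)
print(scores.topk(5).indices)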
https://choosealicense.com/licenses/cc/
Emoji Metadata Dataset
Overview
The LLM Emoji Dataset is a comprehensive collection of enriched semantic descriptions for emojis, generated using Meta AI's Llama-3-8B model. This dataset aims to provide semantic context for each emoji, enhancing their usability in various NLP applications, especially those requiring semantic search. The LLM Emoji Dataset was used to build a multilingual search engine for emojis, which you can interact with using this online Streamlit… See the full description on the dataset page: https://huggingface.co/datasets/badrex/LLM-generated-emoji-descriptions.
ORI Dataset
Description: This dataset contains prompts and their embeddings from multiple benchmarks:
MMLU-Pro
GPQA
HLE
LiveCodeBench
SciCode
HumanEval
Math-500
AIME
Embedding Model: SentenceTransformer bge-small-en-v1.5
Embedding Dimension: 512
Creation Date: 2025-05-30
Other Notes: Useful for tasks like semantic search, retrieval-augmented generation, and similarity.
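For similarity against the stored prompt embeddings, a query can be encoded with the same BGE model, as in the brief sketch below; the retrieval instruction prefix is the one generally recommended for BGE models, and the dataset's own column names are not shown above.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# BGE models recommend this prefix for retrieval-style queries.
query = "Represent this sentence for searching relevant passages: hardest AIME-style number theory problems"
query_emb = model.encode(query, normalize_embeddings=True)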