Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (hi) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (hi) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings.
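A minimal sketch of semantic search over these precomputed embeddings, assuming the dataset exposes an "emb" field and using the legacy cohere Python client (pre-5.0) for co.embed with model="multilingual-22-12"; the field name, sample size, and query are illustrative.

import cohere
import torch
from datasets import load_dataset

co = cohere.Client("<YOUR_API_KEY>")  # replace with your Cohere API key

# Stream a small sample of documents together with their precomputed embeddings.
docs = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train", streaming=True)
texts, embeddings = [], []
for doc in docs:
    texts.append(doc["title"] + " " + doc["text"])
    embeddings.append(doc["emb"])  # embedding field name is an assumption
    if len(texts) >= 1000:
        break
doc_emb = torch.tensor(embeddings)

# Embed the query with the same model and rank documents by dot product.
query_emb = torch.tensor(
    co.embed(texts=["भारत की राजधानी"], model="multilingual-22-12").embeddings
)
scores = (query_emb @ doc_emb.T).squeeze(0)
for idx in scores.topk(3).indices:
    print(texts[idx][:120])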
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Big Hard Negatives Dataset
A dataset for training embedding models for semantic search, provided in a nixietune-compatible format: { "query": ")what was the immediate impact of the success of the manhattan project?", "pos": [ "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and… See the full description on the dataset page: https://huggingface.co/datasets/nixiesearch/bfhnd.
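For illustration only, a record in this format might look like the sketch below; the "neg" key is assumed by analogy with "pos" (the excerpt above is cut off before any negatives appear), and the passage texts are placeholders.

record = {
    "query": ")what was the immediate impact of the success of the manhattan project?",
    "pos": ["<a passage that answers the query>"],
    "neg": ["<a hard-negative passage that looks relevant but does not answer it>"],
}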
Dataset Card for Dataset Name
This dataset is similar to Free-Law-Project/opinions-synthetic-query-512; the only difference is that the opinions are chunked to at most 7800 tokens instead of 480, tokenized using the bert-base-cased tokenizer with a 2-sentence overlap. The token count is kept just shy of the 8192 context-window limit to account for tokenization variation between the different encoder models used in the experiments. The dataset is used to finetune the semantic search model… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-8192.
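A rough sketch of this kind of token-budget chunking with a two-sentence overlap follows; it is an approximation under stated assumptions (NLTK sentence splitting, greedy packing), not the Free Law Project's actual pipeline.

from nltk.tokenize import sent_tokenize  # assumes the NLTK "punkt" data is installed
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
MAX_TOKENS = 7800
OVERLAP_SENTENCES = 2

def chunk_opinion(text: str) -> list[str]:
    """Greedily pack sentences into chunks of at most MAX_TOKENS tokens,
    carrying the last two sentences into the next chunk as overlap."""
    sentences = sent_tokenize(text)
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(tokenizer.encode(candidate)) > MAX_TOKENS:
            chunks.append(" ".join(current))
            current = current[-OVERLAP_SENTENCES:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks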
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "PAQ_pairs"
Dataset Summary
Pairs of questions and answers obtained from Wikipedia. Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/PAQ_pairs.
SILMA Arabic Triplets Dataset - v1.0
Overview
The SILMA Arabic Triplets Dataset - v1.0 is a high-quality, diverse dataset specifically curated for training embedding models for semantic search tasks in the Arabic language. The dataset contains more than 2.25M records (2,280,319 records). It includes triplets in the form of anchor, positive, and negative samples, designed to help models learn semantic similarity and dissimilarity. The… See the full description on the dataset page: https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0.
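A minimal sketch of finetuning a sentence-transformers model on such triplets; the base model and the assumption that the columns are ordered anchor/positive/negative are illustrative and not taken from the dataset card.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model choice is an assumption; any multilingual encoder could be used.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
train_dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")

# With (anchor, positive, negative) columns, MultipleNegativesRankingLoss uses
# in-batch negatives in addition to the provided hard negative.
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()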
Dataset Card for Free-Law-Project/opinions-synthetic-query-512
This dataset is created from the opinions-metadata dataset and is used for training the Free Law Project semantic search models, including Free-Law-Project/modernbert-embed-base_finetune_512.
Dataset Details
The dataset is curated by Free Law Project by selecting the train split from the opinions-metadata dataset. It is created for finetuning encoder models for semantic search with a 512-token context window. The… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Arxiver Dataset
Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs, and the corresponding markdown files for papers published between January 2023 and October 2023. We hope our dataset will be useful for various applications such as semantic search, domain-specific language modeling, question answering, and summarization.
Curation
The Arxiver dataset is… See the full description on the dataset page: https://huggingface.co/datasets/neuralwork/arxiver.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "Amazon-QA"
Dataset Summary
This dataset contains Question and Answer data from Amazon. Disclaimer: The team releasing Amazon-QA did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/Amazon-QA.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
sfia-9-chunks Dataset
Overview
The sfia-9-chunks dataset is a derived dataset from sfia-9-scraped. It uses sentence embeddings and hierarchical clustering to split each SFIA-9 document into coherent semantic chunks. This chunking facilitates more efficient downstream tasks like semantic search, question answering, and topic modeling.
Chunking Methodology
We employ the following procedure to generate chunks: from sentence_transformers import… See the full description on the dataset page: https://huggingface.co/datasets/Programmer-RD-AI/sfia-9-chunks.
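Since the card's code excerpt is cut off above, here is a hedged reconstruction of the general approach it describes (embed sentences, then group them with hierarchical clustering); the model choice and distance threshold are assumptions, and this is not the repository's exact script.

from nltk.tokenize import sent_tokenize  # assumes the NLTK "punkt" data is installed
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

def semantic_chunks(document: str, distance_threshold: float = 1.0) -> list[str]:
    sentences = sent_tokenize(document)
    if len(sentences) < 2:
        return sentences
    embeddings = model.encode(sentences)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(embeddings)
    # Group sentences by cluster label, preserving their document order.
    groups: dict[int, list[str]] = {}
    for sentence, label in zip(sentences, labels):
        groups.setdefault(label, []).append(sentence)
    return [" ".join(group) for group in groups.values()]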
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" whose value is the list of sentences. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Quran Embeddings Dataset
This repository contains vector embeddings for the Holy Quran, generated using OpenAI's embedding model. These embeddings can be used for semantic search, question answering, and other natural language processing tasks related to Quranic text.
Dataset Information
The dataset consists of a single JSON file:
quran_embeddings.json: Contains embeddings for each verse (ayah) of the Quran with associated metadata
Metadata Structure
Each… See the full description on the dataset page: https://huggingface.co/datasets/promehedi/quran_embeddings.
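A minimal sketch of cosine-similarity search over this file; the field names ("embedding", "text") are hypothetical, since the metadata structure is only summarized above, and the query embedding must come from the same OpenAI embedding model.

import json
import numpy as np

with open("quran_embeddings.json", encoding="utf-8") as f:
    verses = json.load(f)  # assumed to be a list of per-ayah records

# Field name "embedding" is hypothetical; adjust to the actual key.
matrix = np.array([v["embedding"] for v in verses], dtype=np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

def search(query_embedding: np.ndarray, top_k: int = 5):
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = matrix @ q
    top = np.argsort(-scores)[:top_k]
    return [(verses[i].get("text"), float(scores[i])) for i in top]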
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🌌 Star Wars Unlimited Card Database 🎴
✨ Dataset Overview
This repository contains card data from the Star Wars Unlimited (SWU) trading card game in two formats:
📊 Structured Card Database: A comprehensive SQLite database containing detailed card information including names, types, costs, abilities, aspects, and more.
🧠 Vector Embeddings: Card data encoded as vector embeddings for semantic search and AI applications using the all-MiniLM-L6-v2 model.
These datasets… See the full description on the dataset page: https://huggingface.co/datasets/chanfriendly/starwarsunlimited.
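As a usage illustration for the embeddings described above, the sketch below queries card text with the same all-MiniLM-L6-v2 encoder; the card texts are placeholders rather than the repository's actual schema.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder card texts; in practice these would come from the SQLite database.
cards = ["<card 1 name and rules text>", "<card 2 name and rules text>"]
card_emb = model.encode(cards, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(
    "cheap unit that deals damage when played",
    convert_to_tensor=True,
    normalize_embeddings=True,
)
for hit in util.semantic_search(query_emb, card_emb, top_k=2)[0]:
    print(cards[hit["corpus_id"]], round(hit["score"], 3))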
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Query Expansion Dataset
This dataset is designed to train search query expansion models that can generate multiple semantic expansions for a given query.
Purpose
The goal of this dataset is to serve as input for training small language models (0.5B to 3B parameters) to act as query expander models in various search systems, including but not limited to Retrieval-Augmented Generation (RAG) systems. Query expansion is a technique used to enhance search results by generating… See the full description on the dataset page: https://huggingface.co/datasets/s-emanuilov/query-expansion.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenRelay Dataset
The OpenRelay Dataset is a collection of curated articles, tool reviews, user comments, and productivity-related content sourced from the OpenRelay platform. It’s designed to support training and evaluation of machine learning models for tasks such as text classification, summarization, semantic search, and question answering in the context of tech and productivity tools.
Dataset Structure
Each entry in the dataset may include fields like:
title:… See the full description on the dataset page: https://huggingface.co/datasets/openrelay/openrelay-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains semantic search queries and keyword-based search queries tailored for a tech e-commerce application. It is designed to help train models for search intent classification, semantic search, or query understanding.
🧠 Intent Types
Semantic Queries: Natural language queries that express user intent, e.g.,
"best laptop for online classes"
"camera with good night mode under 30000 Taka"
These were generated using DeepSeek with the following prompt:
Generate a… See the full description on the dataset page: https://huggingface.co/datasets/roundspecs/tech_product_search_intent.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM carried out in two steps: (1) We first conduct a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct dataset, and flag tasks where any returned instance overlaps (using 10-grams) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
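A minimal sketch of the 10-gram overlap test mentioned in step (1); the semantic-search retrieval itself is omitted, and the tokenization (whitespace split, lowercased) is a simplifying assumption rather than the exact RedPajama pipeline.

def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    # Whitespace tokenization and lowercasing are simplifying assumptions.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(helm_validation_example: str, instruct_instance: str, n: int = 10) -> bool:
    # Flag the pair if the two texts share any 10-gram.
    return bool(ngrams(helm_validation_example, n) & ngrams(instruct_instance, n))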
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
For research use only. The original data comes from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We use the nq_distilbert-base-v1 model to encode all the data into PyTorch tensors and normalize the embeddings using sentence_transformers.util.normalize_embeddings.
How to use
See the notebook Wikipedia Q&A Retrieval - Semantic Search
Installing the package
!pip install sentence-transformers==2.3.1
The conversion process
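Since the description of the conversion is cut off here, the sketch below shows one way it could look (encode [title, paragraph] pairs with nq-distilbert-base-v1, then normalize); the JSONL field names ("title", "paragraphs") follow the sbert.net semantic-search example and are assumptions.

import gzip
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nq-distilbert-base-v1")

passages = []
with gzip.open("simplewiki-2020-11-01.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        for paragraph in article["paragraphs"]:  # field name assumed
            passages.append([article["title"], paragraph])

embeddings = model.encode(passages, convert_to_tensor=True, show_progress_bar=True)
embeddings = util.normalize_embeddings(embeddings)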
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COCO Dataset Processed with CLIP ViT-L/14
Overview
This dataset represents a processed version of the '2017 Unlabeled images' subset of the COCO dataset (COCO Dataset), utilizing the CLIP ViT-L/14 model from OpenAI. The original dataset comprises 123K images, approximately 19GB in size, which have been processed to generate 768-dimensional vectors. These vectors can be utilized for various applications like semantic search systems, image similarity assessments, and more.… See the full description on the dataset page: https://huggingface.co/datasets/s-emanuilov/coco-clip-vit-l-14.
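A sketch of text-to-image retrieval against such vectors with the same model via Hugging Face transformers; the random matrix stands in for the dataset's precomputed image embeddings, whose column names are not shown above.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder for the dataset's precomputed 768-dimensional image vectors.
image_embeddings = torch.nn.functional.normalize(torch.randn(1000, 768), dim=-1)

inputs = processor(text=["a dog playing in the snow"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = (text_emb @ image_embeddings.T).squeeze(0)
print(scores.topk(5).indices)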
https://choosealicense.com/licenses/cc/
Emoji Metadata Dataset
Overview
The LLM Emoji Dataset is a comprehensive collection of enriched semantic descriptions for emojis, generated using Meta AI's Llama-3-8B model. This dataset aims to provide semantic context for each emoji, enhancing their usability in various NLP applications, especially those requiring semantic search. The LLM Emoji Dataset was used to build a multilingual search engine for emojis, which you can interact with using this online Streamlit… See the full description on the dataset page: https://huggingface.co/datasets/badrex/LLM-generated-emoji-descriptions.
ORI Dataset
Description: This dataset contains prompts and their embeddings from multiple benchmarks:
MMLU-Pro
GPQA
HLE
LiveCodeBench
SciCode
HumanEval
Math-500
AIME
Embedding Model: SentenceTransformer bge-small-en-v1.5
Embedding Dimension: 512
Creation Date: 2025-05-30
Other Notes: Useful for tasks like semantic search, retrieval-augmented generation, and similarity.
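For similarity against the stored prompt embeddings, a query can be encoded with the same BGE model, as in the brief sketch below; the retrieval instruction prefix is the one generally recommended for BGE models, and the dataset's own column names are not shown above.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# BGE models recommend this prefix for retrieval-style queries.
query = "Represent this sentence for searching relevant passages: hardest AIME-style number theory problems"
query_emb = model.encode(query, normalize_embeddings=True)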