Training Data for Text Embedding Models
This repository contains raw datasets, all of which have also been formatted for easy training in the Embedding Model Datasets collection. We recommend looking there first.
This repository contains training files to train text embedding models, e.g. using sentence-transformers.
Data Format
All files are in a jsonl.gz format: each line contains a JSON object that represents one training example. The JSON objects can come in… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/embedding-training-data.
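As a minimal sketch of consuming this format (the file name below is a placeholder, not a file from this repository):
```python
import gzip
import json

# Stream training examples from one of the jsonl.gz files.
with gzip.open("some_file.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)  # one training example per line
        print(example)
        break
```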
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for investopedia-embedding dataset
We curate a dataset of substantial size pertaining to finance from Investopedia using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.
Dataset Description… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset.
http://www.apache.org/licenses/LICENSE-2.0
Copyright 2020 Riccardo CAPPUZZO
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The datasets contained in this directory were used while working with EmbDI on the relevant paper. Please refer to the full repository for more info.
What is provided here was sourced mostly from The Magellan Data Repository. For each dataset, three tables are provided: table-A and table-B are taken from the original repository and slightly modified (lower casing, spaces replaced by _, some special characters removed), while the third table is the concatenation of tables A and B.
Edgelists are the data structures used by EmbDI. They are generated starting from each concatenated dataset and are then fed to the algorithm.
The EQ tests folder contains all the tests used to perform the Embeddings Quality evaluation in the paper.
The additional resources include:
(partially preprocessed) base datasets.
Their Entity Resolution and Schema Matching versions.
Edgelists for both ER and SM versions.
Ground truth files for ER and SM tasks.
Test directories for the EQ task.
Copies of the configuration files provided in this repository.
Configuration files, info files and ER match files were left in this repository in pipeline/config_files/default, pipeline/info and pipeline/matches/default.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
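As a quick illustration of how these pre-computed vectors can be used for search, the hedged sketch below streams a small sample of the dataset with the Hugging Face datasets library and ranks it by dot product; the column names ('emb', 'text') and the sample size are assumptions based on the dataset card, not guarantees.
```python
import numpy as np
from datasets import load_dataset

# Stream a small sample of the pre-computed embeddings (column names are assumptions).
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

texts, vectors = [], []
for row in docs.take(1000):
    texts.append(row["text"])
    vectors.append(row["emb"])
vectors = np.asarray(vectors, dtype=np.float32)

# A query embedded with the same multilingual-22-12 model (via the Cohere API) would be
# compared by dot product; here one of the document vectors stands in for a real query.
query = vectors[0]
scores = vectors @ query
print(texts[int(scores.argmax())][:200])
```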
The Embeddings extension enhances CKAN's search and discovery capabilities by leveraging machine learning embeddings. It encodes dataset metadata into numerical vectors, allowing for similarity-based comparisons between datasets and enabling semantic search functionality. This approach goes beyond simple keyword matching to consider the meaning and context of dataset information, ultimately improving data discoverability for users.
Key Features:
Similar Dataset Recommendations: Computes and ranks dataset embeddings against a selected dataset, returning the most semantically similar datasets. The number of returned datasets is configurable.
Semantic Search: Ranks dataset embeddings against a user-provided query term to find datasets that are semantically similar to the search query. This provides a "Dense Vector Search" capability within the Solr search engine.
Configurable Embedding Backends: Supports multiple backends for generating embeddings, including a local Sentence Transformers model and OpenAI's Embeddings API. Users can also implement custom backends.
Customizable Solr Integration: The extension requires a custom Solr schema with a Dense Vector Search field to support semantic search, and allows configuration of a different vector field to test out different models.
Pluggable Embedding Model: Supports the use of different embedding models; however, the model's dimensions need to match the Solr field's dimensions.
Technical Integration: The Embeddings extension integrates with CKAN through a plugin architecture, adding a package_similar_show action to the CKAN API that returns similar datasets. It also modifies the package_search action to offer semantic search capabilities. The extension requires configuration settings in the CKAN ini file to specify the embedding backend, API keys (if applicable), and Solr vector field name. A custom Solr schema, provided via a Dockerfile, is needed to enable the Dense Vector Search functionality. The plugin also enables the use of a custom query parser in Solr.
Benefits & Impact: The Embeddings extension improves data discovery within CKAN by providing a more sophisticated search mechanism that considers the semantic meaning of dataset metadata. This leads users to more relevant datasets, even if their search terms don't explicitly match keywords in the metadata. The ability to use different embedding backends and customize the Solr integration provides flexibility for different use cases and environments for the CKAN installation.
https://creativecommons.org/publicdomain/zero/1.0/
This CSV dataset contains the Title, Abstract,... and Embeddings of today's arXiv papers. It gets updated every day (there are no new papers on arXiv on Saturdays, Sundays, and holidays). The notebook that calculates the Embeddings is here. You can find the complete code to reproduce this dataset on our GitHub: https://github.com/orxaicom/daily-arxiv-embeddings We use this to visualize the arXiv papers every day; check it out on our website: https://www.orxai.com
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.
Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.
The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.
To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.
The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:
```python
import itertools
import sklearn.metrics.pairwise
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

# Set streaming to False if you wish to load the entire dataset into memory
# (unadvised unless you have at least 64 GB of RAM).
oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)
sample = list(itertools.islice(oale, 100000))

query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

print(most_similar['text'])
```
To speed up the loading of the Embeddings, you may wish to install orjson.
The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512 tokens long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:
```
Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
```
The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.
The resulting embeddings were serialised as JSON-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.
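The following is a minimal sketch of that serialisation step, not the repository's actual code; the output path and toy vectors are placeholders:
```python
import orjson

# Toy stand-ins for the 384-dimensional vectors; each is written as a
# JSON-encoded list of floats on its own line, mirroring data/embeddings.jsonl.
embeddings = [[0.01, -0.02, 0.03], [0.04, 0.05, -0.06]]

with open("embeddings.jsonl", "wb") as f:  # orjson.dumps returns bytes
    for vector in embeddings:
        f.write(orjson.dumps(vector))
        f.write(b"\n")
```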
The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
This corpus provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, which are pre-trained on large-scale high-quality data. These vectors, capturing semantic meanings for Chinese words and phrases, can be widely applied in many downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
https://ai.tencent.com/ailab/nlp/embedding.html
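A minimal sketch of loading the released vectors, assuming the corpus is distributed as a word2vec-style text file and has been downloaded locally (the file name below is a placeholder):
```python
from gensim.models import KeyedVectors

# "tencent_ailab_embedding.txt" is a placeholder for the unpacked corpus file.
vectors = KeyedVectors.load_word2vec_format("tencent_ailab_embedding.txt", binary=False)

print(vectors.vector_size)                       # expected: 200
print(vectors.most_similar("自然语言处理", topn=5))  # nearest neighbours of "natural language processing"
```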
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outline
This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI)
Video data that simulates daily life actions in a virtual space from Scenario Data.
Knowledge graphs, and transcriptions of the Video Data content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).
Knowledge Graph Embedding Data created for machine-learning-based reasoning
This data is open to the public as open data
Details
Videos
mp4 format
203 action scenarios
For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and a fixed camera view placed in each corner of the room (file name ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes). A total of 1,218 videos
Videos with slowly moving characters simulate the movements of elderly people.
Knowledge Graphs
RDF format
203 knowledge graphs corresponding to the videos
Includes schema and location supplement information
The schema is described below
SPARQL endpoints and query examples are available
Script Data
txt format
Data provided to VirtualHome2KG to generate videos and knowledge graphs
Includes the action title and a brief description in text format.
Embedding
Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)
Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).
Specification of Ontology
Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm
Related Resources
KGRC4SI Final Presentations with automatic English subtitles (YouTube)
VirtualHome2KG (Software)
VirtualHome-AIST (Unity)
VirtualHome-AIST (Python API)
Visualization Tool (Software)
Script Editor (Software)
We provide the code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models. With the dense passage retriever (DPR) [1], we encode text snippets from the C4 dataset [2] to generate 768-dimensional vectors:
context DPR embeddings for the base set and question DPR embeddings for the query set.
The metric for similarity search is inner product [1].
The number of base and query embedding vectors is parametrizable.
See the main repository for details on how to generate the DPR10M specific instance introduced in [3].
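For illustration, the hedged sketch below shows how such context and question vectors can be produced with the public DPR checkpoints on Hugging Face and compared by inner product; it is not the repository's generation code, and the checkpoint names and example texts are assumptions:
```python
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_encoder = DPRContextEncoder.from_pretrained(ctx_name)
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
q_encoder = DPRQuestionEncoder.from_pretrained(q_name)
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)

passages = ["Example C4 snippet about astronomy.", "Another snippet about cooking."]
ctx_inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
base_vectors = ctx_encoder(**ctx_inputs).pooler_output     # shape: (n_passages, 768)

q_inputs = q_tokenizer("What is astronomy?", return_tensors="pt")
query_vector = q_encoder(**q_inputs).pooler_output         # shape: (1, 768)

scores = query_vector @ base_vectors.T                     # inner-product similarity, as in [1]
print(scores)
```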
[1] Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.: Dense Passage Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781 (2020)
[2] Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: The Journal of Machine Learning Research 21, 140:1–140:67 (2020)
[3] Aguerrebere, C.; Bhati, I.; Hildebrand, M.; Tepper, M.; Willke, T.: Similarity Search in the Blink of an Eye with Compressed Indices. In: Proceedings of the VLDB Endowment 16, 11 (2023)
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles), with some in Russian (325,952 articles). Keywords are included for articles after 2015.
The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets - please see README files inside those zip files.
The main archive contains JSON files of all the Estonian articles from 2009 to May 2019. These datasets are intended for use in EMBEDDIA, an H2020 project. Articles are in Estonian, with some in Russian.
The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets (please see README files inside those zip files), in short:
- ee_articles_2015-2019: This dataset contains Estonian and Russian articles - 5 years, with tags, that were missing in the previous versions.
- files ee_articles_2015_2019_lemmatized and ee_articles_2009_2014_lemmatized are the files preprocessed by TEXTA (contact linda@texta.ee)
- in file ee_and_stt_article_lemmas_embeddings_and_usage you can find w2v embeddings trained by TEXTA (contact linda@texta.ee)
Description of the Main Dataset (ee_articles_2009_2019)
There are 12 JSON files:
articles_2009_ver2.json contains 161394 articles from the year 2009
articles_2010_ver2.json contains 151033 articles from the year 2010
articles_2011_ver2.json contains 168273 articles from the year 2011
articles_2012_ver2.json contains 152772 articles from the year 2012
articles_2013_ver2.json contains 141012 articles from the year 2013
articles_2014_ver2.json contains 128388 articles from the year 2014
articles_2015_ver2.json contains 127425 articles from the year 2015
articles_2016_ver2.json contains 130704 articles from the year 2016
articles_2017_ver2.json contains 119318 articles from the year 2017
articles_2018_ver2.json contains 117388 articles from the year 2018
articles_2019_Jan-Apr_ver2.json contains 35076 articles from the year 2019 January to April
articles_2019_May_ver2.json contains 8329 articles from the year 2019 May
In sum: 1 441 112 articles
Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
id (integer) - the ID of the article
title (string) - the title of the article
lead (string) - the lead of the article (can contain HTML, e.g. tag)
url (string) - the URL of the article
domain_id (string) [2] - the ID of the domain
id (string) - the ID of the tag
lang (string) - the language of the tag
tag (string) - the tag itself, e.g. Kert Kingo (a name)
translitted_name (string) - a modified version of the tag, e.g. kert-kingo
rawBody (string) - the raw text of the article (contains HTML)
bodyText (string) - clean article text (stripped from HTML)
publishDate (string) - published date & time of the article
categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
categoryId (integer) - the ID of the category
categoryName (string) - the name of the category (e.g. World)
channelId (integer) - the ID of the channel
articleId (integer) - the ID of the article
categoryId (integer) - the ID of the category
categoryName (string) - the name of the category (e.g. World)
categoryPrimary (boolean) - unknown
categorySort (integer) - unknown
categoryUrl (string) - the URL of the category
categoryVisible (boolean) - unknown
channelId (integer) - the ID of the channel
channelUrl (string) - the URL of the channel (e.g. 'https://sport.delfi.ee')
directoryName (string) - unknown
channelLanguage (string or None) [3] - the language of the channel
categoryLanguage (int or None) [4] - unknown
commentCount (int) [5] - the number of comments
relatedArticles (list of integers) - a list of related articles' ID's
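A minimal sketch of reading one of these JSON files and inspecting a few of the fields described above (the path is a placeholder for wherever the archive was extracted):
```python
import json

# "articles_2015_ver2.json" is one of the files listed above; adjust the path as needed.
with open("articles_2015_ver2.json", encoding="utf-8") as f:
    articles = json.load(f)              # a list of dictionaries, one per article

print(len(articles))                     # expected: 127425 for 2015
first = articles[0]
print(first["id"], first["title"], first["publishDate"])
print(first["bodyText"][:200])           # clean article text, HTML stripped
```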
https://spdx.org/licenses/CC0-1.0.html
Objective
To construct and publicly release a set of medical concept embeddings for codes following the ICD-10 coding standard which explicitly incorporate hierarchical information from medical codes into the embedding formulation.
Materials and Methods
We trained concept embeddings using several new extensions to the Word2Vec algorithm on a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classification Software Revised (CCSR) dataset. We compare these results to sets of publicly released pretrained embeddings and alternative training methodologies.
Results
We found that Word2Vec models which included hierarchical data outperformed ordinary Word2Vec alternatives on tasks which compared naïve clusters to canonical ones provided by CCSR. Our Skip-Gram model with both codes and categories achieved 61.4% Normalized Mutual Information with canonical labels in comparison to 57.5% with traditional Skip-Gram. In models operating on two different outcomes we found that including hierarchical embedding data improved classification performance 96.2% of the time. When controlling for all other variables, we found that co-training embeddings improved classification performance 66.7% of the time. We found that all models outperformed our competitive benchmarks.
Discussion
We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.
Methods
This dataset includes trained medical concept embeddings for 5428 ICD-10 codes and 394 Clinical Classification Software (Revised) (CCSR) categories. We include several different sets of concept embeddings, each trained using a slightly different set of hyperparameters and algorithms.
To train our models, we employed data from the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system. KPMAS is an integrated medical system serving approximately 780,000 members in Maryland, Virginia, and the District of Columbia. KPMAS has a comprehensive Electronic Medical Record system which includes data from all patient interactions with primary or specialty caregivers, from which all data is derived. Our embeddings training set included diagnoses allocated to all adult patients in calendar year 2019.
For each code, we also recovered an associated category, as assigned by the Clinical Classification Software (Revised).
We trained 12 sets of embeddings using classical Word2Vec models with settings differing across three parameters. Our first parameter was the selection of training algorithm, where we trained both CBOW and SG models. Each model was trained using dimension k of 10, 50, and 100. Furthermore, each model-dimension combination was trained with categories and codes trained separately and together (referred to hereafter as ‘co-trained embeddings’ or ‘co-embeddings’). Each model was trained for 10 iterations. We employed an arbitrarily large context window (100), since all codes necessarily occurred within a short period (1 year).
We also trained a set of validation embeddings only on ICD-10 codes using the Med2Vec architecture as a comparison. We trained the Med2Vec model on our data using its default settings, including the default vector size (200) and a training regime of 10 epochs. We grouped all codes occurring on the same calendar date as Med2Vec ‘visits.’ Our Med2Vec model benchmark did not include categorical entities or other novel innovations.
Word2Vec embeddings were generated using the GenSim package in Python. Med2Vec embeddings were generated using the Med2Vec code published by Choi. The JSON files used in this repository were generated using the JSON package in Python.
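A minimal sketch of the co-trained Word2Vec setup described above, using gensim and a hypothetical toy corpus (the ICD-10 codes and CCSR category labels shown are illustrative only):
```python
from gensim.models import Word2Vec

# Each "sentence" is one patient's year of diagnoses; ICD-10 codes and CCSR category
# tokens are mixed into the same sequence so that they share an embedding space.
patients = [
    ["E11.9", "CCSR:END005", "I10", "CCSR:CIR007"],
    ["I10", "CCSR:CIR007", "E78.5", "CCSR:END010"],
]

model = Word2Vec(
    sentences=patients,
    vector_size=100,  # dimension k in {10, 50, 100}
    window=100,       # arbitrarily large context window: all codes fall within one year
    sg=1,             # Skip-Gram (use sg=0 for CBOW)
    min_count=1,
    epochs=10,        # 10 iterations
)
print(model.wv["E11.9"][:5])
```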
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data required to reproduce results in "Electrostatic Embedding of Machine Learning Potentials" article. See https://github.com/emedio/embedding for details.
https://creativecommons.org/publicdomain/zero/1.0/
These are the embeddings powering searchthearxiv.com, a semantic search engine for more than a decade's worth of ML papers published on arXiv. The embeddings are created by running OpenAI's text-embedding-ada-002 model on an "augmented" abstract for each paper (title+authors+year+abstract). The papers are sourced from the metadataset published by Cornell University and filtered to include only papers belonging to at least one of the following categories:
cs.cv, cs.lg, cs.cl, cs.ai, cs.ne, and cs.ro.
The dataset is updated on a weekly basis (in lockstep with the official arXiv metadataset).
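For illustration, here is a hedged sketch of embedding one "augmented" abstract with the same model via the OpenAI Python client (the paper text is a placeholder and this is not the site's actual pipeline):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical paper; the "augmented" abstract concatenates title, authors, year and abstract.
augmented = "Attention Is All You Need. Vaswani et al. 2017. The dominant sequence transduction models are based on ..."
response = client.embeddings.create(model="text-embedding-ada-002", input=augmented)
vector = response.data[0].embedding  # a 1536-dimensional list of floats
print(len(vector))
```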
If you find the dataset and/or searchthearxiv.com useful, consider giving the repo a ⭐️ on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this repository, we include code to prepare the dataset, train the GemNet model, build the faiss index, search the faiss index, and visualize the searched results in the notebook faiss-gemnet-qm9-mp.ipynb. It reproduces our examples in the manuscript for the QM9 and Materials Project datasets. For the OC20 dataset, we did not include the related data here because of its large size (> 50 GB); the code to process the OC20 dataset is almost the same as the code included in the notebook for the QM9 dataset.
We include the intermediate data (GemNet checkpoints, lmdb, faiss index and the searched results for QM9 and the Materials Project) in the directory example-data. We also put the GemNet checkpoint for the OC20 dataset in this directory. The training and evaluation of the Gaussian process regression model using the searched molecules for the query Benzene are demonstrated in the ben-gp-data directory, in which qm9-gp-gemnet-morgan-random-nrg.ipynb can be run on Colab.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We gathered a Bangla math dataset with solutions and created embeddings. It can be used to find similar problems for RAG or few-shot prompting by simply loading it and doing a faiss index search. The dataset used to compute the embeddings contains 126,675 problems. Here's the dataset link: https://www.kaggle.com/datasets/sourav2083/math-bangla/data
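A minimal sketch of such a faiss search, with stand-in vectors in place of the released embeddings (the dimension and counts are assumptions):
```python
import faiss
import numpy as np

d, n = 768, 10_000                       # stand-ins; the released dataset has 126,675 problems
problem_embeddings = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(problem_embeddings)   # cosine similarity via normalized inner product

index = faiss.IndexFlatIP(d)
index.add(problem_embeddings)

query = np.random.rand(1, d).astype("float32")  # embed the new problem with the same model
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)     # indices of the 5 most similar problems
print(ids[0], scores[0])
```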
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DBpedia graph embeddings using RDF2Vec. RDF2Vec embedding generation code can be found here and is based on a publication by Portisch et al. [1].
The embeddings dataset consists of 200-dimensional vectors of DBpedia entities (from 1/9/2021).
Generating Embeddings
The code for generating these embeddings can be found here.
Run the run.sh script, which wraps all the necessary commands to generate embeddings:
bash run.sh
The script downloads a set of DBpedia files, which are listed in dbpedia_files.txt. It then builds a Docker image and runs a container of that image that generates the embeddings for the DBpedia graph defined by the DBpedia files.
A folder files is created containing all the downloaded DBpedia files, and a folder embeddings/dbpedia is created containing the embeddings in vectors.txt along with a set of random walk files.
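A minimal sketch of loading the resulting vectors, assuming each line of vectors.txt is an entity name followed by its 200 values separated by spaces (the actual layout may differ):
```python
import numpy as np

embeddings = {}
with open("embeddings/dbpedia/vectors.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)  # 200 values per entity

print(len(embeddings))
```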
Run Time of Embeddings Generation
Generating embeddings can take more than a day, but it depends on the number of DBpedia files chosen to be downloaded. Following are some basic run time statistics when embeddings are generated on a 64 GB RAM, 8 cores (AMD EPYC), 1 TB SSD, 1996.221 MHz machine.
Total: 1 day, 8 hours, 52 minutes, 41 seconds
Walk generation: 0 days, 7 hours, 24 minutes, 36 seconds
Training: 1 day, 1 hour, 28 minutes, 5 seconds
Parameters Used
The following parameters were used to generate the embeddings provided here:
Number of walks per entity: 100
Depth (hops) per walk: 4
Walk generation mode: RANDOM_WALKS_DUPLICATE_FREE
Threads: # of processors / 2
Training mode: sg
Embeddings vector dimension: 200
Minimum word2vec word count: 1
Sample rate: 0.0
Training window size: 5
Training epochs: 5
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.
Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.
Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.
Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:
Background and Motivation
In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.
While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.
In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.
Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2
Paper: to be published
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.
Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.
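A minimal sketch of that splitting protocol, assuming triples.tsv has no header row and tab-separated head/relation/tail columns:
```python
import os

import pandas as pd

triples = pd.read_csv("triples.tsv", sep="\t", names=["head", "relation", "tail"])

for seed in range(5):                    # five independent random splits
    shuffled = triples.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train, n_valid = int(0.8 * len(shuffled)), int(0.1 * len(shuffled))
    os.makedirs(f"split_{seed}", exist_ok=True)
    shuffled.iloc[:n_train].to_csv(f"split_{seed}/train.tsv", sep="\t", index=False, header=False)
    shuffled.iloc[n_train:n_train + n_valid].to_csv(f"split_{seed}/valid.tsv", sep="\t", index=False, header=False)
    shuffled.iloc[n_train + n_valid:].to_csv(f"split_{seed}/test.tsv", sep="\t", index=False, header=False)
```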
wikidata-authors
This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.
The files are as follows:
entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
label: A human-readable label of this entity (extracted from Wikidata).
relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.
label: A human-readable label of this relation (extracted from Wikidata).
triples.tsv: A tab-separated file of all triples in the dataset, in the form of <head entity>, <relation>, <tail entity>.
fb15krr-linked
This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.
The files are as follows:
entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
mid: The Freebase machine ID (MID) of this entity.
wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.
label: A human-readable label of this entity (extracted from Wikidata).
types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].
relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
label: The hierarchical Freebase label of this relation.
triples.tsv: A tab-separated file of all triples in the dataset, in the form of <head entity>, <relation>, <tail entity>.