100+ datasets found
  1. embedding-training-data

    • huggingface.co
    Updated Sep 9, 2021
    Cite
    Sentence Transformers (2021). embedding-training-data [Dataset]. https://huggingface.co/datasets/sentence-transformers/embedding-training-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 9, 2021
    Dataset authored and provided by
    Sentence Transformers
    Description

    Training Data for Text Embedding Models

    This repository contains raw datasets, all of which have also been formatted for easy training in the Embedding Model Datasets collection. We recommend looking there first.

    This repository contains training files to train text embedding models, e.g. using sentence-transformers.

      Data Format
    

    All files are in a jsonl.gz format: each line contains a JSON object that represents one training example. The JSON objects can come in… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/embedding-training-data.
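
    As a quick orientation, here is a minimal sketch of how one of these jsonl.gz files could be read line by line; the file name is a placeholder, and the exact keys of each JSON object vary by file, as described on the dataset page:

    ```python
    import gzip
    import json

    # Placeholder file name; substitute any jsonl.gz file from the repository.
    path = "some_training_file.jsonl.gz"

    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)  # one training example per line
            print(example)
            break
    ```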

  2. investopedia-embedding-dataset

    • huggingface.co
    Updated Apr 30, 2024
    + more versions
    Cite
    FinLang (2024). investopedia-embedding-dataset [Dataset]. https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 30, 2024
    Dataset authored and provided by
    FinLang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for investopedia-embedding dataset

    We curate a dataset of substantial size pertaining to finance from Investopedia, using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.

      Dataset Description… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset.
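
    For reference, a minimal sketch of loading this dataset with the Hugging Face Datasets library; the split and column names are assumptions and should be checked against the dataset page:

    ```python
    from datasets import load_dataset

    # The repository ID comes from the dataset URL above; the split name is an assumption.
    ds = load_dataset("FinLang/investopedia-embedding-dataset", split="train")

    print(ds)      # shows the available columns and number of rows
    print(ds[0])   # inspect one question-answer example
    ```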
    
  3. Datasets and configuration files for EmbDI: Embeddings for Data Integration

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2023
    Cite
    Cappuzzo, Riccardo (2023). Datasets and configuration files for EmbDI: Embeddings for Data Integration [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7930460
    Explore at:
    Dataset updated
    May 13, 2023
    Dataset provided by
    Thirumuruganathan, Saravanan
    Papotti, Paolo
    Cappuzzo, Riccardo
    License

    Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0

    Description

    License

      Copyright 2020 Riccardo CAPPUZZO
    
    
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
    
       http://www.apache.org/licenses/LICENSE-2.0
    
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
    

    EmbDI datasets

    The datasets contained in this directory were used while working with EmbDI on the relevant paper. Please refer to the full repository for more info.

    What is provided here was sourced mostly from The Magellan Data Repository. For each dataset, three tables are provided: table-A and table-B are taken from the original repository and slightly modified (lowercased, spaces replaced by _, some special characters removed), while the third table is the concatenation of tables A and B.

    Edgelists

    Edgelists are the data structures used by EmbDI. They are generated starting from each concatenated dataset and are then fed to the algorithm.

    EQ tests

    The EQ tests folder contains all the tests used to perform the Embeddings Quality evaluation in the paper.

    The additional resources include:

    • (partially preprocessed) base datasets.

    • Their Entity Resolution and Schema Matching versions.

    • Edgelists for both ER and SM versions.

    • Ground truth files for ER and SM tasks.

    • Test directories for the EQ task.

    • Copies of the configuration files provided in this repository.

    Configuration files, info files and ER match files were left in this repository in pipeline/config_files/default, pipeline/info and pipeline/matches/default.

  4. wikipedia-22-12-simple-embeddings

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    + more versions
    Cite
    Cohere (2023). wikipedia-22-12-simple-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2023
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings
    

    We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
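
    As an illustration, a minimal sketch of streaming this dataset and reading one record; the field names (e.g. "title", "text", "emb") are assumptions and should be verified against the dataset schema:

    ```python
    from datasets import load_dataset
    import numpy as np

    # Stream the dataset so the full set of vectors is not downloaded at once.
    docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

    for doc in docs:
        # "title", "text", and "emb" are assumed field names; check the dataset page.
        vector = np.asarray(doc["emb"], dtype=np.float32)
        print(doc["title"], vector.shape)
        break
    ```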

  5. ckanext-embeddings

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-embeddings [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-embeddings
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The Embeddings extension enhances CKAN's search and discovery capabilities by leveraging machine learning embeddings. It encodes dataset metadata into numerical vectors, allowing for similarity-based comparisons between datasets and enabling semantic search functionality. This approach goes beyond simple keyword matching to consider the meaning and context of dataset information, ultimately improving data discoverability for users.

    Key Features:

    • Similar Dataset Recommendations: Computes and ranks dataset embeddings against a selected dataset, returning the most semantically similar datasets. The number of returned datasets is configurable.

    • Semantic Search: Ranks dataset embeddings against a user-provided query term to find datasets that are semantically similar to the search query. This provides a "Dense Vector Search" capability within the Solr search engine.

    • Configurable Embedding Backends: Supports multiple backends for generating embeddings, including a local Sentence Transformers model and OpenAI's Embeddings API. Users can also implement custom backends.

    • Customizable Solr Integration: The extension requires a custom Solr schema with a Dense Vector Search field to support semantic search and allows configuration of a different vector field to test out different models.

    • Pluggable Embedding Model: Supports the use of different embedding models; however, the model dimensions need to match the Solr field dimensions.

    Technical Integration: The Embeddings extension integrates with CKAN through a plugin architecture, adding a package_similar_show action to the CKAN API that returns similar datasets. It also modifies the package_search action to offer semantic search capabilities. The extension requires configuration settings in the CKAN ini file to specify the embedding backend, API keys (if applicable), and the Solr vector field name. A custom Solr schema, provided via a Dockerfile, is needed to enable the Dense Vector Search functionality. The plugin also enables the use of a custom query parser in Solr.

    Benefits & Impact: The Embeddings extension improves data discovery within CKAN by providing a more sophisticated search mechanism that considers the semantic meaning of dataset metadata. This leads users to more relevant datasets, even if their search terms don't explicitly match keywords in the metadata. The ability to use different embedding backends and customize the Solr integration provides flexibility across different use cases and environments for the CKAN installation.

  6. daily-arxiv-embeddings

    • kaggle.com
    Updated May 9, 2025
    Cite
    ORX AI (2025). daily-arxiv-embeddings [Dataset]. https://www.kaggle.com/datasets/orxaicom/daily-arxiv-embeddings/versions/449
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ORX AI
    License

    CC0 1.0, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This CSV dataset contains the Title, Abstract, ... and Embeddings of today's arXiv papers. It is updated every day (there are no new papers on arXiv on Saturdays, Sundays, and holidays). The notebook that calculates the embeddings is here. You can find the complete code to reproduce this dataset on our GitHub: https://github.com/orxaicom/daily-arxiv-embeddings. We use this to visualize the arXiv papers every day; check it out on our website: https://www.orxai.com
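
    As a quick-start sketch, the CSV could be loaded with pandas; the file name and column names are assumptions, and embeddings stored as stringified lists would need to be parsed:

    ```python
    import ast
    import pandas as pd

    # Placeholder local file name; download the CSV from the Kaggle dataset first.
    df = pd.read_csv("daily_arxiv_embeddings.csv")
    print(df.columns)  # expect columns such as Title, Abstract, and an embedding column

    # If the embedding column holds stringified lists, parse them into Python lists.
    # The column name "Embeddings" is an assumption; adjust to the actual header.
    # df["Embeddings"] = df["Embeddings"].apply(ast.literal_eval)
    ```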

  7. Open Australian Legal Embeddings

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Umar Butler (2023). Open Australian Legal Embeddings [Dataset]. https://www.kaggle.com/datasets/umarbutler/open-australian-legal-embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle
    Authors
    Umar Butler
    Area covered
    Australia
    Description

    Open Australian Legal Embeddings ‍⚖️

    The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.

    Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.

    The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.

    To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.

    Usage 👩‍💻

    The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

    ```python
    import itertools
    import sklearn.metrics.pairwise

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    instruction = 'Represent this sentence for searching relevant passages: '

    # Set streaming to False if you wish to load the entire dataset into memory
    # (unadvised unless you have at least 64 GB of RAM).
    oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)

    # Sample the first 100,000 embeddings.
    sample = list(itertools.islice(oale, 100000))

    # Embed a query.
    query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

    # Identify the most similar embedding to the query.
    similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
    most_similar_index = similarities.argmax()
    most_similar = sample[most_similar_index]

    # Print the most similar text.
    print(most_similar['text'])
    ```

    To speed up the loading of the Embeddings, you may wish to install orjson.

    Structure 🗂️

    The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.

    The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).

    Creation 🧪

    All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512-tokens-long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:

    ```
    Title: {title}
    Jurisdiction: {jurisdiction}
    Type: {type}
    {text}
    ```

    The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

    The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.

    The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...

  8. Chinese word embedding

    • kaggle.com
    zip
    Updated Jun 4, 2019
    + more versions
    Cite
    gui_yihan (2019). Chinese word embedding [Dataset]. https://www.kaggle.com/guiyihan/chinese-word-embedding
    Explore at:
    zip (0 bytes). Available download formats
    Dataset updated
    Jun 4, 2019
    Authors
    gui_yihan
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Context

    Tencent AI Lab Embedding Corpus for Chinese Words and Phrases

    This corpus provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, which are pre-trained on large-scale high-quality data. These vectors, capturing semantic meanings for Chinese words and phrases, can be widely applied in many downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
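
    As a usage sketch, such vectors can typically be loaded with Gensim; the file name below is an assumption (use the actual file shipped with the corpus), and the word2vec text format is assumed:

    ```python
    from gensim.models import KeyedVectors

    # Assumed file name and format (word2vec text format: header line with vocab size and dimension).
    wv = KeyedVectors.load_word2vec_format("Tencent_AILab_ChineseEmbedding.txt", binary=False)

    print(wv.vector_size)                   # expected to be 200
    print(wv.most_similar("北京", topn=5))  # nearest neighbours, if the word is in the vocabulary
    ```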

    Acknowledgements

    https://ai.tencent.com/ailab/nlp/embedding.html

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  9. Data from: Synthetic Multimodal Dataset for Daily Life Activities

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 29, 2024
    + more versions
    Cite
    Kawamura, Takahiro (2024). Synthetic Multimodal Dataset for Daily Life Activities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8046266
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    Fukuda, Ken
    Ugai, Takanori
    Kawamura, Takahiro
    Egami, Shusaku
    Swe Nwe Nwe Htun
    Kozaki, Kouji
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline

    This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI). It includes:

    Video data that simulates daily life actions in a virtual space, generated from the Scenario Data.

    Knowledge graphs and transcriptions of the Video Data content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).

    Knowledge Graph Embedding Data created for machine-learning-based reasoning.

    This data is open to the public as open data.

    Details

    Videos

    mp4 format

    203 action scenarios

    For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and a fixed camera view placed in each corner of the room (file names ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes), for a total of 1,218 videos.

    Videos with slowly moving characters simulate the movements of elderly people.

    Knowledge Graphs

    RDF format

    203 knowledge graphs corresponding to the videos

    Includes schema and location supplement information

    The schema is described below

    SPARQL endpoints and query examples are available

    Script Data

    txt format

    Data provided to VirtualHome2KG to generate videos and knowledge graphs

    Includes the action title and a brief description in text format.

    Embedding

    Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)

    Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).

    Specification of Ontology

    Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm

    Related Resources

    KGRC4SI Final Presentations with automatic English subtitles (YouTube)

    VirtualHome2KG (Software)

    VirtualHome-AIST (Unity)

    VirtualHome-AIST (Python API)

    Visualization Tool (Software)

    Script Editor (Software)

  10. DPR-ANN Dataset

    • paperswithcode.com
    Updated Jul 24, 2023
    Cite
    Cecilia Aguerrebere; Ishwar Bhati; Mark Hildebrand; Mariano Tepper; Ted Willke (2023). DPR-ANN Dataset [Dataset]. https://paperswithcode.com/dataset/dpr-ann
    Explore at:
    Dataset updated
    Jul 24, 2023
    Authors
    Cecilia Aguerrebere; Ishwar Bhati; Mark Hildebrand; Mariano Tepper; Ted Willke
    Description

    We provide the code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models. With the dense passage retriever (DPR) [1], we encode text snippets from the C4 dataset [2] to generate 768-dimensional vectors:

    context DPR embeddings for the base set and question DPR embeddings for the query set.

    The metric for similarity search is inner product [1].

    The number of base and query embedding vectors is parametrizable.

    See the main repository for details on how to generate the DPR10M specific instance introduced in [3].
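
    As an illustration of how such vectors could be produced, a minimal sketch using the DPR encoders available in the Hugging Face transformers library; the checkpoints and text are illustrative, and the repository's own generation scripts may differ:

    ```python
    import torch
    from transformers import (
        DPRContextEncoder, DPRContextEncoderTokenizer,
        DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    )

    ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

    passage = "The C4 dataset is a large cleaned web-crawl corpus."  # illustrative snippet
    question = "What is the C4 dataset?"                             # illustrative query

    with torch.no_grad():
        p_emb = ctx_enc(**ctx_tok(passage, return_tensors="pt")).pooler_output   # 768-d base vector
        q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output      # 768-d query vector

    score = (p_emb @ q_emb.T).item()  # inner-product similarity, the metric used for this dataset
    print(score)
    ```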

    [1] Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)

    [2] Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: The Journal of Machine Learning Research 21,140:1–140:67.(2020)

    [3] Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed indices. In: Proceedings of the VLDB Endowment, 16, 11 (2023)

  11. Data from: Ekspress news article archive (in Estonian and Russian) 1.0

    • live.european-language-grid.eu
    binary format
    Updated Apr 18, 2021
    Cite
    (2021). Ekspress news article archive (in Estonian and Russian) 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8373
    Explore at:
    binary format. Available download formats
    Dataset updated
    Apr 18, 2021
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles) with some in Russian (325,952 articles). Keywords are included for articles after 2015.

    The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets - please see README files inside those zip files.

    The main archive contains JSON files of all the Estonian articles from the year 2009 to May 2019. These datasets are intended for use in EMBEDDIA, an H2020 project. Articles are in Estonian, with some in Russian.

    The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets (please see README files inside those zip files); in short:

    - ee_articles_2015-2019: This dataset contains Estonian and Russian articles (5 years), with the tags that were missing in the previous versions.

    - the files ee_articles_2015_2019_lemmatized and ee_articles_2009_2014_lemmatized are the files preprocessed by TEXTA (contact linda@texta.ee)

    - in file eeandsttarticlelemmasembeddingsand_usage you can find w2v embeddings trained by TEXTA (contact linda@texta.ee)

    Description of the Main Dataset (ee_articles_2009_2019)

    There are 12 JSON files:

    articles_2009_ver2.json contains 161394 articles from the year 2009

    articles_2010_ver2.json contains 151033 articles from the year 2010

    articles_2011_ver2.json contains 168273 articles from the year 2011

    articles_2012_ver2.json contains 152772 articles from the year 2012

    articles_2013_ver2.json contains 141012 articles from the year 2013

    articles_2014_ver2.json contains 128388 articles from the year 2014

    articles_2015_ver2.json contains 127425 articles from the year 2015

    articles_2016_ver2.json contains 130704 articles from the year 2016

    articles_2017_ver2.json contains 119318 articles from the year 2017

    articles_2018_ver2.json contains 117388 articles from the year 2018

    articles_2019_Jan-Apr_ver2.json contains 35076 articles from the year 2019 January to April

    articles_2019_May_ver2.json contains 8329 articles from the year 2019 May

    In sum: 1 441 112 articles

    Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:

    id (integer) - the ID of the article

    title (string) - the title of the article

    lead (string) - the lead of the article (can contain HTML, e.g. tag)

    url (string) - the URL of the article

    tags (list of dictionaries or None) [1]: each dictionary represents one tag. The tag dictionary contains the following:

    domain_id (string) [2] - the ID of the domain

    id (string) - the ID of the tag

    lang (string) - the language of the tag

    tag (string) - the tag itself, e.g. Kert Kingo (a name)

    translitted_name (string) - a modified version of the tag, e.g. kert-kingo

    rawBody (string) - the raw text of the article (contains HTML)

    bodyText (string) - clean article text (stripped from HTML)

    publishDate (string) - published date & time of the article

    categoryPrimary (dictionary or empty list) - the dictionary contains the following information:

    categoryId (integer) - the ID of the category

    categoryName (string)- the name of the category (e.g. World)

    channelId (integer) - the ID of the channel

    OR

    articleId (integer) - the ID of the article

    categoryId (integer) - the ID of the category

    categoryName (string)- the name of the category (e.g. World)

    categoryPrimary (boolean) - unknown

    categorySort (integer) - unknown

    categoryUrl (string) - the URL of the category

    categoryVisible (boolean) - unknown

    channelId (integer) - the ID of the channel

    channelUrl (string) - the URL of the channel (e.g. 'https://sport.delfi.ee')

    directoryName (string) - unknown

    parentId (integer) - unknown

    channelLanguage (string or None) [3] - the language of the channel

    categoryLanguage (int or None) [4] - unknown

    commentCount (int) [5] - the number of comments

    relatedArticles (list of integers) - a list of related articles' ID's
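
    Given the structure above, a minimal loading sketch; the file name is one of the JSON files listed earlier, and the accessed fields follow the field descriptions above:

    ```python
    import json

    # One of the JSON files from the main archive; each file is a list of article dictionaries.
    with open("articles_2009_ver2.json", encoding="utf-8") as f:
        articles = json.load(f)

    print(len(articles))
    article = articles[0]
    print(article["id"], article["title"], article["publishDate"])

    # tags may be None, so guard before iterating.
    for tag in article.get("tags") or []:
        print(tag["lang"], tag["tag"])
    ```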

  12. Data from: Exploiting hierarchy in medical concept embedding

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Oct 27, 2021
    Cite
    Anthony Finch; Alexander Crowell; Mamta Bhatia; Pooja Parameshwarappa; Yung-Chieh Chang; Jose Martinez; Michael Horberg (2021). Exploiting hierarchy in medical concept embedding [Dataset]. http://doi.org/10.5061/dryad.v9s4mw6v0
    Explore at:
    zip. Available download formats
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Mid-Atlantic Permanente Research Institute
    Mid-Atlantic Permanente Medical Group
    Authors
    Anthony Finch; Alexander Crowell; Mamta Bhatia; Pooja Parameshwarappa; Yung-Chieh Chang; Jose Martinez; Michael Horberg
    License

    CC0 1.0, https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective

    To construct and publicly release a set of medical concept embeddings for codes following the ICD-10 coding standard which explicitly incorporate hierarchical information from medical codes into the embedding formulation.

    Materials and Methods

    We trained concept embeddings using several new extensions to the Word2Vec algorithm using a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classification Software Revised (CCSR) dataset. We compare these results to sets of publicly-released pretrained embeddings and alternative training methodologies.

    Results

    We found that Word2Vec models which included hierarchical data outperformed ordinary Word2Vec alternatives on tasks which compared naïve clusters to canonical ones provided by CCSR. Our Skip-Gram model with both codes and categories achieved 61.4% Normalized Mutual Information with canonical labels in comparison to 57.5% with traditional Skip-Gram. In models operating on two different outcomes we found that including hierarchical embedding data improved classification performance 96.2% of the time. When controlling for all other variables, we found that co-training embeddings improved classification performance 66.7% of the time. We found that all models outperformed our competitive benchmarks.

    Discussion

    We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.

    Methods

    This dataset includes trained medical concept embeddings for 5428 ICD-10 codes and 394 Clinical Classification Software (Revised) (CCSR) categories. We include several different sets of concept embeddings, each trained using a slightly different set of hyperparameters and algorithms.

    To train our models, we employed data from the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system. KPMAS is an integrated medical system serving approximately 780,000 members in Maryland, Virginia, and the District of Columbia. KPMAS has a comprehensive Electronic Medical Record system which includes data from all patient interactions with primary or specialty caregivers, from which all data is derived. Our embeddings training set included diagnoses allocated to all adult patients in calendar year 2019.

    For each code, we also recovered an associated category, as assigned by the Clinical Classification Software (Revised).

    We trained 12 sets of embeddings using classical Word2Vec models with settings differing across three parameters. Our first parameter was the selection of training algorithm, where we trained both CBOW and SG models. Each model was trained using dimension k of 10, 50, and 100. Furthermore, each model-dimension combination was trained with categories and codes trained separately and together (referred to hereafter as ‘co-trained embeddings’ or ‘co-embeddings’). Each model was trained for 10 iterations. We employed an arbitrarily large context window (100), since all codes necessarily occurred within a short period (1 year).

    We also trained a set of validation embeddings only on ICD-10 codes using the Med2Vec architecture as a comparison. We trained the Med2Vec model on our data using its default settings, including the default vector size (200) and a training regime of 10 epochs. We grouped all codes occurring on the same calendar date as Med2Vec ‘visits.’ Our Med2Vec model benchmark did not include categorical entities or other novel innovations.

    Word2Vec embeddings were generated using the GenSim package in Python. Med2Vec embeddings were generated using the Med2Vec code published by Choi. The JSON files used in this repository were generated using the JSON package in Python.
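
    For illustration, a minimal Gensim sketch mirroring the training settings described above (Skip-Gram, dimension 100, window 100, 10 epochs); the patient sequences and codes below are placeholders, not data from this study:

    ```python
    from gensim.models import Word2Vec

    # Toy "sentences": each patient's diagnoses as a sequence of ICD-10 codes plus,
    # for the co-trained variant, CCSR-style category tokens. All values are placeholders.
    patients = [
        ["I10", "CIR007", "E11.9", "END002"],
        ["J45.909", "RSP009", "I10", "CIR007"],
    ]

    # Skip-Gram (sg=1), dimension 100, an arbitrarily large context window (100), 10 epochs.
    model = Word2Vec(sentences=patients, vector_size=100, window=100, sg=1, min_count=1, epochs=10)

    print(model.wv["I10"].shape)  # a 100-dimensional concept embedding
    ```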

  13. Datasets from "Electrostatic Embedding of Machine Learning Potentials"

    • zenodo.org
    • explore.openaire.eu
    application/gzip, bin
    Updated Sep 6, 2022
    + more versions
    Cite
    Kirill Zinovjev (2022). Datasets from "Electrostatic Embedding of Machine Learning Potentials" [Dataset]. http://doi.org/10.5281/zenodo.7051785
    Explore at:
    application/gzip, bin. Available download formats
    Dataset updated
    Sep 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kirill Zinovjev
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data required to reproduce results in "Electrostatic Embedding of Machine Learning Potentials" article. See https://github.com/emedio/embedding for details.

    • QM7_B3LYP_cc-pVTZ.tgz - outputs of single-point B3LYP/cc-pVTZ calculations of structures in the QM7 dataset with ORCA 5. Includes molecular dipolar polarizabilities.
    • QM7_B3LYP_cc-pVTZ_horton.tgz - MBIS partitioning of the B3LYP/cc-pVTZ densities with Horton 2.1.0.
    • mpro_xyz.tgz - coordinates of the ligand and surrounding point charges from 100 snapshots of the SARS-CoV-2 Mpro complex with PF-00835231.
    • mpro_*.tgz - DFT and semiempirical single-point calculations with ORCA 5 for the coordinates from mpro_xyz.tgz, in vacuo and in the presence of point charges.
    • mlmm.mat - learned parameters and SOAP feature vectors of reference atomic environments
  14. arXiv embeddings

    • kaggle.com
    Updated Jul 6, 2025
    Cite
    August Wester (2025). arXiv embeddings [Dataset]. https://www.kaggle.com/datasets/awester/arxiv-embeddings/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    August Wester
    License

    CC0 1.0, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    These are the embeddings powering searchthearxiv.com, a semantic search engine for more than a decade's worth of ML papers published on arXiv. The embeddings are created by running OpenAI's text-embedding-ada-002 model on an "augmented" abstract for each paper (title+authors+year+abstract). The papers are sourced from the metadataset published by Cornell University and filtered to include only papers belonging to at least one of the following categories:

    cs.cv, cs.lg, cs.cl, cs.ai, cs.ne, and cs.ro

    The dataset is updated on a weekly basis (in lockstep with the official arXiv metadataset).

    If you find the dataset and/or searchthearxiv.com useful, consider giving the repo a ⭐️ on GitHub.
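
    As a sketch of how these embeddings could be searched locally, cosine similarity over the vector matrix is enough; the arrays below are placeholders, and 1536 is the output dimension of text-embedding-ada-002:

    ```python
    import numpy as np

    def cosine_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
        # Cosine similarity between one query vector and a matrix of embeddings.
        query = query / np.linalg.norm(query)
        matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        return matrix @ query

    # Placeholders: load the real vectors from the dataset, and embed the search
    # query with the same model (text-embedding-ada-002) before comparing.
    emb_matrix = np.random.rand(1000, 1536).astype(np.float32)
    query_vec = np.random.rand(1536).astype(np.float32)

    top10 = np.argsort(-cosine_scores(query_vec, emb_matrix))[:10]  # ten most similar papers
    print(top10)
    ```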

  15. Supporting information for Neural Network Embeddings based Similarity Search...

    • kilthub.cmu.edu
    txt
    Updated Jun 3, 2022
    Cite
    Yilin Yang; Mingjie Liu; John Kitchin (2022). Supporting information for Neural Network Embeddings based Similarity Search Method for Catalyst Systems [Dataset]. http://doi.org/10.1184/R1/19968323.v1
    Explore at:
    txt. Available download formats
    Dataset updated
    Jun 3, 2022
    Dataset provided by
    Carnegie Mellon University
    Authors
    Yilin Yang; Mingjie Liu; John Kitchin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we include code to prepare the dataset, train the GemNet model, build the faiss index, search the faiss index, and visualize the search results in the notebook faiss-gemnet-qm9-mp.ipynb. It reproduces our examples in the manuscript for the QM9 and the Materials Project datasets. For the OC20 dataset, we did not include its related data here because of its large size (> 50 GB); the code to process the OC20 dataset is almost the same as the code included in the notebook for the QM9 dataset.

    We include the intermediate data (GemNet checkpoints, lmdb, faiss index, and the searched results) for the QM9 and the Materials Project datasets in the directory example-data. We also put the GemNet checkpoint for the OC20 dataset in this directory. The training and evaluation of the Gaussian process regression model using the searched molecules for the query Benzene are demonstrated in the ben-gp-data directory, in which qm9-gp-gemnet-morgan-random-nrg.ipynb can be run on Colab.

  16. Bangla_Problem_Embeddings

    • kaggle.com
    Updated Oct 31, 2024
    Cite
    Ali Asif Khan (2024). Bangla_Problem_Embeddings [Dataset]. https://www.kaggle.com/datasets/aliasifkhan131/bangla-problem-embeddings/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Asif Khan
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We gathered a Bangla math dataset with solutions and created embeddings. It can be used to find similar problems for RAG or few-shot prompting simply by loading it and running a faiss index search. The dataset used to create the embeddings contains 126,675 problems. Here's the dataset link: https://www.kaggle.com/datasets/sourav2083/math-bangla/data
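
    As a sketch of that workflow, a faiss index can be built over the pre-computed vectors and queried for nearest neighbours; the array shapes, dimension, and data loading below are placeholders to be replaced with the actual dataset contents:

    ```python
    import faiss
    import numpy as np

    # Placeholder: load the (n_problems, d) float32 embedding matrix from the dataset.
    d = 768
    embeddings = np.random.rand(126675, d).astype(np.float32)

    index = faiss.IndexFlatIP(d)    # inner-product index; IndexFlatL2 would use Euclidean distance
    faiss.normalize_L2(embeddings)  # normalise so inner product equals cosine similarity
    index.add(embeddings)

    query = np.random.rand(1, d).astype(np.float32)  # placeholder query embedding
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)  # five most similar problems for few-shot prompting
    print(ids, scores)
    ```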

  17. Data from: DBpedia RDF2Vec Graph Embeddings

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Mar 25, 2022
    + more versions
    Cite
    Christensen, Martin Pekár (2022). DBpedia RDF2Vec Graph Embeddings [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6376305
    Explore at:
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Lissandrini, Matteo
    Christensen, Martin Pekár
    Hose, Katja
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DBpedia graph embeddings using RDF2Vec. RDF2Vec embedding generation code can be found here and is based on a publication by Portisch et al. [1].

    The embeddings dataset consists of 200-dimensional vectors of DBpedia entities (from 1/9/2021).

    Generating Embeddings

    The code for generating these embeddings can be found here.

    Run the run.sh script, which wraps all the necessary commands to generate embeddings:

    bash run.sh

    The script downloads a set of DBpedia files, which are listed in dbpedia_files.txt. It then builds a Docker image and runs a container of that image that generates the embeddings for the DBpedia graph defined by the DBpedia files.

    A folder files is created containing all the downloaded DBpedia files, and a folder embeddings/dbpedia is created containing the embeddings in vectors.txt along with a set of random walk files.
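
    A minimal parsing sketch, assuming each line of vectors.txt is an entity identifier followed by its 200 floating-point components separated by whitespace (verify the actual layout before relying on this):

    ```python
    import numpy as np

    vectors = {}
    with open("embeddings/dbpedia/vectors.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 201:
                continue  # skip header or malformed lines
            entity, values = parts[0], parts[1:]
            vectors[entity] = np.asarray(values, dtype=np.float32)

    print(len(vectors))
    ```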

    Run Time of Embeddings Generation

    Generating embeddings can take more than a day, but it depends on the number of DBpedia files chosen to be downloaded. Following are some basic run time statistics when embeddings are generated on a 64 GB RAM, 8 cores (AMD EPYC), 1 TB SSD, 1996.221 MHz machine.

    Total: 1 day, 8 hours, 52 minutes, 41 seconds

    Walk generation: 0 days, 7 hours, 24 minutes, 36 seconds

    Training: 1 day, 1 hour, 28 minutes, 5 seconds

    Parameters Used

    The parameters used to generate the embeddings provided here are listed below:

    Number of walks per entity: 100

    Depth (hops) per walk: 4

    Walk generation mode: RANDOM_WALKS_DUPLICATE_FREE

    Threads: # of processors / 2

    Training mode: sg

    Embeddings vector dimension: 200

    Minimum word2vec word count: 1

    Sample rate: 0.0

    Training window size: 5

    Training epochs: 5

  18. Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...

    • zenodo.org
    zip
    Updated Nov 29, 2023
    Cite
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
    Explore at:
    zip. Available download formats
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real world. But they also pose nontrivial challenges in research on embedding models for knowledge graph completion, especially when models are developed and evaluated without regard to these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

    Dataset Details

    The dataset consists of four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:

    • Subject matter triples file
      • fb+/-CVT+/-REV: One folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt. Subject matter triples are the triples belonging to subject matter domains, i.e. domains describing real-world facts.
        • Example of a row in train.txt, valid.txt, and test.txt:
          • 2, 192, 0
        • Example of a row in entity2id.txt:
          • /g/112yfy2xr, 2
        • Example of a row in relation2id.txt:
          • /music/album/release_type, 192
        • Explanation
          • "/g/112yfy2xr" and "/m/02lx2r" are the MID of the subject entity and object entity, respectively. "/music/album/release_type" is the realtionship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to the objects.
    • Type system file
      • freebase_endtypes: Each row maps an edge type to its required subject type and object type.
        • Example
          • 92, 47178872, 90
        • Explanation
          • "92" and "90" are the type id of the subject and object which has the relationship id "47178872".
    • Metadata files
      • object_types: Each row maps the MID of a Freebase object to a type it belongs to.
        • Example
          • /g/11b41c22g, /type/object/type, /people/person
        • Explanation
          • The entity with MID "/g/11b41c22g" has a type "/people/person"
      • object_names: Each row maps the MID of a Freebase object to its textual label.
        • Example
          • /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
        • Explanation
          • The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
      • object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
        • Example
          • /m/05v3y9r, /type/object/id, "/music/live_album/concert"
        • Explanation
          • The entity with MID "/m/05v3y9r" can be interpreted by human as a music concert live album.
      • domains_id_label: Each row maps the MID of a Freebase domain to its label.
        • Example
          • /m/05v4pmy, geology, 77
        • Explanation
          • The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
      • types_id_label: Each row maps the MID of a Freebase type to its label.
        • Example
          • /m/01xljxh, /government/political_party, 147
        • Explanation
          • The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
      • entities_id_label: Each row maps the MID of a Freebase entity to its label.
        • Example
          • /g/11b78qtr5m, Viroliano Tries Jazz, 2234
        • Explanation
          • The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
      • properties_id_label: Each row maps the MID of a Freebase property to its label.
        • Example
          • /m/010h8tp2, /comedy/comedy_group/members, 47178867
        • Explanation
          • The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has label "/comedy/comedy_group/members" and has id "47178867" in our dataset.
      • uri_original2simplified and uri_simplified2original: The mappings between original URIs and simplified URIs, and between simplified URIs and original URIs, respectively.
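
    As an illustration of how the comma-separated files above could be read and cross-referenced, a minimal sketch; the folder name is one example variant, and the head/relation/tail column order is inferred from the examples above:

    ```python
    # Read an id->name mapping file such as entity2id.txt or relation2id.txt.
    def read_mapping(path):
        mapping = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                name, idx = [part.strip() for part in line.rsplit(",", 1)]
                mapping[int(idx)] = name
        return mapping

    entities = read_mapping("fb-CVT-REV/entity2id.txt")     # id -> MID
    relations = read_mapping("fb-CVT-REV/relation2id.txt")  # id -> relation label

    with open("fb-CVT-REV/train.txt", encoding="utf-8") as f:
        for line in f:
            h, r, t = [int(x.strip()) for x in line.split(",")]  # assumed head, relation, tail order
            print(entities[h], relations[r], entities[t])
            break
    ```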

  19. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Jun 4, 2024
    + more versions
    Cite
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315
    Explore at:
    application/gzip, bin, txt. Available download formats
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha
    Time period covered
    May 29, 2024
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    • Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.
    • Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set contains 76 distinct target companies, each with 5.3 annotated competitors on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.
    • Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.
    • Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).
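
    As a sketch of how the SP task could be scored with the released node embeddings, cosine similarity between the two companies' vectors with a tuned threshold is a simple baseline; the vectors below are placeholders, and loading the real ones depends on the CompanyKG tooling:

    ```python
    import numpy as np

    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Placeholders for the embeddings of one labeled SP pair (e.g. mSBERT, ADA2,
    # SimCSE, or PAUSE vectors shipped with the dataset).
    company_a = np.random.rand(768).astype(np.float32)
    company_b = np.random.rand(768).astype(np.float32)

    score = cosine(company_a, company_b)
    prediction = int(score > 0.5)  # illustrative threshold; tune it on held-out pairs
    print(score, prediction)
    ```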

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

    However, a graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns that facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial:
    https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  20. Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 2, 2020
    Cite
    Koutra, Danai (2020). Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with Calibration [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3738263
    Explore at:
    Dataset updated
    Apr 2, 2020
    Dataset provided by
    Meij, Edgar
    Koutra, Danai
    Safavi, Tara
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.

    Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.

    wikidata-authors

    This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.

    The files are as follows:

    entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:

    eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.

    label: A human-readable label of this entity (extracted from Wikidata).

    relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:

    rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.

    label: A human-readable label of this relation (extracted from Wikidata).

    triples.tsv: A tab-separated file of all triples in the dataset, in the form <head entity>, <relation>, <tail entity>.

    fb15krr-linked

    This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.

    The files are as follows:

    entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:

    mid: The Freebase machine ID (MID) of this entity.

    wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.

    label: A human-readable label of this entity (extracted from Wikidata).

    types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].

    relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:

    label: The hierarchical Freebase label of this relation.

    triples.tsv: A tab-separated file of all triples in the dataset, in the form <head entity>, <relation>, <tail entity>.
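
    As a sketch of the recommended workflow, the triples can be loaded from the TSV and split 80/10/10 at random; the column names are taken from the description above, the file path is illustrative, and triples.tsv is assumed to have no header row:

    ```python
    import pandas as pd

    # Illustrative path; point this at the triples.tsv of either dataset.
    triples = pd.read_csv("wikidata-authors/triples.tsv", sep="\t", header=None,
                          names=["head", "relation", "tail"])

    # One random 80/10/10 train/validation/test split, as recommended above.
    shuffled = triples.sample(frac=1.0, random_state=0).reset_index(drop=True)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.8 * n)]
    valid = shuffled.iloc[int(0.8 * n): int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n):]

    print(len(train), len(valid), len(test))
    ```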
