Training Data for Text Embedding Models
This repository contains raw datasets, all of which have also been formatted for easy training in the Embedding Model Datasets collection. We recommend looking there first.
This repository contains training files to train text embedding models, e.g. using sentence-transformers.
Data Format
All files are in a jsonl.gz format: each line contains a JSON object that represents one training example. The JSON objects can come in… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/embedding-training-data.
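As a minimal sketch of consuming this format (the file name below is a placeholder, not a file from this repository):
```python
import gzip
import json

# Stream training examples from one of the jsonl.gz files.
with gzip.open("some_file.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)  # one training example per line
        print(example)
        break
```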
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for investopedia-embedding dataset
We curate a dataset of substantial size pertaining to finance from Investopedia using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.
Dataset Description… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset.
http://www.apache.org/licenses/LICENSE-2.0
Copyright 2020 Riccardo CAPPUZZO
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The datasets contained in this directory were used while working with EmbDI on the relevant paper. Please refer to the full repository for more info.
What is provided here was sourced mostly from The Magellan Data Repository. For each dataset, three tables are provided: table-A and table-B are taken from the original repository and slightly modified (lower casing, spaces replaced by _, some special characters removed), while the third table is the concatenation of tables A and B.
Edgelists are the data structures used by EmbDI. They are generated starting from each concatenated dataset and are then fed to the algorithm.
The EQ tests folder contains all the tests used to perform the Embeddings Quality evaluation in the paper.
The additional resources include:
(partially preprocessed) base datasets.
Their Entity Resolution and Schema Matching versions.
Edgelists for both ER and SM versions.
Ground truth files for ER and SM tasks.
Test directories for the EQ task.
Copies of the configuration files provided in this repository.
Configuration files, info files and ER match files were left in this repository in pipeline/config_files/default, pipeline/info and pipeline/matches/default.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
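As a quick illustration of how these pre-computed vectors can be used for search, the hedged sketch below streams a small sample of the dataset with the Hugging Face datasets library and ranks it by dot product; the column names ('emb', 'text') and the sample size are assumptions based on the dataset card, not guarantees.
```python
import numpy as np
from datasets import load_dataset

# Stream a small sample of the pre-computed embeddings (column names are assumptions).
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

texts, vectors = [], []
for row in docs.take(1000):
    texts.append(row["text"])
    vectors.append(row["emb"])
vectors = np.asarray(vectors, dtype=np.float32)

# A query embedded with the same multilingual-22-12 model (via the Cohere API) would be
# compared by dot product; here one of the document vectors stands in for a real query.
query = vectors[0]
scores = vectors @ query
print(texts[int(scores.argmax())][:200])
```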
The Embeddings extension enhances CKAN's search and discovery capabilities by leveraging machine learning embeddings. It encodes dataset metadata into numerical vectors, allowing for similarity-based comparisons between datasets and enabling semantic search functionality. This approach goes beyond simple keyword matching to consider the meaning and context of dataset information, ultimately improving data discoverability for users.
Key Features:
Similar Dataset Recommendations: Computes and ranks dataset embeddings against a selected dataset, returning the most semantically similar datasets. The number of returned datasets is configurable.
Semantic Search: Ranks dataset embeddings against a user-provided query term to find datasets that are semantically similar to the search query. This provides a "Dense Vector Search" capability within the Solr search engine.
Configurable Embedding Backends: Supports multiple backends for generating embeddings, including a local Sentence Transformers model and OpenAI's Embeddings API. Users can also implement custom backends.
Customizable Solr Integration: The extension requires a custom Solr schema with a Dense Vector Search field to support semantic search, and allows configuration of a different vector field to test out different models.
Pluggable Embedding Model: Supports the use of different embedding models; however, the model's dimensions need to match the Solr field's dimensions.
Technical Integration: The Embeddings extension integrates with CKAN through a plugin architecture, adding a package_similar_show action to the CKAN API that returns similar datasets. It also modifies the package_search action to offer semantic search capabilities. The extension requires configuration settings in the CKAN ini file to specify the embedding backend, API keys (if applicable), and Solr vector field name. A custom Solr schema, provided via a Dockerfile, is needed to enable the Dense Vector Search functionality. The plugin also enables the use of a custom query parser in Solr.
Benefits & Impact: The Embeddings extension improves data discovery within CKAN by providing a more sophisticated search mechanism that considers the semantic meaning of dataset metadata. This leads users to more relevant datasets, even if their search terms don't explicitly match keywords in the metadata. The ability to use different embedding backends and customize the Solr integration provides flexibility for different use cases and environments for the CKAN installation.
https://creativecommons.org/publicdomain/zero/1.0/
This CSV dataset contains the Title, Abstract,... and Embeddings of today's arXiv papers. It gets updated every day (there are no new papers on arXiv on Saturdays, Sundays, and holidays). The notebook that calculates the Embeddings is here. You can find the complete code to reproduce this dataset on our GitHub: https://github.com/orxaicom/daily-arxiv-embeddings We use this to visualize the arXiv papers every day; check it out on our website: https://www.orxai.com
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.
Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.
The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.
To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.
The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:
```python
import itertools
import sklearn.metrics.pairwise
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

# Set streaming to False if you wish to load the entire dataset into memory
# (unadvised unless you have at least 64 GB of RAM).
oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)
sample = list(itertools.islice(oale, 100000))

query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

print(most_similar['text'])
```
To speed up the loading of the Embeddings, you may wish to install orjson.
The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512 tokens long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:
```
Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
```
The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.
The resulting embeddings were serialised as JSON-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.
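The following is a minimal sketch of that serialisation step, not the repository's actual code; the output path and toy vectors are placeholders:
```python
import orjson

# Toy stand-ins for the 384-dimensional vectors; each is written as a
# JSON-encoded list of floats on its own line, mirroring data/embeddings.jsonl.
embeddings = [[0.01, -0.02, 0.03], [0.04, 0.05, -0.06]]

with open("embeddings.jsonl", "wb") as f:  # orjson.dumps returns bytes
    for vector in embeddings:
        f.write(orjson.dumps(vector))
        f.write(b"\n")
```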
The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
This corpus provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, which are pre-trained on large-scale high-quality data. These vectors, capturing semantic meanings for Chinese words and phrases, can be widely applied in many downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
https://ai.tencent.com/ailab/nlp/embedding.html
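A minimal sketch of loading the released vectors, assuming the corpus is distributed as a word2vec-style text file and has been downloaded locally (the file name below is a placeholder):
```python
from gensim.models import KeyedVectors

# "tencent_ailab_embedding.txt" is a placeholder for the unpacked corpus file.
vectors = KeyedVectors.load_word2vec_format("tencent_ailab_embedding.txt", binary=False)

print(vectors.vector_size)                       # expected: 200
print(vectors.most_similar("自然语言处理", topn=5))  # nearest neighbours of "natural language processing"
```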
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outline
This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI)
Video data that simulates daily life actions in a virtual space from Scenario Data.
Knowledge graphs, and transcriptions of the Video Data content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).
Knowledge Graph Embedding Data created for machine-learning-based reasoning
This data is open to the public as open data
Details
Videos
mp4 format
203 action scenarios
For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and a fixed camera view placed in each corner of the room (file name ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes). A total of 1,218 videos
Videos with slowly moving characters simulate the movements of elderly people.
Knowledge Graphs
RDF format
203 knowledge graphs corresponding to the videos
Includes schema and location supplement information
The schema is described below
SPARQL endpoints and query examples are available
Script Data
txt format
Data provided to VirtualHome2KG to generate videos and knowledge graphs
Includes the action title and a brief description in text format.
Embedding
Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)
Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).
Specification of Ontology
Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm
Related Resources
KGRC4SI Final Presentations with automatic English subtitles (YouTube)
VirtualHome2KG (Software)
VirtualHome-AIST (Unity)
VirtualHome-AIST (Python API)
Visualization Tool (Software)
Script Editor (Software)
We provide the code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models. With the dense passage retriever (DPR) [1], we encode text snippets from the C4 dataset [2] to generate 768-dimensional vectors:
context DPR embeddings for the base set and question DPR embeddings for the query set.
The metric for similarity search is inner product [1].
The number of base and query embedding vectors is parametrizable.
See the main repository for details on how to generate the DPR10M specific instance introduced in [3].
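For illustration, the hedged sketch below shows how such context and question vectors can be produced with the public DPR checkpoints on Hugging Face and compared by inner product; it is not the repository's generation code, and the checkpoint names and example texts are assumptions:
```python
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_encoder = DPRContextEncoder.from_pretrained(ctx_name)
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
q_encoder = DPRQuestionEncoder.from_pretrained(q_name)
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)

passages = ["Example C4 snippet about astronomy.", "Another snippet about cooking."]
ctx_inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
base_vectors = ctx_encoder(**ctx_inputs).pooler_output     # shape: (n_passages, 768)

q_inputs = q_tokenizer("What is astronomy?", return_tensors="pt")
query_vector = q_encoder(**q_inputs).pooler_output         # shape: (1, 768)

scores = query_vector @ base_vectors.T                     # inner-product similarity, as in [1]
print(scores)
```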
[1] Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.: Dense Passage Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781 (2020)
[2] Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: The Journal of Machine Learning Research 21, 140:1–140:67 (2020)
[3] Aguerrebere, C.; Bhati, I.; Hildebrand, M.; Tepper, M.; Willke, T.: Similarity Search in the Blink of an Eye with Compressed Indices. In: Proceedings of the VLDB Endowment 16, 11 (2023)
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles), with some in Russian (325,952 articles). Keywords are included for articles after 2015.
The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets - please see README files inside those zip files.
The main archive contains JSON files of all the Estonian articles from 2009 to May 2019. These datasets are intended for use in EMBEDDIA, an H2020 project. Articles are in Estonian, with some in Russian.
The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets (please see README files inside those zip files), in short:
- ee_articles_2015-2019: This dataset contains Estonian and Russian articles - 5 years, with tags, that were missing in the previous versions.
- files ee_articles_2015_2019_lemmatized and ee_articles_2009_2014_lemmatized are the files preprocessed by TEXTA (contact linda@texta.ee)
- in file ee_and_stt_article_lemmas_embeddings_and_usage you can find w2v embeddings trained by TEXTA (contact linda@texta.ee)
Description of the Main Dataset (ee_articles_2009_2019)
There are 12 JSON files:
articles_2009_ver2.json contains 161394 articles from the year 2009
articles_2010_ver2.json contains 151033 articles from the year 2010
articles_2011_ver2.json contains 168273 articles from the year 2011
articles_2012_ver2.json contains 152772 articles from the year 2012
articles_2013_ver2.json contains 141012 articles from the year 2013
articles_2014_ver2.json contains 128388 articles from the year 2014
articles_2015_ver2.json contains 127425 articles from the year 2015
articles_2016_ver2.json contains 130704 articles from the year 2016
articles_2017_ver2.json contains 119318 articles from the year 2017
articles_2018_ver2.json contains 117388 articles from the year 2018
articles_2019_Jan-Apr_ver2.json contains 35076 articles from the year 2019 January to April
articles_2019_May_ver2.json contains 8329 articles from the year 2019 May
In sum: 1 441 112 articles
Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
id (integer) - the ID of the article
title (string) - the title of the article
lead (string) - the lead of the article (can contain HTML, e.g. tag)
url (string) - the URL of the article
domain_id (string) [2] - the ID of the domain
id (string) - the ID of the tag
lang (string) - the language of the tag
tag (string) - the tag itself, e.g. Kert Kingo (a name)
translitted_name (string) - a modified version of the tag, e.g. kert-kingo
rawBody (string) - the raw text of the article (contains HTML)
bodyText (string) - clean article text (stripped from HTML)
publishDate (string) - published date & time of the article
categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
categoryId (integer) - the ID of the category
categoryName (string) - the name of the category (e.g. World)
channelId (integer) - the ID of the channel
articleId (integer) - the ID of the article
categoryId (integer) - the ID of the category
categoryName (string) - the name of the category (e.g. World)
categoryPrimary (boolean) - unknown
categorySort (integer) - unknown
categoryUrl (string) - the URL of the category
categoryVisible (boolean) - unknown
channelId (integer) - the ID of the channel
channelUrl (string) - the URL of the channel (e.g. 'https://sport.delfi.ee')
directoryName (string) - unknown
channelLanguage (string or None) [3] - the language of the channel
categoryLanguage (int or None) [4] - unknown
commentCount (int) [5] - the number of comments
relatedArticles (list of integers) - a list of related articles' ID's
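A minimal sketch of reading one of these JSON files and inspecting a few of the fields described above (the path is a placeholder for wherever the archive was extracted):
```python
import json

# "articles_2015_ver2.json" is one of the files listed above; adjust the path as needed.
with open("articles_2015_ver2.json", encoding="utf-8") as f:
    articles = json.load(f)              # a list of dictionaries, one per article

print(len(articles))                     # expected: 127425 for 2015
first = articles[0]
print(first["id"], first["title"], first["publishDate"])
print(first["bodyText"][:200])           # clean article text, HTML stripped
```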
https://spdx.org/licenses/CC0-1.0.html
Objective
To construct and publicly release a set of medical concept embeddings for codes following the ICD-10 coding standard which explicitly incorporate hierarchical information from medical codes into the embedding formulation.
Materials and Methods
We trained concept embeddings using several new extensions to the Word2Vec algorithm on a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classification Software Revised (CCSR) dataset. We compare these results to sets of publicly released pretrained embeddings and alternative training methodologies.
Results
We found that Word2Vec models which included hierarchical data outperformed ordinary Word2Vec alternatives on tasks which compared naïve clusters to canonical ones provided by CCSR. Our Skip-Gram model with both codes and categories achieved 61.4% Normalized Mutual Information with canonical labels in comparison to 57.5% with traditional Skip-Gram. In models operating on two different outcomes we found that including hierarchical embedding data improved classification performance 96.2% of the time. When controlling for all other variables, we found that co-training embeddings improved classification performance 66.7% of the time. We found that all models outperformed our competitive benchmarks.
Discussion
We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.
Methods
This dataset includes trained medical concept embeddings for 5428 ICD-10 codes and 394 Clinical Classification Software (Revised) (CCSR) categories. We include several different sets of concept embeddings, each trained using a slightly different set of hyperparameters and algorithms.
To train our models, we employed data from the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system. KPMAS is an integrated medical system serving approximately 780,000 members in Maryland, Virginia, and the District of Columbia. KPMAS has a comprehensive Electronic Medical Record system which includes data from all patient interactions with primary or specialty caregivers, from which all data is derived. Our embeddings training set included diagnoses allocated to all adult patients in calendar year 2019.
For each code, we also recovered an associated category, as assigned by the Clinical Classification Software (Revised).
We trained 12 sets of embeddings using classical Word2Vec models with settings differing across three parameters. Our first parameter was the selection of training algorithm, where we trained both CBOW and SG models. Each model was trained using dimension k of 10, 50, and 100. Furthermore, each model-dimension combination was trained with categories and codes trained separately and together (referred to hereafter as ‘co-trained embeddings’ or ‘co-embeddings’). Each model was trained for 10 iterations. We employed an arbitrarily large context window (100), since all codes necessarily occurred within a short period (1 year).
We also trained a set of validation embeddings only on ICD-10 codes using the Med2Vec architecture as a comparison. We trained the Med2Vec model on our data using its default settings, including the default vector size (200) and a training regime of 10 epochs. We grouped all codes occurring on the same calendar date as Med2Vec ‘visits.’ Our Med2Vec model benchmark did not include categorical entities or other novel innovations.
Word2Vec embeddings were generated using the GenSim package in Python. Med2Vec embeddings were generated using the Med2Vec code published by Choi. The JSON files used in this repository were generated using the JSON package in Python.
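A minimal sketch of the co-trained Word2Vec setup described above, using gensim and a hypothetical toy corpus (the ICD-10 codes and CCSR category labels shown are illustrative only):
```python
from gensim.models import Word2Vec

# Each "sentence" is one patient's year of diagnoses; ICD-10 codes and CCSR category
# tokens are mixed into the same sequence so that they share an embedding space.
patients = [
    ["E11.9", "CCSR:END005", "I10", "CCSR:CIR007"],
    ["I10", "CCSR:CIR007", "E78.5", "CCSR:END010"],
]

model = Word2Vec(
    sentences=patients,
    vector_size=100,  # dimension k in {10, 50, 100}
    window=100,       # arbitrarily large context window: all codes fall within one year
    sg=1,             # Skip-Gram (use sg=0 for CBOW)
    min_count=1,
    epochs=10,        # 10 iterations
)
print(model.wv["E11.9"][:5])
```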
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data required to reproduce results in "Electrostatic Embedding of Machine Learning Potentials" article. See https://github.com/emedio/embedding for details.
https://creativecommons.org/publicdomain/zero/1.0/
These are the embeddings powering searchthearxiv.com, a semantic search engine for more than a decade's worth of ML papers published on arXiv. The embeddings are created by running OpenAI's text-embedding-ada-002 model on an "augmented" abstract for each paper (title+authors+year+abstract). The papers are sourced from the metadataset published by Cornell University and filtered to include only papers belonging to at least one of the following categories:
cs.cv, cs.lg, cs.cl, cs.ai, cs.ne, and cs.ro.
The dataset is updated on a weekly basis (in lockstep with the official arXiv metadataset).
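For illustration, here is a hedged sketch of embedding one "augmented" abstract with the same model via the OpenAI Python client (the paper text is a placeholder and this is not the site's actual pipeline):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical paper; the "augmented" abstract concatenates title, authors, year and abstract.
augmented = "Attention Is All You Need. Vaswani et al. 2017. The dominant sequence transduction models are based on ..."
response = client.embeddings.create(model="text-embedding-ada-002", input=augmented)
vector = response.data[0].embedding  # a 1536-dimensional list of floats
print(len(vector))
```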
If you find the dataset and/or searchthearxiv.com useful, consider giving the repo a ⭐️ on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this repository, we include code to prepare the dataset, train the GemNet model, build the faiss index, search the faiss index, and visualize the searched results in the notebook faiss-gemnet-qm9-mp.ipynb. It reproduces our examples in the manuscript for the QM9 and Materials Project datasets. For the OC20 dataset, we did not include the related data here because of its large size (> 50 GB); the code to process the OC20 dataset is almost the same as the code included in the notebook for the QM9 dataset.
We include the intermediate data (GemNet checkpoints, lmdb, faiss index and the searched results for QM9 and the Materials Project) in the directory example-data. We also put the GemNet checkpoint for the OC20 dataset in this directory. The training and evaluation of the Gaussian process regression model using the searched molecules for the query Benzene are demonstrated in the ben-gp-data directory, in which qm9-gp-gemnet-morgan-random-nrg.ipynb can be run on Colab.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We gathered a Bangla math dataset with solutions and created embeddings. It can be used to find similar problems for RAG or few-shot prompting by simply loading it and doing a faiss index search. The dataset used to compute the embeddings contains 126,675 problems. Here's the dataset link: https://www.kaggle.com/datasets/sourav2083/math-bangla/data
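A minimal sketch of such a faiss search, with stand-in vectors in place of the released embeddings (the dimension and counts are assumptions):
```python
import faiss
import numpy as np

d, n = 768, 10_000                       # stand-ins; the released dataset has 126,675 problems
problem_embeddings = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(problem_embeddings)   # cosine similarity via normalized inner product

index = faiss.IndexFlatIP(d)
index.add(problem_embeddings)

query = np.random.rand(1, d).astype("float32")  # embed the new problem with the same model
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)     # indices of the 5 most similar problems
print(ids[0], scores[0])
```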
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DBpedia graph embeddings using RDF2Vec. RDF2Vec embedding generation code can be found here and is based on a publication by Portisch et al. [1].
The embeddings dataset consists of 200-dimensional vectors of DBpedia entities (from 1/9/2021).
Generating Embeddings
The code for generating these embeddings can be found here.
Run the run.sh script, which wraps all the necessary commands to generate embeddings:
bash run.sh
The script downloads a set of DBpedia files, which are listed in dbpedia_files.txt. It then builds a Docker image and runs a container of that image that generates the embeddings for the DBpedia graph defined by the DBpedia files.
A folder files is created containing all the downloaded DBpedia files, and a folder embeddings/dbpedia is created containing the embeddings in vectors.txt along with a set of random walk files.
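A minimal sketch of loading the resulting vectors, assuming each line of vectors.txt is an entity name followed by its 200 values separated by spaces (the actual layout may differ):
```python
import numpy as np

embeddings = {}
with open("embeddings/dbpedia/vectors.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)  # 200 values per entity

print(len(embeddings))
```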
Run Time of Embeddings Generation
Generating embeddings can take more than a day, but it depends on the number of DBpedia files chosen to be downloaded. Following are some basic run time statistics when embeddings are generated on a 64 GB RAM, 8 cores (AMD EPYC), 1 TB SSD, 1996.221 MHz machine.
Total: 1 day, 8 hours, 52 minutes, 41 seconds
Walk generation: 0 days, 7 hours, 24 minutes, 36 seconds
Training: 1 day, 1 hour, 28 minutes, 5 seconds
Parameters Used
The following parameters were used to generate the embeddings provided here:
Number of walks per entity: 100
Depth (hops) per walk: 4
Walk generation mode: RANDOM_WALKS_DUPLICATE_FREE
Threads: # of processors / 2
Training mode: sg
Embeddings vector dimension: 200
Minimum word2vec word count: 1
Sample rate: 0.0
Training window size: 5
Training epochs: 5
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.
Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.
Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.
Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:
Background and Motivation
In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.
While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.
In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.
Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2
Paper: to be published
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.
Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.
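A minimal sketch of that splitting protocol, assuming triples.tsv has no header row and tab-separated head/relation/tail columns:
```python
import os

import pandas as pd

triples = pd.read_csv("triples.tsv", sep="\t", names=["head", "relation", "tail"])

for seed in range(5):                    # five independent random splits
    shuffled = triples.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train, n_valid = int(0.8 * len(shuffled)), int(0.1 * len(shuffled))
    os.makedirs(f"split_{seed}", exist_ok=True)
    shuffled.iloc[:n_train].to_csv(f"split_{seed}/train.tsv", sep="\t", index=False, header=False)
    shuffled.iloc[n_train:n_train + n_valid].to_csv(f"split_{seed}/valid.tsv", sep="\t", index=False, header=False)
    shuffled.iloc[n_train + n_valid:].to_csv(f"split_{seed}/test.tsv", sep="\t", index=False, header=False)
```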
wikidata-authors
This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.
The files are as follows:
entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
label: A human-readable label of this entity (extracted from Wikidata).
relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.
label: A human-readable label of this relation (extracted from Wikidata).
triples.tsv: A tab-separated file of all triples in the dataset, in the form of <head entity>, <relation>, <tail entity>.
fb15krr-linked
This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.
The files are as follows:
entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
mid: The Freebase machine ID (MID) of this entity.
wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.
label: A human-readable label of this entity (extracted from Wikidata).
types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].
relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
label: The hierarchical Freebase label of this relation.
triples.tsv: A tab-separated file of all triples in the dataset, in the form of <head entity>, <relation>, <tail entity>.