Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
fire
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nowadays
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.
Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.
Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.
Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:
Background and Motivation
In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.
While there is no universally agreed definition of company similarity, researchers and practitioners in the private equity (PE) industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.
In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
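The embedding-space search described above can be illustrated with a minimal sketch. It assumes each company is already represented by a small dense vector; the company names and three-dimensional vectors below are hypothetical toy values, not CompanyKG data, and the helper names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(seed_id, embeddings, top_k=3):
    """Rank all other companies by cosine similarity to the seed company."""
    seed = embeddings[seed_id]
    scored = [
        (company_id, cosine_similarity(seed, vector))
        for company_id, vector in embeddings.items()
        if company_id != seed_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings (real description embeddings have far more dimensions).
toy_embeddings = {
    "seed_fintech": [0.9, 0.1, 0.0],
    "payments_co": [0.8, 0.2, 0.1],
    "biotech_co": [0.0, 0.1, 0.9],
    "lender_co": [0.7, 0.3, 0.0],
}

ranked = most_similar("seed_fintech", toy_embeddings, top_k=2)
for company_id, score in ranked:
    print(company_id, round(score, 3))
```

A brute-force scan like this is only practical for small candidate sets; at the scale of a million companies an approximate nearest-neighbour index would normally be used instead.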
However, a graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.
Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2
Paper: to be published
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset can be used to test deep learning models that combine structured data (stock prices) and unstructured data (stock bar posts). It is therefore a multi-source heterogeneous dataset.
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
We provide an academic graph based on a snapshot of the Microsoft Academic Graph from 26.05.2021. The Microsoft Academic Graph (MAG) is a large-scale dataset containing information about scientific publication records, their citation relations, as well as authors, affiliations, journals, conferences and fields of study. We acknowledge the Microsoft Academic Graph using the URI https://aka.ms/msracad. For more information regarding schema and the entities present in the original dataset please refer to: MAG schema.
MAG for Heterogeneous Graph Learning: We use a recent version of MAG from May 2021 and extract all relevant entities to build a graph that can be directly used for heterogeneous graph learning (node classification, link prediction, etc.). The graph contains all English papers, published after 1900, that have been cited at least 5 times per year since the time of publishing. For fairness, we set a constant citation bound of 100 for papers published before 2000. We further include two smaller subgraphs, one containing computer science papers and one containing medicine papers.
Nodes and features: We define the following nodes:
paper with mag_id, graph_id, normalized title, year of publication, citations and a 128-dimensional title embedding built using word2vec. No. of papers: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
author with mag_id, graph_id, normalized name, citations. No. of authors: 6,363,201 (all), 1,797,980 (medicine), 557,078 (computer science);
field with mag_id, graph_id, level (the hierarchical level of the field, where 0 is the highest level, e.g. computer science) and citations. No. of fields: 199,457 (all), 83,970 (medicine), 45,454 (computer science);
affiliation with mag_id, graph_id, citations. No. of affiliations: 19,421 (all), 12,103 (medicine), 10,139 (computer science);
venue with mag_id, graph_id, citations, and type (conference or journal). No. of venues: 24,608 (all), 8,514 (medicine), 9,893 (computer science).
Edges: We define the following edges:
author is_affiliated_with affiliation. No. of author-affiliation edges: 8,292,253 (all), 2,265,728 (medicine), 665,931 (computer science);
author is_first/last/other paper. No. of author-paper edges: 24,907,473 (all), 5,081,752 (medicine), 1,269,485 (computer science);
paper has_citation_to paper. No. of paper-paper edges: 142,684,074 (all), 16,808,837 (medicine), 4,152,804 (computer science);
paper conference/journal_published_at venue. No. of paper-venue edges: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
paper has_field_L0/L1/L2/L3/L4 field. No. of paper-field edges: 47,531,366 (all), 9,403,708 (medicine), 3,341,395 (computer science);
field is_in field. No. of field-field edges: 339,036 (all), 138,304 (medicine), 83,245 (computer science).
We further include a reverse edge for each edge type defined above that is denoted with the prefix rev_ and can be removed based on the downstream task.
Data structure: The nodes and their respective features are provided as separate .tsv files where each feature represents a column. The edges are provided as a pickled Python dictionary with the schema:
{target_type: {source_type: {edge_type: {target_id: {source_id: {time } } } } } }
We provide three compressed ZIP archives, one for each subgraph (all, medicine, computer science); the archive for the complete graph is split into 500 MB chunks. Each archive contains the separate node feature files and the edge dictionary.
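As a rough illustration of the nested edge schema above, the following sketch flattens such a dictionary into edge tuples and optionally drops the rev_ edge types. It assumes the innermost level maps each source_id to a single time value, which is one reading of the schema; the toy dictionary, IDs, and years are hypothetical, and the pickle round trip merely stands in for loading the real file with pickle.load.

```python
import pickle


def iter_edges(edge_dict, include_reverse=True):
    """Yield (source_type, source_id, edge_type, target_type, target_id, time).

    Assumes the layout
    {target_type: {source_type: {edge_type: {target_id: {source_id: time}}}}}.
    """
    for target_type, by_source_type in edge_dict.items():
        for source_type, by_edge_type in by_source_type.items():
            for edge_type, by_target_id in by_edge_type.items():
                # Reverse edges are marked with the "rev_" prefix.
                if not include_reverse and edge_type.startswith("rev_"):
                    continue
                for target_id, by_source_id in by_target_id.items():
                    for source_id, time in by_source_id.items():
                        yield (source_type, source_id, edge_type,
                               target_type, target_id, time)


# Hypothetical toy graph: author 7 is first author of paper 0,
# paper 1 cites paper 0, plus one reverse edge.
toy_edges = {
    "paper": {
        "author": {"is_first": {0: {7: 2015}}},
        "paper": {"has_citation_to": {0: {1: 2018}}},
    },
    "author": {
        "paper": {"rev_is_first": {7: {0: 2015}}},
    },
}

# Round trip through pickle to mimic reading the published dictionary.
# Only unpickle files from sources you trust.
loaded = pickle.loads(pickle.dumps(toy_edges))

all_edges = list(iter_edges(loaded))
forward_edges = list(iter_edges(loaded, include_reverse=False))
print(len(all_edges), len(forward_edges))
```

A flat edge list of this kind is usually the easiest starting point for building per-relation index arrays in a graph learning library.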
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Ensemble Learning for Multi-type Classification in Heterogeneous Networks
In this project you can find the following files:
a) EnsembleMRSBC.zip
This file contains the systems (Mr-SBC, ST-MrSBC and MT-MrSBC), the datasets used for the experimental evaluation (they are dump databases generated with PostgreSQL 9.5) and, for each dataset, the 10 folds used for the 10-fold cross validation. Moreover, an example of configuration file for the execution of the system is included in the zip file.
b) README.txt
This file contains the full instructions for the execution of the system.
c) Results_EnsembleMT-MrSBC.xls
This Excel file contains the results in terms of accuracy obtained on all datasets (according to the selected target types and their target attributes) by the systems: Mr-SBC, ST-MrSBC, MT-MrSBC (Lexicographic ordering), MT-MrSBC (Random ordering), RelIBk (RelWEKA), RelSMO (RelWEKA), HENPC and GNetMine. Results are reported for each fold and for each iteration in the case of our ensemble-based systems ST-MrSBC and MT-MrSBC (both Lexicographic and Random versions).
For more details, please refer to the manuscript:
F. Serafino, G. Pio, M. Ceci, "Ensemble Learning for Multi-type Classification in Heterogeneous Networks"
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use this dataset, we would appreciate it if you reference this repository and cite the related works that made its creation possible. This change detection dataset covers different events, satellites, and resolutions, and includes both homogeneous and heterogeneous cases. Its main purpose is to provide a benchmark for semantic change detection in the remote sensing field. This dataset is the outcome of the following publications:
@article{JimenezSierra2022graph,
author={Jimenez-Sierra, David Alejandro and Quintero-Olaya, David Alfredo and Alvear-Mu{\~n}oz, Juan Carlos and Ben{\'i}tez-Restrepo, Hern{\'a}n Dar{\'i}o and Florez-Ospina, Juan Felipe and Chanussot, Jocelyn},
journal={IEEE Transactions on Geoscience and Remote Sensing},
title={Graph Learning Based on Signal Smoothness Representation for Homogeneous and Heterogeneous Change Detection},
year={2022},
volume={60},
number={},
pages={1-16},
doi={10.1109/TGRS.2022.3168126}
}
@article{JimenezSierra2020graph,
title={Graph-Based Data Fusion Applied to: Change Detection and Biomass Estimation in Rice Crops},
author={Jimenez-Sierra, David Alejandro and Ben{\'i}tez-Restrepo, Hern{\'a}n Dar{\'i}o and Vargas-Cardona, Hern{\'a}n Dar{\'i}o and Chanussot, Jocelyn},
journal={Remote Sensing},
volume={12},
number={17},
pages={2683},
year={2020},
publisher={Multidisciplinary Digital Publishing Institute},
doi={10.3390/rs12172683}
}
@inproceedings{jimenez2021blue,
title={Blue noise sampling and Nystrom extension for graph based change detection},
author={Jimenez-Sierra, David Alejandro and Ben{\'\i}tez-Restrepo, Hern{\'a}n Dar{\'\i}o and Arce, Gonzalo R and Florez-Ospina, Juan F},
booktitle={2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS},
pages={2895--2898},
year={2021},
organization={IEEE},
doi={10.1109/IGARSS47720.2021.9555107}
}
@article{florez2023exploiting,
title={Exploiting variational inequalities for generalized change detection on graphs},
author={Florez-Ospina, Juan F and Jimenez Sierra, David A and Benitez-Restrepo, Hernan D and Arce, Gonzalo},
journal={IEEE Transactions on Geoscience and Remote Sensing},
year={2023},
volume={61},
number={},
pages={1-16},
doi={10.1109/TGRS.2023.3322377}
}
@article{florez2023exploitingxiv,
title={Exploiting variational inequalities for generalized change detection on graphs},
author={Florez-Ospina, Juan F. and Jimenez-Sierra, David A. and Benitez-Restrepo, Hernan D. and Arce, Gonzalo R},
year={2023},
publisher={TechRxiv},
doi={10.36227/techrxiv.23295866.v1}
}
The table in the HTML file (dataset_table.html) lists all the metadata and details for each case in the dataset. Cases with a link were gathered from those sources and authors, so you should refer to their work as well. The remaining cases or events (without a link) were obtained from open sources such as:
· Copernicus European Space Agency
· Alaska Satellite Facility (Vertex)
· Earth Data
In addition, we carried out all the image processing with the SNAP toolbox from the European Space Agency. This processing involves the following:
· Data co-registration
· Cropping
· Apply Orbit (for SAR data)
· Calibration (for SAR data)
· Speckle Filter (for SAR data)
· Terrain Correction (for SAR data)
Lastly, the ground truth was obtained from homogeneous images for pre/post events by drawing polygons to highlight the areas where a visible change was present. The images were laid out and synchronized so that they could be zoomed over the same area for a better view of the changes. This was exhaustive work, done to be as precise as possible. Feel free to improve and contribute to this dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The heterogeneous dataset contains 2041 use case descriptions from 26 different software specification documents written in English. Each requirement is manually measured by domain experts using COSMIC Function Point (CFP) and MicroM metrics. The extended results include evaluations with six metrics: MAE, NMAE, MSE, MMRE, PRED(30), and exact-match accuracy.
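For readers unfamiliar with the evaluation metrics listed above, here is a minimal sketch using the definitions commonly found in the software size and effort estimation literature. It is an assumption that the dataset's extended results use exactly these definitions. NMAE is left out because its normalisation (by range, mean, or something else) varies between papers. The sample values are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mmre(actual, predicted):
    """Mean magnitude of relative error: mean of |a - p| / a (a must be non-zero)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def pred(actual, predicted, threshold=0.30):
    """PRED(30) by default: share of items whose relative error is <= threshold."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= threshold)
    return hits / len(actual)


def exact_match_accuracy(actual, predicted):
    """Share of items predicted exactly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


# Hypothetical functional-size measurements and model predictions.
actual_sizes = [10, 20, 40, 8]
predicted_sizes = [12, 20, 30, 4]

print("MAE:", mae(actual_sizes, predicted_sizes))
print("MSE:", mse(actual_sizes, predicted_sizes))
print("MMRE:", mmre(actual_sizes, predicted_sizes))
print("PRED(30):", pred(actual_sizes, predicted_sizes))
print("Exact match:", exact_match_accuracy(actual_sizes, predicted_sizes))
```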
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Graph neural networks (GNNs) have shown great promise for representation learning on complex graph-structured data, but existing models often fall short when applied to directed heterogeneous graphs. In this study, we proposed a novel embedding method, a bidirectional heterogeneous graph neural network with random teleport (BHGNN-RT) that leverages the bidirectional message-passing process and network heterogeneity, for directed heterogeneous graphs. Our method captures both incoming and outgoing message flows, integrates heterogeneous edge types through relation-specific transformations, and introduces a teleportation mechanism to mitigate the oversmoothing effect in deep GNNs. Extensive experiments were conducted on various datasets to verify the efficacy and efficiency of BHGNN-RT. BHGNN-RT consistently outperforms state-of-the-art baselines, achieving up to 11.5% improvement in classification accuracy and 19.3% in entity clustering. Additional analyses confirm that optimizing message components, model layer and teleportation proportion further enhances the model performance. These results demonstrate the effectiveness and robustness of BHGNN-RT in capturing structural, directional information in directed heterogeneous graphs.
The SynthSOD dataset contains more than 47 hours of multitrack music obtained by synthesizing orchestra and ensemble pieces from the Symbolic Orchestral Database (SOD) using the Spitfire BBC Symphony Orchestra Professional library. To synthesize the MIDI files from the SOD, we needed to convert the original files to the General MIDI standard, select a subset of files that met our requirements (e.g., containing only instruments that we could synthesize), and develop a new system to generate musically motivated random annotations for tempo, dynamics, and articulation. The code to replicate this process is available in our repository and all the details can be read in our paper. We have also published the code to train and evaluate the baseline and the pre-trained models in a GitHub repository.
We have also published the aligned score information for most of the pieces here.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
oag-cs, oag-eng, and oag-chem are new heterogeneous networks composed of subsets of the Open Academic Graph (OAG). The three datasets contain papers from three different subject domains: computer science, engineering, and chemistry, respectively. These datasets also contain four types of entities: papers, authors, institutions, and fields of study. Each paper is associated with a 768-dimensional feature vector generated by applying a pre-trained XLNet to the paper title. The representation of each word in the title is weighted by that word's attention to obtain the title representation for each paper. Each paper node is labeled with its publication venue (journal or conference). We use the papers published up to 2016 as the training set, papers published in 2017 as the validation set, and papers published in 2018 and 2019 as the test set. The publication year of each paper is also included, so the datasets can also be converted to use the publication year as the class label.
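The year-based split described above can be sketched as follows. The function name, paper identifiers, and the choice to skip papers published after 2019 are assumptions made for illustration; they are not taken from the datasets' own loading code.

```python
def split_by_year(paper_years):
    """Assign paper IDs to train/valid/test using the publication year.

    Up to 2016 -> train, 2017 -> valid, 2018 or 2019 -> test.
    Papers from any later year are skipped (an assumption of this sketch).
    """
    splits = {"train": [], "valid": [], "test": []}
    for paper_id, year in paper_years.items():
        if year <= 2016:
            splits["train"].append(paper_id)
        elif year == 2017:
            splits["valid"].append(paper_id)
        elif year in (2018, 2019):
            splits["test"].append(paper_id)
    return splits


# Hypothetical paper IDs mapped to publication years.
toy_paper_years = {
    "p1": 2010,
    "p2": 2016,
    "p3": 2017,
    "p4": 2018,
    "p5": 2019,
    "p6": 2021,
}

splits = split_by_year(toy_paper_years)
print(splits)
```

Splitting by time rather than at random keeps later papers out of training, which matches how a venue or year classifier would be used on newly published work.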
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data used for “Heterogeneous Multi-Source Data Fusion Through Input Mapping And Latent Variable Gaussian Process” paper by Yigitcan Comlek, Sandipp Krishnan Ravi, Piyush Pandita, Sayan Ghosh, Liping Wang, and Wei Chen. For all correspondence, please contact Dr. Wei Chen (weichen@northwestern.edu) or Dr. Sandipp Krishnan Ravi (sandippk@umich.edu).
Please use the BibTeX entry below to cite this work:
@article{comlek2024heterogenous,
title={Heterogenous Multi-Source Data Fusion Through Input Mapping and Latent Variable Gaussian Process},
author={Comlek, Yigitcan and Ravi, Sandipp Krishnan and Pandita, Piyush and Ghosh, Sayan and Wang, Liping and Chen, Wei},
journal={arXiv preprint arXiv:2407.11268},
year={2024}
}
The repository consists of data used in three case studies. All the data available is in .csv format. Each csv file contains the data for the specific source used in the case study. Below is a summary of the files for each of the three case studies.
Case Study 1 (Cantilever Beam)
· Source1_RectangularBeam.csv
· Source2_RectangularHollowBeam.csv
· Source3_CircularHollowBeam.csv
Case Study 2 (Ellipsoidal Void)
· Source1_2DEllipse.csv
· Source2_3DEllipse.csv
· Source3_3DEllipseRot.csv
Case Study 3 (Ti6Al4V Alloys)
· Source1_LBPF.csv [1,2]
· Source2_EBM.csv [3]
· Source3_FSW.csv [4]
For this case study the data is collected from the below papers:
[1] Q. Luo, L. Yin, T. W. Simpson, and A. M. Beese, “Effect of processing parameters on pore structures, grain features, and mechanical properties in Ti-6Al-4V by laser powder bed fusion,” Additive Manufacturing, vol. 56, p. 102915, 2022.
[2] Q. Luo, L. Yin, T. W. Simpson, and A. M. Beese, “Dataset of process-structure-property feature relationship for laser powder bed fusion additive manufactured Ti-6Al-4V material,” Data in Brief, vol. 46, p. 108911, 2023.
[3] J. Ran, F. Jiang, X. Sun, Z. Chen, C. Tian, and H. Zhao, “Microstructure and mechanical properties of Ti-6Al-4V fabricated by electron beam melting,” Crystals, vol. 10, no. 11, p. 972, 2020.
[4] A. Fall, M. Jahazi, A. Khdabandeh, and M. Fesharaki, “Effect of process parameters on microstructure and mechanical properties of friction stir-welded Ti–6Al–4V joints,” The International Journal of Advanced Manufacturing Technology, vol. 91, pp. 2919–2931, 2017.
Paper on this topic has been submitted to KDD 2010.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Link prediction and graph classification datasets for heterogeneous graphs in DGL format
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In exploring some of the concepts around Directed Acyclic Graphs and OLab in the assessment of clinical decision making, we have been juggling the ideas around layered and interconnected DAGs. Some of these explorations led us to the concept of heterogeneous graphs
https://www.datainsightsmarket.com/privacy-policy
The global heterogeneous flooring market is anticipated to register a CAGR of 8.5% during the forecast period, from 2025 to 2033. The market is expected to witness robust demand from the commercial and residential flooring segments. The increasing construction activities in emerging economies and renovation & remodeling projects in developed economies are expected to fuel the market growth. The growing adoption of sustainable and eco-friendly flooring solutions is also expected to drive the demand for heterogeneous flooring. Key trends that are shaping the heterogeneous flooring market include the rise of e-commerce, the growing popularity of luxury vinyl tiles (LVTs), and the increasing focus on sustainability. E-commerce is making it easier for consumers to purchase flooring products, which is expected to boost the market growth. LVTs are gaining popularity due to their durability, water resistance, and ease of installation. The growing focus on sustainability is driving the demand for flooring products that are made from recycled materials and are eco-friendly.
The data are population sizes of the yeast Saccharomyces cerevisiae grown in laboratory cultures over a period of several days with different levels of the growth inhibitor cycloheximide. Our results provide rigorous experimental tests of new and old theory, demonstrating how the traditional notion of carrying capacity is ambiguous for populations diffusing in spatially heterogeneous environments.
https://www.skyquestt.com/privacy/
Global Heterogeneous Network Market size was valued at USD 28.66 billion in 2021 and is poised to grow from USD 32.53 billion in 2022 to USD 101.6 billion by 2030, growing at a CAGR of 13.49% in the forecast period (2023-2030).
https://dataintelo.com/privacy-and-policy
The global heterogeneous networks market size was valued at approximately $15 billion in 2023 and is projected to reach around $37 billion by 2032, growing at a compound annual growth rate (CAGR) of 10.8% during the forecast period. The primary growth factor for this market is the increasing demand for high-speed internet and improved network coverage, driven by the rapid proliferation of connected devices and the expansion of smart city initiatives worldwide.
The growth of the heterogeneous networks market is significantly influenced by the escalating need for enhanced data capacity and coverage. With the exponential growth in mobile data traffic, largely fueled by the adoption of smartphones, tablets, and other connected devices, traditional cellular networks are struggling to meet the demands. Heterogeneous networks, which combine various types of network technologies such as small cells, Wi-Fi, and macro cells, provide a viable solution to address these challenges by offering seamless connectivity and increased data throughput.
Another major growth factor for the heterogeneous networks market is the advancement in wireless communication technologies, particularly the deployment of 5G networks. 5G technology promises to deliver faster data speeds, lower latency, and more reliable connections, which are essential for supporting the growing number of Internet of Things (IoT) devices and applications. The integration of heterogeneous networks with 5G infrastructure is expected to enhance network performance and coverage, thereby driving the market growth.
Additionally, the market is being propelled by the increasing investments in smart cities and smart infrastructure projects. Governments and municipalities around the world are investing heavily in smart city initiatives to improve urban living conditions and enhance the efficiency of public services. Heterogeneous networks play a crucial role in these projects by providing the necessary connectivity for smart devices and applications, such as smart lighting, traffic management systems, and surveillance cameras, thus driving the market expansion.
From a regional perspective, the Asia Pacific region is anticipated to witness the highest growth in the heterogeneous networks market during the forecast period. This growth can be attributed to the rapid urbanization, increasing population, and the rising adoption of smart devices in countries like China, India, and Japan. In addition, significant investments in infrastructure development and the rollout of 5G networks in these countries are expected to further boost the demand for heterogeneous networks in the region.
In the heterogeneous networks market, the component segment is broadly categorized into hardware, software, and services. The hardware segment includes various physical devices and equipment such as small cells, macro cells, distributed antenna systems (DAS), and Wi-Fi access points, which form the backbone of heterogeneous networks. The growth of this segment is driven by the increasing deployment of small cells and DAS to enhance network capacity and coverage in urban and densely populated areas. Moreover, the rising adoption of 5G technology is further boosting the demand for advanced hardware components capable of supporting higher data speeds and lower latency.
The software segment encompasses various network management and optimization software solutions that enable seamless integration and coordination of different network technologies. These solutions play a critical role in ensuring efficient network performance, minimizing interference, and optimizing resource allocation. The growing complexity of heterogeneous networks necessitates advanced software solutions to manage and control the network infrastructure effectively. Consequently, the software segment is expected to experience robust growth during the forecast period, driven by the increasing need for efficient network management and optimization.
Services in the heterogeneous networks market include planning, deployment, maintenance, and managed services offered by network service providers and system integrators. As the deployment of heterogeneous networks involves significant technical expertise and resources, the demand for professional services is on the rise. Network operators and enterprises are increasingly relying on service providers for the design and implementation of their network infrastructure, as well as for ongoing maintenance and support. This trend is expected to drive the g
https://www.thebusinessresearchcompany.com/privacy-policy
The global heterogeneous integration market is expected to reach $3.01 billion by 2029, growing at 30.7%. A surge in IoT adoption is fueling market growth through increased demand for connectivity and automation.