58 datasets found

Sample Graph Datasets in CSV Format

zenodo.org

csv

Updated Dec 9, 2024

Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015

Available download formats: csv

Unique identifier

https://doi.org/10.5281/zenodo.14335015

Dataset updated

Dec 9, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Edwin Carreño

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: None of the datasets published here contain actual data; they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
dataset_30_edges_interactions.csv: contains 47 rows (edges).
The common identifier dataset_30 indicates that both files belong to the same graph.
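
The naming convention above can be used to pair node and edge files automatically. The following is a minimal sketch, assuming every file follows the pattern of the two example filenames (identifier, then `nodes` or `edges`, then `_interactions.csv`); the function name `pair_graph_files` is hypothetical and not part of the repository.

```python
import re
from collections import defaultdict

# Pattern inferred from the two example filenames in the description (an assumption).
FILENAME_RE = re.compile(r"^(dataset_\w+?)_(nodes|edges)_interactions\.csv$")


def pair_graph_files(filenames):
    """Group node and edge CSV filenames by their common identifier."""
    graphs = defaultdict(dict)
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match:
            graph_id, kind = match.groups()
            graphs[graph_id][kind] = name
    # Keep only graphs for which both the node file and the edge file are present.
    return {
        graph_id: files
        for graph_id, files in graphs.items()
        if "nodes" in files and "edges" in files
    }


if __name__ == "__main__":
    listing = [
        "dataset_30_nodes_interactions.csv",
        "dataset_30_edges_interactions.csv",
        "dataset_60_nodes_interactions.csv",  # edge file missing, so it is skipped
    ]
    print(pair_graph_files(listing))
```

Files that do not match the pattern, or graphs with only one of the two files, are silently ignored in this sketch; a stricter loader might raise an error instead.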

CSV nodes

Each dataset contains the following columns:

Name of the Column	Type	Description
UniProt ID	string	protein identification
label	string	protein label (type of node)
properties	string	a dictionary containing properties related to the protein.

CSV edges

Each dataset contains the following columns:

Name of the Column	Type	Description
Relationship ID	string	relationship identification
Source ID	string	identification of the source protein in the relationship
Target ID	string	identification of the target protein in the relationship
label	string	relationship label (type of relationship)
properties	string	a dictionary containing properties related to the relationship.

Metadata

Graph	Number of Nodes	Number of Edges	Sparse graph
dataset_30*	30	47	Y
dataset_60*	60	181	Y
dataset_120*	120	689	Y
dataset_240*	240	2819	Y
dataset_300*	300	4658	Y
dataset_600*	600	18004	Y
dataset_1200*	1200	71785	Y
dataset_2400*	2400	288600	Y
dataset_3000*	3000	449727	Y
dataset_6000*	6000	1799413	Y
dataset_12000*	12000	7199863	Y
dataset_24000*	24000	28792361	Y
dataset_30000*	30000	44991744	Y
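
The description does not state which criterion was used for the "Sparse graph" column. One common way to quantify sparsity is edge density; the sketch below assumes directed graphs without self-loops or duplicate edges (an assumption, not something the source confirms) and uses node and edge counts copied from the metadata table above.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of possible directed edges (no self-loops) that are present."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


# (nodes, edges) pairs copied from the metadata table.
METADATA = {
    "dataset_30": (30, 47),
    "dataset_300": (300, 4658),
    "dataset_30000": (30000, 44991744),
}

if __name__ == "__main__":
    for name, (nodes, edges) in METADATA.items():
        print(f"{name}: density = {directed_density(nodes, edges):.4f}")
```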

This repository includes two (2) additional tiny graph datasets to experiment with before dealing with larger datasets.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the Column	Type	Description
ID	string	node identification
label	string	node label (type of node)
properties	string	a dictionary containing properties related to the node.

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the Column	Type	Description
ID	string	relationship identification
source	string	identification of the source node in the relationship
target	string	identification of the target node in the relationship
label	string	relationship label (type of relationship)
properties	string	a dictionary containing properties related to the relationship.

Metadata (tiny graphs)

Graph	Number of Nodes	Number of Edges	Sparse graph
dataset_dummy*	3	6	N
dataset_dummy2*	3	6	N

NASS Data Visualization
agdatacommons.nal.usda.gov
bin
Updated Feb 9, 2024
USDA National Agricultural Statistics Service (2024). NASS Data Visualization [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/NASS_Data_Visualization/24660801
Available download formats: bin
Dataset updated
Feb 9, 2024
Dataset provided by
United States Department of Agriculture (http://usda.gov/)
National Agricultural Statistics Service (http://www.nass.usda.gov/)
Authors
USDA National Agricultural Statistics Service
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
NASS Data Visualization provides a dynamic web query interface that supports searches by Commodity (e.g. Cotton, Corn, Farms & Land, Grapefruit, Hogs, Oranges, Soybeans, Wheat) and Statistic type (automatically refreshed based on the chosen Commodity, e.g. Inventory, Head, Acres Planted, Acres Harvested, Production, Yield). It generates chart, table, and map visualizations by year (2001-2016) and provides a link to download the resulting data in CSV format, suitable for updating databases and spreadsheets.

Resources in this dataset:
Resource Title: NASS Data Visualization web site. File Name: Web Page. URL: https://nass.usda.gov/Data_Visualization/index.php
Query interface with visualization of results as charts, tables, and maps.
Sweden - Distribution of population by household types: Three or more adults...
tradingeconomics.com
csv, excel, json, xml
Updated Aug 26, 2020
TRADING ECONOMICS (2020). Sweden - Distribution of population by household types: Three or more adults [Dataset]. https://tradingeconomics.com/sweden/distribution-of-population-by-household-types-three-or-more-adults-eurostat-data.html
Available download formats: excel, xml, json, csv
Dataset updated
Aug 26, 2020
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1976 - Dec 31, 2025
Area covered
Sweden
Description
Sweden - Distribution of population by household types: Three or more adults was 3.00% in December 2024, according to Eurostat. Trading Economics provides the current actual value, a historical data chart, and related indicators for this series, last updated from Eurostat in March 2025. Historically, the indicator reached a record high of 3.70% in December 2017 and a record low of 2.50% in December 2009.
Grammar transformations of topographic feature type annotations of the U.S....
catalog.data.gov
data.usgs.gov
Updated Jul 20, 2024
U.S. Geological Survey (2024). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
Dataset updated
Jul 20, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
United States
Description
These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. Objectives of our study were to analyze the semantic structure of input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries were shown to effectively convey graph triples that displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (e.g. bowl-shaped), which were kept intact. The tokens’ lemmas were then aggregated and counted to find how often each recurs within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
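
The lemma aggregation step described above can be illustrated with a small standard-library sketch. It is not the authors' code: the token records are hand-written stand-ins for what a spaCy pipeline would produce (spaCy exposes the corresponding values as `token.lemma_` and `token.pos_`), and the function name `count_lemmas` and the example sentence are hypothetical.

```python
from collections import Counter


def count_lemmas(tokens, excluded_pos=("PUNCT", "SYM")):
    """Count lemma occurrences, skipping punctuation and symbol tokens."""
    counts = Counter()
    for token in tokens:
        if token["pos"] in excluded_pos:
            continue
        counts[token["lemma"].lower()] += 1
    return counts


# Hand-written records for a made-up definition, standing in for spaCy output.
# The hyphenated word is kept as a single token, as the description specifies.
EXAMPLE_TOKENS = [
    {"text": "A", "lemma": "a", "pos": "DET"},
    {"text": "bowl-shaped", "lemma": "bowl-shaped", "pos": "ADJ"},
    {"text": "basin", "lemma": "basin", "pos": "NOUN"},
    {"text": ",", "lemma": ",", "pos": "PUNCT"},
    {"text": "or", "lemma": "or", "pos": "CCONJ"},
    {"text": "basins", "lemma": "basin", "pos": "NOUN"},
    {"text": ".", "lemma": ".", "pos": "PUNCT"},
]

if __name__ == "__main__":
    for lemma, total in count_lemmas(EXAMPLE_TOKENS).most_common():
        print(lemma, total)
```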
Device Graph Data | 10+ Identity Types | 1500M+ Global Devices| CCPA...
datarade.ai
Updated Aug 21, 2024
DRAKO (2024). Device Graph Data | 10+ Identity Types | 1500M+ Global Devices| CCPA Compliant [Dataset]. https://datarade.ai/data-products/drako-device-graph-data-usa-canada-comprehensive-insi-drako
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
Dataset updated
Aug 21, 2024
Dataset authored and provided by
DRAKO
Area covered
Philippines, Mozambique, Bahamas, South Sudan, Brazil, Eritrea, Cyprus, Lao People's Democratic Republic, Tonga, Aruba
Description
DRAKO is a leader in providing Device Graph Data, focusing on understanding the relationships between consumer devices and identities. Our data allows businesses to create holistic profiles of users, track engagement across platforms, and measure the effectiveness of advertising efforts.

Device Graph Data is essential for accurate audience targeting, cross-device attribution, and understanding consumer journeys. By integrating data from multiple sources, we provide a unified view of user interactions, helping businesses make informed decisions.

Key Features:
- Comprehensive device mapping to understand user behaviour across multiple platforms
- Detailed Identity Graph Data for cross-device identification and engagement tracking
- Integration with Connected TV Data for enhanced insights into video consumption habits
- Mobile Attribution Data to measure the effectiveness of mobile campaigns
- Customizable analytics to segment audiences based on device usage and demographics
- Some ID types offered: AAID, idfa, Unified ID 2.0, AFAI, MSAI, RIDA, AAID_CTV, IDFA_CTV

Use Cases:
- Cross-device marketing strategies
- Attribution modelling and campaign performance measurement
- Audience segmentation and targeting
- Enhanced insights for Connected TV advertising
- Comprehensive consumer journey mapping

Data Compliance: All of our Device Graph Data is sourced responsibly and adheres to industry standards for data privacy and protection. We ensure that user identities are handled with care, providing insights without compromising individual privacy.

Data Quality: DRAKO employs robust validation techniques to ensure the accuracy and reliability of our Device Graph Data. Our quality assurance processes include continuous monitoring and updates to maintain data integrity and relevance.
CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...
data.niaid.nih.gov
zenodo.org
Updated Jun 4, 2024
Mark Granroth-Wilding (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957401
Dataset updated
Jun 4, 2024
Dataset provided by
Mark Granroth-Wilding
Dhiana Deva Cavacanti Rocha
Armin Catovic
Drew McCornack
Lele Cao
Vilhelm von Ehrenheim
Richard Anselmo Stahl
Description
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated four evaluation tasks:

Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set contains 76 distinct target companies, each with 5.3 annotated competitors on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.

Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).
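
For the Competitor Retrieval task described above, a method is expected to retrieve all N annotated competitors of a target company. One common way to score this is recall at K; the sketch below is an illustration under that assumption, since the description does not name the metric. The function name and the company identifiers are hypothetical.

```python
def recall_at_k(ranked_candidates, true_competitors, k):
    """Share of annotated competitors that appear in the top-k ranked candidates."""
    relevant = set(true_competitors)
    if not relevant:
        return 0.0
    top_k = set(ranked_candidates[:k])
    return len(top_k & relevant) / len(relevant)


if __name__ == "__main__":
    # Made-up ranking returned by some similarity method for target company "A".
    ranking = ["c7", "c2", "c9", "c4", "c1"]
    annotated = ["c2", "c4", "c8"]
    print(recall_at_k(ranking, annotated, k=5))
```

A value of 1.0 would mean that every annotated competitor was found within the first K results.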

Background and Motivation

In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

While there is no universally agreed definition of company similarity, researchers and practitioners in the private equity (PE) industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
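
The embedding-space search mentioned above can be sketched as follows. This is a toy illustration rather than the CompanyKG code: the two-dimensional vectors and company names are invented, and real embeddings would come from a language model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(seed, embeddings, top_n=3):
    """Rank all other companies by cosine similarity to the seed company."""
    seed_vector = embeddings[seed]
    scored = [
        (name, cosine_similarity(seed_vector, vector))
        for name, vector in embeddings.items()
        if name != seed
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy embeddings for three hypothetical companies.
TOY_EMBEDDINGS = {
    "seed_co": [1.0, 0.0],
    "near_co": [0.9, 0.1],
    "far_co": [0.0, 1.0],
}

if __name__ == "__main__":
    for name, score in most_similar("seed_co", TOY_EMBEDDINGS):
        print(f"{name}: {score:.3f}")
```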

However, a graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

Source Code and Tutorial: https://github.com/llcresearch/CompanyKG2

Paper: to be published
Data from: Entity Typing Datasets
data.niaid.nih.gov
explore.openaire.eu
Updated Mar 2, 2023
Russa Biswas (2023). Entity Typing Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7688589
Dataset updated
Mar 2, 2023
Dataset authored and provided by
Russa Biswas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are the datasets used in the Entity Type Prediction task for Knowledge Graph Completion.

DB630k_Fine-grained_Hierarchical.zip dataset has been used in the papers [1] and [2]. It is an extended version of DBpedia630k dataset originally created for Text classification and is available here.

FIGER.zip dataset has also been used in the papers [1] and [2].

MultilingualETdata.zip dataset has been used in the paper [3]

NamesETdata.zip dataset has been used in the paper [4]. The CaLiGraph test dataset can also be downloaded here.

[1] Biswas R, Sofronova R, Sack H, Alam M. Cat2type: Wikipedia category embeddings for entity typing in knowledge graphs. In Proceedings of the 11th Knowledge Capture Conference 2021 Dec 2 (pp. 81-88).

[2] Biswas R, Portisch J, Paulheim H, Sack H, Alam M. Entity type prediction leveraging graph walks and entity descriptions. In The Semantic Web–ISWC 2022: 21st International Semantic Web Conference, Virtual Event, October 23–27, 2022, Proceedings 2022 Oct 16 (pp. 392-410). Cham: Springer International Publishing.

[3] Biswas R, Chen Y, Paulheim H, Sack H, Alam M. It’s All in the Name: Entity Typing Using Multilingual Language Models. In The Semantic Web: ESWC 2022 Satellite Events: Hersonissos, Crete, Greece, May 29–June 2, 2022, Proceedings 2022 Jul 20 (pp. 36-41). Cham: Springer International Publishing.

[4] Biswas R, Sofronova R, Alam M, Heist N, Paulheim H, Sack H. Do judge an entity by its name! entity typing using language models. In The Semantic Web: ESWC 2021 Satellite Events: Virtual Event, June 6–10, 2021, Revised Selected Papers 18 2021 (pp. 65-70). Springer International Publishing.
Amount of data created, consumed, and stored 2010-2023, with forecasts to...
statista.com
Updated Nov 21, 2024
Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
Dataset updated
Nov 21, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
May 2024
Area covered
Worldwide
Description
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

Storage capacity also growing

Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.
Data from: CKGG: A Chinese Knowledge Graph for High-School Geography...
data.subak.org
zenodo.org
csv
Updated Feb 16, 2023
State Key Laboratory for Novel Software Technology, Nanjing University, China (2023). CKGG: A Chinese Knowledge Graph for High-School Geography Education and Beyond [Dataset]. https://data.subak.org/dataset/ckgg-a-chinese-knowledge-graph-for-high-school-geography-education-and-beyond
Available download formats: csv
Dataset updated
Feb 16, 2023
Dataset provided by
Nanjing University (http://www.nju.edu.cn/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
List of files

admin.nt.gz: part of administrative divisions (country, admin1, admin2, admin3, admin4)

admin_code.nt.gz: administrative division code

altitude.nt.gz: altitude in meters

belong_by_admin.nt.gz: "part of lowest administrative divisions" triples converted to triples with "part of" predicate

climate.nt.gz: climate type of location, also metadata about climate types (name, description, polygon)

cn_stat_data.nt.gz: statistical data of province from National Bureau of Statistics of China

coord.nt.gz: coordinate (latitude and longitude)

entity_rank_score.nt.gz: ranking score of entity

feature_belong.nt.gz: "part of" relations calculated using polygons of places

geonames_code.nt.gz: geonames feature code

gis.nt.gz: polygons of place

globalsolaraltas.nt.gz: annual solar radiation data in kWh/m2

label.nt.gz: labels

node_align.nt.gz: alignment to geonames and wikidata

ocean_current.nt.gz: influenced by ocean current, also metadata about ocean currents (name, linestring)

pop.nt.gz: population

precip_xxx.nt.gz: monthly and annual precipitation data of entities in meters

temperature_xxx.nt.gz: monthly and annual temperature data of entities in degrees Celsius

to_ocean_distance.nt.gz: location distance to ocean in meters

types.nt.gz: rdf:type of locations

tz.nt.gz: metadata of timezones (label, offset to gmt)

tz_loc.nt.gz: timezone of locations

wikilinks.nt.gz: wikipedia links of locations
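
The `.nt.gz` extension suggests gzip-compressed N-Triples, a line-based RDF format with one triple per line. The following is a minimal sketch for streaming such a file with the standard library only; it is a simplified reader under that assumption, not a full N-Triples parser, and `iter_triples` is a hypothetical helper. For production use, a dedicated RDF library would be the safer choice.

```python
import gzip


def iter_triples(path):
    """Yield (subject, predicate, object) strings from a gzipped N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            # Drop the terminating "." before splitting.
            body = line[:-1].rstrip() if line.endswith(".") else line
            # Split on the first two whitespace runs only, so that literal
            # objects containing spaces stay in one piece.
            subject, predicate, obj = body.split(None, 2)
            yield subject, predicate, obj


if __name__ == "__main__":
    # "label.nt.gz" is one of the files listed above; adjust the path as needed.
    for index, triple in enumerate(iter_triples("label.nt.gz")):
        print(triple)
        if index >= 4:
            break
```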

Ontology

ontology.owl: CKGG ontology

mapping.owl: CKGG ontology mapping to other ontologies (Clinga, DBpedia, GeoSPARQL, SKOS)

For more details, please refer to http://w3id.org/ckgg
Wikipedia Knowledge Graph dataset
zenodo.org
explore.openaire.eu
pdf, tsv
Updated Jul 17, 2024
Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
Available download formats: tsv, pdf
Unique identifier
https://doi.org/10.5281/zenodo.6346900
Dataset updated
Jul 17, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

There are a total of 9 files, all in TSV format, built under a relational structure. The core of the dataset is the page file. It is accompanied by 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables", connecting pages to those entities and to other pages (page_category, page_url, page_pub and page_link files).
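
To illustrate how an intermediate table links pages to entities, the sketch below joins page, category and page_category rows. The column names (`page_id`, `title`, `category_id`, `name`) are hypothetical placeholders; the actual headers should be taken from the Dataset_summary document. The sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def categories_by_page(page_rows, category_rows, page_category_rows):
    """Map each page title to its category names via the intermediate table."""
    titles = {row["page_id"]: row["title"] for row in page_rows}
    names = {row["category_id"]: row["name"] for row in category_rows}
    result = defaultdict(list)
    for link in page_category_rows:
        title = titles.get(link["page_id"])
        name = names.get(link["category_id"])
        # Skip links that point at a page or category missing from the inputs.
        if title is not None and name is not None:
            result[title].append(name)
    return dict(result)


# Invented sample rows with hypothetical column names.
PAGES = read_tsv("page_id\ttitle\n1\tGraph theory\n2\tSweden\n")
CATEGORIES = read_tsv("category_id\tname\n10\tMathematics\n20\tCountries\n")
PAGE_CATEGORIES = read_tsv("page_id\tcategory_id\n1\t10\n2\t20\n")

if __name__ == "__main__":
    print(categories_by_page(PAGES, CATEGORIES, PAGE_CATEGORIES))
```

For the real files, `csv.DictReader` can be pointed at an open file handle instead of an in-memory string.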

The document Dataset_summary includes a detailed description of the dataset.

Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
Data from: OpenAIRE Graph Beginner's Kit Dataset
zenodo.org
pub.uni-bielefeld.de
tar
Updated Aug 20, 2023
Miriam Baglioni; Claudio Atzori; Alessia Bardi; Gianbattista Bloisi; Sandro La Bruzzo; Paolo Manghi; Harry Dimitropoulos; Andrea Mannocci; Ioannis Foufoulas; Marek Horst; Michele De Bonis; Michele Artini; Thanasis Vergoulis; Serafeim Chatzopoulos; Dimitris Pierrakos; Antonis Lempesis; Andreas Czerniak; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou; Argiro Kokogiannaki (2023). OpenAIRE Graph Beginner's Kit Dataset [Dataset]. http://doi.org/10.5281/zenodo.8223812
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.8223812
Dataset updated
Aug 20, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Miriam Baglioni; Claudio Atzori; Alessia Bardi; Gianbattista Bloisi; Sandro La Bruzzo; Paolo Manghi; Harry Dimitropoulos; Andrea Mannocci; Ioannis Foufoulas; Marek Horst; Michele De Bonis; Michele Artini; Thanasis Vergoulis; Serafeim Chatzopoulos; Dimitris Pierrakos; Antonis Lempesis; Andreas Czerniak; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou; Argiro Kokogiannaki
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The OpenAIRE Graph is an Open Access dataset containing metadata about research products (literature, datasets, software, etc.) linked to other entities of the research ecosystem like organisations, project grants, and data sources.

The large size of the OpenAIRE Graph is a major impediment for beginners who want to familiarise themselves with the underlying data model and explore its contents. Working with the Graph in its full size typically requires access to a large distributed computing infrastructure, which is not easily accessible to everyone.

The OpenAIRE Beginner’s Kit aims to address this issue. It consists of two components:

A subset of the OpenAIRE Graph composed of the research products published between 2022-12-28 and 2023-07-31, all the entities connected to them and the respective relationships. The subset is composed of the following parts:

publication.tar: metadata records about research literature (includes types of publications listed here)

dataset.tar: metadata records about research data (includes the subtypes listed here)

software.tar: metadata records about research software (includes the subtypes listed here)

otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)

organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.

datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.

project.tar: metadata records about project grants.

relation.tar: metadata records about relations between entities in the graph.

communities_infrastructures.tar: metadata records about research communities and research infrastructures

Each file is a tar archive containing gz files, each with one JSON record per line. Each record complies with the schema available at http://doi.org/10.5281/zenodo.8238874.
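
The layout described above (a tar archive of gzip files, each holding one JSON record per line) can be streamed without unpacking anything to disk. This is a minimal sketch using only the standard library; `iter_records` is a hypothetical helper, and the `id` field printed in the usage example is an assumption that should be checked against the published schema.

```python
import gzip
import json
import tarfile


def iter_records(tar_path):
    """Yield one parsed JSON record per line from every .gz member of a tar file."""
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            # Skip directories and anything that is not a gzip member.
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            extracted = archive.extractfile(member)
            with gzip.open(extracted, "rt", encoding="utf-8") as lines:
                for line in lines:
                    if line.strip():
                        yield json.loads(line)


if __name__ == "__main__":
    # "publication.tar" is one of the archives listed above.
    for index, record in enumerate(iter_records("publication.tar")):
        print(record.get("id"))
        if index >= 4:
            break
```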

The code to analyse the data. It is available on GitHub. Just download the archive, unzip/untar it and follow the instructions in the README file (no need to clone the GitHub repository).
Semantic Knowledge Graphing Market is Growing at a CAGR of 14.80% from 2024...
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Mar 4, 2024
Cognitive Market Research (2024). Semantic Knowledge Graphing Market is Growing at a CAGR of 14.80% from 2024 to 2031. [Dataset]. https://www.cognitivemarketresearch.com/semantic-knowledge-graphing-market-report
Available download formats: pdf, excel, csv, ppt
Dataset updated
Mar 4, 2024
Dataset provided by
Decipher Market Research
Authors
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global semantic knowledge graphing market size is USD 1512.2 million in 2024 and will expand at a compound annual growth rate (CAGR) of 14.80% from 2024 to 2031.

North America held the largest share, around 40% of global revenue, with a market size of USD 604.88 million in 2024, and will grow at a compound annual growth rate (CAGR) of 13.0% from 2024 to 2031.
Europe accounted for over 30% of the global market, with a market size of USD 453.66 million.
Asia Pacific held around 23% of global revenue, with a market size of USD 347.81 million in 2024, and will grow at a CAGR of 16.8% from 2024 to 2031.
Latin America held around 5% of global revenue, with a market size of USD 75.61 million in 2024, and will grow at a CAGR of 14.2% from 2024 to 2031.
Middle East and Africa held around 2% of global revenue, with a market size of USD 30.24 million in 2024, and will grow at a CAGR of 14.5% from 2024 to 2031.
Natural language processing knowledge graphing had the highest growth rate in the semantic knowledge graphing market in 2024.
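
The share and growth figures quoted above follow from two simple formulas: a regional size is the global size times the regional share, and a value growing at a constant CAGR r for t years is multiplied by (1 + r)^t. The sketch below applies both to the numbers in the text; the function names are hypothetical, and the 2031 projection is only the arithmetic implied by the stated CAGR, not a figure from the report.

```python
def regional_size(global_size, share):
    """Market size of a region given the global size and the region's share."""
    return global_size * share


def project_value(start_value, cagr, years):
    """Value after compounding at a constant annual growth rate."""
    return start_value * (1.0 + cagr) ** years


GLOBAL_2024_USD_MILLION = 1512.2  # global market size stated for 2024

if __name__ == "__main__":
    # Regional sizes implied by the stated shares.
    for region, share in [("North America", 0.40), ("Europe", 0.30), ("Asia Pacific", 0.23)]:
        print(region, round(regional_size(GLOBAL_2024_USD_MILLION, share), 2))
    # Global size in 2031 implied by a 14.80% CAGR over 7 years (2024 to 2031).
    print(round(project_value(GLOBAL_2024_USD_MILLION, 0.148, 7), 1))
```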

Market Dynamics of Semantic Knowledge Graphing Market

Key Drivers of Semantic Knowledge Graphing Market

Growing Volumes of Structured, Semi-structured, and Unstructured Data to Increase the Global Demand

The global demand for semantic knowledge graphing is escalating in response to the exponential growth of structured, semi-structured, and unstructured data. Enterprises are inundated with vast amounts of data from diverse sources such as social media, IoT devices, and enterprise applications. Structured data from databases, semi-structured data like XML and JSON, and unstructured data from documents, emails, and multimedia files present significant challenges in terms of organization, analysis, and deriving actionable insights. Semantic knowledge graphing addresses these challenges by providing a unified framework for representing, integrating, and analyzing disparate data types. By leveraging semantic technologies, businesses can unlock the value hidden within their data, enabling advanced analytics, natural language processing, and knowledge discovery. As organizations increasingly recognize the importance of harnessing data for strategic decision-making, the demand for semantic knowledge graphing solutions continues to surge globally.

Demand for Contextual Insights to Propel the Growth

The burgeoning demand for contextual insights is propelling the growth of semantic knowledge graphing solutions. In today's data-driven landscape, businesses are striving to extract deeper contextual meaning from their vast datasets to gain a competitive edge. Semantic knowledge graphing enables organizations to connect disparate data points, understand relationships, and derive valuable insights within the appropriate context. This contextual understanding is crucial for various applications such as personalized recommendations, predictive analytics, and targeted marketing campaigns. By leveraging semantic technologies, companies can not only enhance decision-making processes but also improve customer experiences and operational efficiency. As industries across sectors increasingly recognize the importance of contextual insights in driving innovation and business success, the adoption of semantic knowledge graphing solutions is poised to witness significant growth. This trend underscores the pivotal role of semantic technologies in unlocking the true potential of data for strategic advantage in today's dynamic marketplace.

Restraint Factors Of Semantic Knowledge Graphing Market

Stringent Data Privacy Regulations to Hinder the Market Growth

Stringent data privacy regulations present a significant hurdle to the growth of the Semantic Knowledge Graphing market. Regulations such as GDPR (General Data Protection Regulation) in Europe and CCPA (California Consumer Privacy Act) in the United States impose strict requirements on how organizations collect, store, process, and share personal data. Compliance with these regulations necessitates robust data protection measures, including anonymization, encryption, and access controls, which can complicate the implementation of semantic knowledge graphing systems. Moreover, concerns about data breach...
Spain - Distribution of population by household types: Single person
tradingeconomics.com
csv, excel, json, xml
Updated Sep 16, 2020
Cite
TRADING ECONOMICS (2020). Spain - Distribution of population by household types: Single person [Dataset]. https://tradingeconomics.com/spain/distribution-of-population-by-household-types-single-person-eurostat-data.html
Explore at:
Available download formats: csv, json, xml, excel
Dataset updated
Sep 16, 2020
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1976 - Dec 31, 2025
Area covered
Spain
Description
Spain - Distribution of population by household types: Single person was 11.30% in December 2024, according to EUROSTAT. Trading Economics provides the current actual value, a historical data chart, and related indicators for this series, last updated from EUROSTAT in March 2025. Historically, the series reached a record high of 11.30% in December 2024 and a record low of 8.40% in December 2009.
Area Age Gender Statistics Chart - Epidemic Typhus - Statistics by Onset...
data.gov.tw
csv, json
+ more versions
Cite
Centers for Disease Control, Area Age Gender Statistics Chart - Epidemic Typhus - Statistics by Onset Date (in months) [Dataset]. https://data.gov.tw/en/datasets/8671
Explore at:
Available download formats: json, csv
Dataset authored and provided by
Centers for Disease Control
License
https://data.gov.tw/license
Description
Statistical table of the number of cases by region, age group, and gender since 2003 (Disease name: Scrub typhus, Date type: Onset date, Case type: Confirmed case, Source of infection: Domestic, Imported).
Hungary - Distribution of population by household types: Single person
tradingeconomics.com
csv, excel, json, xml
Updated Sep 15, 2020
+ more versions
Cite
TRADING ECONOMICS (2020). Hungary - Distribution of population by household types: Single person [Dataset]. https://tradingeconomics.com/hungary/distribution-of-population-by-household-types-single-person-eurostat-data.html
Explore at:
Available download formats: xml, json, excel, csv
Dataset updated
Sep 15, 2020
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1976 - Dec 31, 2025
Area covered
Hungary
Description
Hungary - Distribution of population by household types: Single person was 13.80% in December 2023, according to EUROSTAT. Trading Economics provides the current actual value, a historical data chart, and related indicators for this series, last updated from EUROSTAT in March 2025. Historically, the series reached a record high of 14.50% in December 2017 and a record low of 9.20% in December 2010.
Part I and Part II crimes bar chart
data.wu.ac.at
csv, json, xml
Updated Aug 27, 2016
+ more versions
Cite
County of San Mateo Sheriff's Office (2016). Part I and Part II crimes bar chart [Dataset]. https://data.wu.ac.at/schema/performance_smcgov_org/bnZoNi01OW1m
Explore at:
Available download formats: csv, json, xml
Dataset updated
Aug 27, 2016
Dataset provided by
County of San Mateo Sheriff's Office
License
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Description
Counts of Part I crimes committed in San Mateo County from 1985 on. This dataset also includes Part II crimes from 2013 on.

Part I crimes include: homicide, rape, robbery, aggravated assault, burglary, motor vehicle theft, larceny-theft, and arson. These counts include crimes committed at San Francisco International Airport (SFO), Unincorporated San Mateo County, Woodside, Portola Valley, San Carlos from 10/31/10 forward; Half Moon Bay from 6/12/11 forward; and Millbrae from 3/4/12 forward.

Part II crime counts do not include San Francisco International Airport (SFO) cases and are estimates only. An estimate is required because there are no specific data types used when keying in Type II crime types. Therefore, Records Manager judgment is used.
The Enhanced Microsoft Academic Knowledge Graph
datacatalogue.sodanet.gr
datacatalogue.cessda.eu
Updated Apr 30, 2024
Cite
Κατάλογος Δεδομένων SoDaNet (2024). The Enhanced Microsoft Academic Knowledge Graph [Dataset]. http://doi.org/10.17903/FK2/TZWQPD
Explore at:
Unique identifier
https://doi.org/10.17903/FK2/TZWQPD
Dataset updated
Apr 30, 2024
Dataset provided by
Κατάλογος Δεδομένων SoDaNet
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1800 - Dec 31, 2021
Area covered
Worldwide
Dataset funded by
European Commission
Description
The Enhanced Microsoft Academic Knowledge Graph (EMAKG) is a large dataset of scientific publications and related entities, including authors, institutions, journals, conferences, and fields of study. The proposed dataset originates from the Microsoft Academic Knowledge Graph (MAKG), one of the most extensive freely available knowledge graphs of scholarly data. To build the dataset, we first assessed the limitations of the current MAKG. Then, based on these, several methods were designed to enhance data and increase the number of use case scenarios, particularly in mobility and network analysis.

EMAKG provides two main advantages:
It has improved usability, facilitating access for non-expert users.
It includes an increased number of types of information obtained by integrating various datasets and sources, which helps expand the application domains. For instance, geographical information could help mobility and migration research.

The knowledge graph completeness is improved by retrieving and merging information on publications and other entities no longer available in the latest version of MAKG. Furthermore, geographical and collaboration network details are employed to provide data on authors as well as their annual locations and career nationalities, together with worldwide yearly stocks and flows.

Among others, the dataset also includes:
fields of study (and publications) labelled by their discipline(s);
abstracts and linguistic features, i.e., standard language codes, tokens, and types;
entities’ general information, e.g., date of foundation and type of institutions; and
academia-related metrics, i.e., h-index.

The resulting dataset maintains all the characteristics of the parent datasets and includes a set of additional subsets and data that can be used for new case studies relating to network analysis, knowledge exchange, linguistics, computational linguistics, and mobility and human migration, among others.
Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...
zenodo.org
data.niaid.nih.gov
zip
Updated Nov 29, 2023
Cite
Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7909511
Dataset updated
Nov 29, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies: it has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real world. But they also pose nontrivial challenges in research on embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

Dataset Details
The dataset consists of the four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
Subject matter triples file
fb+/-CVT+/-REV: One folder for each variant. Each folder contains 5 files: train.txt, valid.txt, test.txt, entity2id.txt, and relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, that is, domains describing real-world facts.
Example of a row in train.txt, valid.txt, and test.txt:
2, 192, 0
Example of a row in entity2id.txt:
/g/112yfy2xr, 2
Example of a row in relation2id.txt:
/music/album/release_type, 192
Explanation
"/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to these objects.
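To make the ID scheme concrete, here is a minimal sketch that turns an ID triple back into readable identifiers. It assumes the files use the comma-separated layout shown in the example rows (the real files may use a different delimiter). The entity row for ID 0 is inferred from the explanation above rather than quoted from a file, and the function names load_id_map and decode_triple are hypothetical.

```python
def load_id_map(lines):
    """Build {integer ID: label} from rows of the form 'label, id'."""
    id_to_label = {}
    for line in lines:
        # Split on the last comma only, so the label may itself contain commas.
        label, raw_id = line.rsplit(",", 1)
        id_to_label[int(raw_id)] = label.strip()
    return id_to_label


def decode_triple(row, entities, relations):
    """Translate a 'subject, relation, object' ID row into labels."""
    subject_id, relation_id, object_id = (int(part) for part in row.split(","))
    return entities[subject_id], relations[relation_id], entities[object_id]


# Rows modelled on the examples in the description. The second entity row
# is an assumption derived from the explanation (object MID with ID 0).
entity_rows = ["/g/112yfy2xr, 2", "/m/02lx2r, 0"]
relation_rows = ["/music/album/release_type, 192"]

entities = load_id_map(entity_rows)
relations = load_id_map(relation_rows)

triple = decode_triple("2, 192, 0", entities, relations)
print(triple)
```

Keeping the maps keyed by integer ID means a triple from train.txt, valid.txt, or test.txt can be decoded with three dictionary lookups.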
Type system file
freebase_endtypes: Each row maps an edge type to its required subject type and object type.
Example
92, 47178872, 90
Explanation
"92" and "90" are the type IDs of the subject and object, respectively, of the relationship with ID "47178872".
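One possible use of this file is checking whether a triple respects the type system. The sketch below assumes the column order is subject type, relationship ID, object type, as the explanation suggests, and that rows are comma-separated as in the example. The type sets passed in for the subject and object are hypothetical inputs, and the function names are illustrative, not part of the dataset.

```python
def load_endtypes(lines):
    """Build {relationship ID: (subject type ID, object type ID)}."""
    endtypes = {}
    for line in lines:
        subject_type, relation_id, object_type = (int(part) for part in line.split(","))
        endtypes[relation_id] = (subject_type, object_type)
    return endtypes


def is_type_consistent(relation_id, subject_types, object_types, endtypes):
    """Return True if both entities carry the types the relationship requires."""
    required_subject, required_object = endtypes[relation_id]
    return required_subject in subject_types and required_object in object_types


endtypes = load_endtypes(["92, 47178872, 90"])

# Hypothetical type sets for two entities joined by relationship 47178872.
consistent = is_type_consistent(47178872, {92}, {90}, endtypes)
swapped = is_type_consistent(47178872, {90}, {92}, endtypes)
print(consistent, swapped)
```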
Metadata files
object_types: Each row maps the MID of a Freebase object to a type it belongs to.
Example
/g/11b41c22g, /type/object/type, /people/person
Explanation
The entity with MID "/g/11b41c22g" has a type "/people/person"
object_names: Each row maps the MID of a Freebase object to its textual label.
Example
/g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
Explanation
The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
Example
/m/05v3y9r, /type/object/id, "/music/live_album/concert"
Explanation
The entity with MID "/m/05v3y9r" can be interpreted by a human as a music concert live album.
domains_id_label: Each row maps the MID of a Freebase domain to its label.
Example
/m/05v4pmy, geology, 77
Explanation
The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
types_id_label: Each row maps the MID of a Freebase type to its label.
Example
/m/01xljxh, /government/political_party, 147
Explanation
The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
entities_id_label: Each row maps the MID of a Freebase entity to its label.
Example
/g/11b78qtr5m, Viroliano Tries Jazz, 2234
Explanation
The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
properties_id_label: Each row maps the MID of a Freebase property to its label.
Example
/m/010h8tp2, /comedy/comedy_group/members, 47178867
Explanation
The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has label "/comedy/comedy_group/members" and id "47178867" in our dataset.
uri_original2simplified and uri_simplified2original: The mapping between original URI and simplified URI and the mapping between simplified URI and original URI, respectively.
Example
uri_original2simplified
"http://rdf.freebase.com/ns/type.property.unique": "/type/property/unique"
uri_simplified2original
"/type/property/unique": "http://rdf.freebase.com/ns/type.property.unique"
Explanation
The URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset is simplified into "/type/property/unique" in our dataset.
The identifier "/type/property/unique" in our dataset has URI http://rdf.freebase.com/ns/type.property.unique in the original Freebase RDF dataset.
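The single example above suggests that the simplified form drops the namespace prefix and replaces dots with slashes. The sketch below assumes that rule holds in general, which the description does not state, so the shipped mapping files remain the authoritative source. The function names simplify_uri and restore_uri are hypothetical.

```python
# Namespace prefix taken from the example URI in the description.
FREEBASE_NS = "http://rdf.freebase.com/ns/"


def simplify_uri(uri):
    """Assumed rule: strip the namespace, then turn dots into slashes."""
    if not uri.startswith(FREEBASE_NS):
        raise ValueError(f"not a Freebase namespace URI: {uri}")
    return "/" + uri[len(FREEBASE_NS):].replace(".", "/")


def restore_uri(simplified):
    """Inverse of simplify_uri under the same assumed rule."""
    return FREEBASE_NS + simplified.lstrip("/").replace("/", ".")


original = "http://rdf.freebase.com/ns/type.property.unique"
simplified = simplify_uri(original)
print(simplified)
print(restore_uri(simplified))
```

A lookup in uri_original2simplified or uri_simplified2original avoids relying on this guess, at the cost of loading the mapping file first.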
OpenAIRE Graph Dataset
data.niaid.nih.gov
Updated Sep 24, 2024
Cite
OpenAIRE Graph Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3516917
Explore at:
Dataset updated
Sep 24, 2024
Dataset provided by
Vergoulis, Thanasis
Ioannidis, Alexandros
Principe, Pedro
Foufoulas, Ioannis
Chatzopoulos, Serafeim
Lempesis, Antonis
La Bruzzo, Sandro
Iatropoulou, Katerina
Artini, Michele
Dimitropoulos, Harry
Baglioni, Miriam
Atzori, Claudio
Manola, Natalia
Bardi, Alessia
Manghi, Paolo
Kokogiannaki, Argiro
De Bonis, Michele
Mannocci, Andrea
Horst, Marek
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The OpenAIRE Graph is exported as several datasets, so you can download the parts you are interested in.

publication_[part].tar: metadata records about research literature (includes types of publications listed here)
dataset_[part].tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct_[part].tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.
project.tar: metadata records about project grants.
relation_[part].tar: metadata records about relations between entities in the graph.
communities_infrastructures.tar: metadata records about research communities and research infrastructures

Each file is a tar archive containing gz files, each with one JSON record per line. Each JSON record complies with the schema available at http://doi.org/10.5281/zenodo.13120886. The documentation for the model is available at https://graph.openaire.eu/docs/data-model/
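Because each archive nests gzip files of line-delimited JSON inside a tar file, the records can be streamed without unpacking anything to disk. The following is a minimal sketch using only the Python standard library. The demo archive is built in memory, and the member name part-00000.json.gz and the "id" field are made up for illustration; they are not taken from the OpenAIRE schema.

```python
import gzip
import io
import json
import tarfile


def iter_records(tar_fileobj):
    """Yield one parsed JSON object per line from every .gz member of a tar."""
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = archive.extractfile(member)
            # Decompress lazily and read the member as UTF-8 text lines.
            with gzip.open(raw, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    if line.strip():
                        yield json.loads(line)


def build_demo_tar(records):
    """Create a small in-memory tar holding one gzip file of JSON lines."""
    text = "\n".join(json.dumps(record) for record in records)
    payload = gzip.compress(text.encode("utf-8"))
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="part-00000.json.gz")  # hypothetical name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Hypothetical records; real ones follow the published schema.
demo = build_demo_tar([{"id": "a"}, {"id": "b"}])
records = list(iter_records(demo))
print(len(records))
```

For a downloaded file, the same generator can be fed an open binary file handle, for example open("organization.tar", "rb").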

Learn more about the OpenAIRE Graph at https://graph.openaire.eu.

Discover the graph's content on OpenAIRE EXPLORE and our API for developers.
Number of data compromises and impacted individuals in U.S. 2005-2023
statista.com
Updated Dec 10, 2024
Cite
Statista (2024). Number of data compromises and impacted individuals in U.S. 2005-2023 [Dataset]. https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
Explore at:
Dataset updated
Dec 10, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
United States
Description
In 2023, the number of data compromises in the United States stood at 3,205 cases. Meanwhile, over 353 million individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: in all three, sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches

Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information that organizations in these sectors store. In 2022, healthcare, financial services, and manufacturing were the three industry sectors that recorded the most data breaches. The number of healthcare data breaches in the United States has gradually increased within the past few years. In the financial sector, data compromises almost doubled between 2020 and 2022, while manufacturing saw data compromise incidents more than triple.

Largest data exposures worldwide

In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This is by far the most extensive reported data leakage. The case is unusual, though, because cyber security researchers found the vulnerability before cyber criminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, revised the number of leaked records to three billion. In March 2018, the third-biggest data breach happened, involving India’s national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.