Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template is presented and made freely available that allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed-designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match with individual needs. A variety of example applications of syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS a valuable tool to move towards more transparency in data visualization.
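The SPSS syntax files themselves are not reproduced in this description. As a rough illustration of the kind of transparent graph the collection produces (raw data points shown alongside the mean and its 95% CI), here is a minimal Python/matplotlib analogue on fictitious data; it is not the SPSS syntax itself, and the group names and values are made up.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# fictitious two-group data, standing in for the dataset template
groups = {"Group A": rng.normal(5.0, 1.2, 30), "Group B": rng.normal(6.1, 1.0, 30)}

fig, ax = plt.subplots()
for pos, (name, values) in enumerate(groups.items()):
    jitter = rng.uniform(-0.08, 0.08, len(values))           # spread raw points horizontally
    ax.plot(pos + jitter, values, "o", alpha=0.4)            # raw data
    mean = values.mean()
    ci95 = 1.96 * values.std(ddof=1) / np.sqrt(len(values))  # normal-approximation 95% CI
    ax.errorbar(pos, mean, yerr=ci95, fmt="s", capsize=6, color="black")
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups.keys())
ax.set_ylabel("Outcome (arbitrary units)")
plt.show()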
Companion data for the creation of a banksia plot.
Background: In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.
Methods: The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
Results: In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.
Conclusions: The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.
This collection of files allows the user to create the images used in the companion paper and amend this code to create their own banksia plots using either Stata version 17 or R version 4.3.1.
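A minimal Python sketch of the centring-and-scaling step described above (not the authors' Stata/R code): the reference point estimate is shifted to zero, its CI is scaled to span one, and the matching comparator estimate and CI are transformed by the same shift and scale. The numeric values in the example are made up.

def centre_and_scale(ref_est, ref_lo, ref_hi, comp_est, comp_lo, comp_hi):
    """Centre the reference point estimate at zero, scale its CI to span one,
    and apply the same shift and scale to the comparator analysis."""
    shift = ref_est
    scale = ref_hi - ref_lo            # reference CI width becomes 1 after scaling
    t = lambda x: (x - shift) / scale
    return ([t(v) for v in (ref_est, ref_lo, ref_hi)],
            [t(v) for v in (comp_est, comp_lo, comp_hi)])

# Example: the reference CI (1.8, 3.0) around 2.4 maps to (-0.5, 0.5) around 0,
# and the comparator is adjusted by the same amounts.
print(centre_and_scale(2.4, 1.8, 3.0, 2.1, 1.2, 3.6))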
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for the paper "A Recommender System of Buggy App Checkers for App Store Moderators", published at the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.
Dataset Collection
We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store in January and March 2014. Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we updated the dataset twice. The dataset D1 contains the apps available in the Google Play Store in January 2014. Then, we created a new snapshot (D2) of the Google Play Store in March 2014.
The apps belong to the 27 different categories defined by Google (at the time of writing the paper) and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g. tools-free, tools-paid, sports-new_free, etc.), we collected a maximum of 500 samples, resulting in a median number of 1,978 apps per category.
For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star counting, and the set of permission requests.
In addition, for each app, we collected up to a maximum of the latest 500 reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, so several reviews lack some of these details. From all the reviews attached to an app, we only considered the reviews associated with the latest version of the app, i.e., we discarded unversioned and old-versioned reviews. This resulted in a corpus of 1,402,717 reviews (January 2014).
Dataset Stats
Some stats about the datasets:
D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.
D2 (Mar. 2014) contains 46,644 apps and 9,319 different permission requests, and 1,361,319 user reviews.
Additional stats about the datasets are available here.
Dataset Description
To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because the graph visualization helps to identify connections among data (e.g., clusters of apps sharing similar sets of permission requests).
In particular, our dataset graph contains six types of nodes:
- APP nodes containing the metadata of each app
- PERMISSION nodes describing permission types
- CATEGORY nodes describing app categories
- SUBCATEGORY nodes describing app subcategories
- USER_REVIEW nodes storing user reviews
- TOPIC nodes describing topics mined from user reviews (using LDA)
Furthermore, there are five types of relationships connecting APP nodes to each of the remaining node types.
Dataset Files Info
Neo4j 2.0 Databases
googlePlayDB1-Jan2014_neo4j_2_0.rar
googlePlayDB2-Mar2014_neo4j_2_0.rar
We provide two Neo4j databases containing the two snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. The databases were created with Neo4j 2.0, specifically with 'Neo4j 2.0.0-M06 Community Edition' (the latest version available at the time the paper was implemented in 2014).
Neo4j 3.5 Databases
googlePlayDB1-Jan2014_neo4j_3_5_28.rar
googlePlayDB2-Mar2014_neo4j_3_5_28.rar
Neo4j 2.0 is now deprecated and no longer available in the official Neo4j Download Center. We have therefore migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28. The databases can be opened with 'Neo4j Community Edition 3.5.28', which can be downloaded from the official Neo4j Download page.
To open the databases with more recent versions of Neo4j, the databases must first be migrated to the corresponding version. Instructions about the migration process can be found in the Neo4j Migration Guide.
The first time the Neo4j database is connected, it may request credentials. The username and password are neo4j/neo4j.
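Once a database is running, it can be queried from Python with the official neo4j driver. A small sketch follows; the bolt URL and credentials are assumptions about a local setup, and the queries deliberately use untyped relationships because the five relationship type names are not listed in this description.

from neo4j import GraphDatabase

# Connection details are assumptions: adjust the bolt URL and credentials to your
# local Neo4j 3.5 instance (default credentials neo4j/neo4j, see above).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

# Only the documented node labels (APP, PERMISSION, CATEGORY, ...) are used;
# relationship types are left unspecified on purpose.
queries = {
    "apps_per_category": "MATCH (a:APP)--(c:CATEGORY) RETURN c, count(a) AS apps ORDER BY apps DESC LIMIT 10",
    "top_permissions": "MATCH (a:APP)--(p:PERMISSION) RETURN p, count(a) AS apps ORDER BY apps DESC LIMIT 10",
}

with driver.session() as session:
    for name, cypher in queries.items():
        print(name)
        for record in session.run(cypher):
            print(record.data())
driver.close()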
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.
The social networks of developers who starred popular machine learning and web development repositories (with at least 10 stars) until August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest connected component of each graph (with at least 10 users).
Discussion-based and non-discussion-based threads from Reddit, collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion-based or not (binary classification).
The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict, using the ego-net, whether the ego user plays a single game or multiple games. Players who play a single game usually have a denser ego-net.
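These are whole-graph (graph-level) classification tasks. As a generic, hedged illustration (not the datasets' official loader), the kind of simple structural features one might compute per graph with networkx looks like this; the toy graph below merely stands in for one ego-net.

import networkx as nx

def graph_features(g: nx.Graph) -> dict:
    """Whole-graph descriptors that could feed a graph-level classifier."""
    return {
        "n_nodes": g.number_of_nodes(),
        "n_edges": g.number_of_edges(),
        "density": nx.density(g),                  # e.g., denser ego-nets for single-game players
        "avg_clustering": nx.average_clustering(g),
    }

# toy stand-in for one ego-net; the real datasets ship as collections of such graphs
toy = nx.barabasi_albert_graph(30, 2, seed=0)
print(graph_features(toy))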
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed, undirected, or multiple edges between the graph nodes. Networks are graphs with data on the nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in Nov, 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Enhanced Microsoft Academic Knowledge Graph (EMAKG) is a large dataset of scientific publications and related entities, including authors, institutions, journals, conferences, and fields of study. The proposed dataset originates from the Microsoft Academic Knowledge Graph (MAKG), one of the most extensive freely available knowledge graphs of scholarly data. To build the dataset, we first assessed the limitations of the current MAKG. Then, based on these, several methods were designed to enhance the data and broaden the range of use case scenarios, particularly in mobility and network analysis. EMAKG provides two main advantages: it has improved usability, facilitating access for non-expert users, and it includes an increased number of types of information, obtained by integrating various datasets and sources, which helps expand the application domains. For instance, geographical information could help mobility and migration research. The completeness of the knowledge graph is improved by retrieving and merging information on publications and other entities no longer available in the latest version of MAKG. Furthermore, geographical and collaboration network details are employed to provide data on authors as well as their annual locations and career nationalities, together with worldwide yearly stocks and flows. Among others, the dataset also includes: fields of study (and publications) labelled by their discipline(s); abstracts and linguistic features, i.e., standard language codes, tokens, and types; entities' general information, e.g., date of foundation and type of institution; and academia-related metrics, i.e., the h-index. The resulting dataset maintains all the characteristics of the parent datasets and includes a set of additional subsets and data that can be used for new case studies relating to network analysis, knowledge exchange, linguistics, computational linguistics, and mobility and human migration, among others.
https://www.verifiedmarketresearch.com/privacy-policy/
Graph Database Market size was valued at USD 2.86 Billion in 2024 and is projected to reach USD 14.58 Billion by 2032, growing at a CAGR of 22.6% from 2026 to 2032.
Global Graph Database Market Drivers
The growth and development of the Graph Database Market is attributed to several key market drivers. These factors have a strong impact on how graph databases are demanded and adopted across different sectors. Several of the major market forces are as follows:
Growth of Connected Data: Graph databases are excellent at expressing and querying relationships as businesses work with datasets that are more complex and interconnected. Graph databases are becoming more and more in demand as connected data gains significance across multiple industries.
Knowledge Graph Emergence: In fields like artificial intelligence, machine learning, and data analytics, knowledge graphs, which arrange information in a graph structure, are becoming more and more popular. Graph databases are a natural fit for creating and querying knowledge graphs, which is driving their widespread use.
Analytics and Machine Learning Advancements: Graph databases handle relationships and patterns in data effectively, enabling applications related to advanced analytics and machine learning. Graph databases are becoming more and more in demand when combined with analytics and machine learning as businesses want to extract more insights from their data.
Real-Time Data Processing: Graph databases can process data in real-time, which makes them appropriate for applications that need quick answers and insights. In situations like fraud detection, recommendation systems, and network analysis, this is especially helpful.
Increasing Need for Security and Fraud Detection: Graph databases are useful for fraud detection and security applications because they can identify patterns and anomalies in linked data. The ongoing evolution of cybersecurity threats is driving the growing need for graph databases in security solutions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs

Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command (see also the neo4j documentation: https://neo4j.com/docs/):
/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph

Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts:
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user); it has an EntityType and an identifier (attribute "ID")
:Log nodes - describe a collection of events that were recorded together; most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed; :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes; describe whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes; describe which event is directly followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationships - from a :Log to an :Event node; describe which events had been recorded in which event log
:OBSERVES relationships - from an :Event to a :Class node; describe to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationships - placeholder for any structural relationship between two :Entity nodes
The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552

Data Contents
-------------
neo4j-bpic14-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2014 dataset. van Dongen, B.F. (Boudewijn) (2014): BPI Challenge 2014. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:c3e5d162-0cfd-4bb0-bd82-af5268819c35
BPI Challenge 2014: Similar to other ICT companies, Rabobank Group ICT has to implement an increasing number of software releases, while the time to market is decreasing. Rabobank Group ICT has implemented the ITIL processes and therefore uses the Change process for implementing these so-called planned changes. Rabobank Group ICT is looking for fact-based insight into sub-questions concerning the impact of past changes, in order to predict the workload at the Service Desk and/or IT Operations after future changes. The challenge is to design a (draft) predictive model that can be implemented in a BI environment. The purpose of this predictive model will be to support Business Change Management in implementing software releases with less impact on the Service Desk and/or IT Operations. We have prepared several case files with anonymous information from Rabobank Netherlands Group ICT for this challenge.
The files contain record details from an ITIL Service Management tool called HP Service Manager. The original data had the information as extracts in CSV, with the Interaction, Incident, or Change number as case ID. Next to these case files, we provide an Activity log related to the Incident cases. There is also a document detailing the data in the CSV files and providing background on the Service Management tool. All this information is integrated in the labeled property graph in this dataset.
The data contains the following entities and their events:
- ServiceComponent - an IT hardware or software component in a financial institute
- ConfigurationItem - a part of a ServiceComponent that can be configured, changed, or modified
- Incident - a problem or issue that occurred at a configuration item of a service component
- Interaction - a logical grouping of activities performed for investigating an incident and identifying a solution for the incident
- Change - a logical grouping of activities performed to change or modify one or more configuration items
- Case_R - a user or worker involved in any of the steps
- KM - an entry in the knowledge database used to resolve incidents

Data Size
---------
BPIC14, nodes: 919,838, relationships: 6,682,386
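Using the node labels and relationship types defined in the data schema above, the imported graph can be queried from Python. A minimal sketch follows; the bolt URL and credentials are assumptions about a local instance, and the whole :Class nodes are returned because the property holding the class name is not specified above.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))  # assumed connection details

# Count how often one event class is directly followed by another, using the
# :DF and :OBSERVES relationships from the schema above.
CYPHER = """
MATCH (e1:Event)-[:DF]->(e2:Event),
      (e1)-[:OBSERVES]->(c1:Class),
      (e2)-[:OBSERVES]->(c2:Class)
RETURN c1, c2, count(*) AS freq
ORDER BY freq DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(CYPHER):
        print(record["c1"], "->", record["c2"], record["freq"])
driver.close()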
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
An example of an ontology:
Ontology: Music Ontology
Expected Output:
{
"id": "ont_k_music_test_n",
"sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
"triples": [
{
"sub": "The Loco-Motion",
"rel": "publication date",
"obj": "01 January 1962"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Gerry Goffin"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Carole King"
}]
}
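Since extracted triples must comply with the ontology, a simple conformance check can be scripted. The sketch below is illustrative only: the relation list for the music ontology and the file name are assumptions, not the benchmark's actual ontology format or evaluation code.

import json

def ontology_conformance(output: dict, allowed_relations: set) -> float:
    """Fraction of extracted triples whose relation is defined in the ontology."""
    triples = output.get("triples", [])
    if not triples:
        return 0.0
    return sum(1 for t in triples if t["rel"] in allowed_relations) / len(triples)

# Assumed relation names for the music ontology; the benchmark's real ontology files may differ.
music_relations = {"publication date", "lyrics by", "composer", "performer"}

with open("expected_output.json") as f:   # hypothetical file holding the expected output shown above
    expected = json.load(f)
print(ontology_conformance(expected, music_relations))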
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
The structure of the repo is as follows:
benchmark - the code used to generate the benchmark
evaluation - evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The study examines different graph-based methods of detecting anomalous activities on digital markets, proposing the most efficient way to increase market actors' protection and reduce information asymmetry. Anomalies are defined below as both bots and fraudulent users (who can be both bots and real people). Methods are compared against each other and against state-of-the-art results from the literature, and a new algorithm is proposed. The goal is to find an efficient method suitable for threat detection, both in terms of predictive performance and computational efficiency, that scales well and remains robust as the underlying technologies advance. The article utilized three publicly accessible graph-based datasets: one describing the Twitter social network (TwiBot-20) and two describing Bitcoin cryptocurrency markets (Bitcoin OTC and Bitcoin Alpha). In the former, an anomaly is defined as a bot, as opposed to a human user, whereas in the latter, an anomaly is a user who conducted a fraudulent transaction, which may (but does not have to) imply being a bot. The study proves that graph-based data is a better-performing predictor than text data. It compares different graph algorithms to extract feature sets for anomaly detection models. It finds that methods based on nodes' statistics result in better model performance than state-of-the-art graph embeddings, and they also yield a significant improvement in computational efficiency. This often means reducing the time by hours or enabling modeling on significantly larger graphs (usually not feasible in the case of embeddings). On that basis, the article proposes its own graph-based statistics algorithm. Furthermore, using embeddings requires two engineering choices: the type of embedding and its dimension. The research examines whether there are types of graph embeddings and dimensions that perform significantly better than others. The solution turned out to be dataset-specific and needed to be tailored on a case-by-case basis, adding even more engineering overhead to using embeddings (building a leaderboard over a grid of embedding instances, each of which takes hours to generate). This, again, speaks in favor of the proposed algorithm based on nodes' statistics, which makes this engineering overhead redundant.
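As a generic illustration of the node-statistics approach favoured by the study (the exact statistics of the article's algorithm are not reproduced here), each node can be turned into a small feature vector with networkx and fed to a downstream anomaly or bot classifier.

import networkx as nx
import pandas as pd

def node_statistics(g: nx.Graph) -> pd.DataFrame:
    """Per-node structural statistics usable as features for an anomaly/bot classifier."""
    return pd.DataFrame({
        "degree": dict(g.degree()),
        "pagerank": nx.pagerank(g),
        "clustering": nx.clustering(g),
    })

# toy graph; in the study the graphs come from TwiBot-20, Bitcoin OTC and Bitcoin Alpha
g = nx.karate_club_graph()
features = node_statistics(g)
print(features.head())
# `features` can then be joined with node labels and passed to any supervised classifier.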
Context
The StockNet dataset, introduced by Xu and Cohen at ACL 2018, is a benchmark for measuring the effectiveness of textual information in stock market prediction. While the original dataset provides valuable price and news data, it requires significant pre-processing and feature engineering to be used effectively in advanced machine learning models.
This dataset was created to bridge that gap. We have taken the original data for 87 stocks and performed extensive feature engineering, creating a rich, multi-modal feature repository.
A key contribution of this work is a preliminary statistical analysis of the news data for each stock. Based on the consistency and volume of news, we have categorized the 87 stocks into two distinct groups, allowing researchers to choose the most appropriate modeling strategy:
joint_prediction_model_set: Stocks with rich and consistent news data, ideal for building complex, single models that analyze all stocks jointly.
panel_data_model_set: Stocks with less consistent news data, which are better suited for traditional panel data analysis.
Content and File Structure
The dataset is organized into two main directories, corresponding to the two stock categories mentioned above.
1. joint_prediction_model_set
This directory contains stocks suitable for sophisticated, news-aware joint modeling.
-Directory Structure: This directory contains a separate sub-directory for each stock suitable for joint modeling (e.g., AAPL/, MSFT/, etc.).
-Folder Contents: Inside each stock's folder, you will find a set of files, each corresponding to a different category of engineered features. These files include:
-News Graph Embeddings: A NumPy tensor file (.npy) containing the encoded graph embeddings from daily news. Its shape is (Days, N, 128), where N is the number of daily articles.
-Engineered Features: A CSV file containing fundamental features derived directly from OHLCV data (e.g., intraday_range, log_return).
-Technical Indicators: A CSV file with a wide array of popular technical indicators (e.g., SMA, EMA, MACD, RSI, Bollinger Bands).
-Statistical & Time Features: A CSV file with rolling statistical features (e.g., volatility, skew, kurtosis) over an optimized window, plus cyclical time-based features.
-Advanced & Transformational Features: A CSV file with complex features like lagged variables, wavelet transform coefficients, and the Hurst Exponent.
2. panel_data_model_set
This directory contains stocks that are more suitable for panel data models, based on the statistical properties of their associated news data.
-Directory Structure: Similar to the joint prediction set, this directory also contains a separate sub-directory for each stock in this category.
-Folder Contents: Inside each stock's folder, you will find the cleaned and structured price and news text data. This facilitates the application of econometric models or machine learning techniques designed for panel data, where observations are tracked for the same subjects (stocks) over a period of time.
-Further Information: For a detailed breakdown of the statistical analysis used to separate the stocks into these two groups, please refer to the data_preview.ipynb notebook located in the TRACE_ACL18_raw_data directory.
Methodology
The features for the joint_prediction_model_set were generated systematically for each stock:
-News-to-Graph Pipeline: Daily news headlines were processed to extract named entities. These entities were then used to query Wikidata and build knowledge subgraphs. A Graph Convolutional Network (GCN) model encoded these graphs into dense vectors.
-Feature Engineering: All other features were generated from the raw price and volume data. The process included basic calculations, technical analysis via pandas-ta, generation of statistical and time-based features, and advanced transformations like wavelet analysis.
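A hedged pandas sketch of the kinds of OHLCV-derived features named above (log returns, intraday range, a simple SMA, rolling volatility/skew/kurtosis, cyclical time features). The column names "Open", "High", "Low", "Close" are assumptions about the raw file layout, and this is not the exact pipeline used to build the dataset.

import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Basic, technical and rolling statistical features from OHLCV data."""
    out = pd.DataFrame(index=df.index)
    out["log_return"] = np.log(df["Close"]).diff()
    out["intraday_range"] = (df["High"] - df["Low"]) / df["Open"]
    out["sma"] = df["Close"].rolling(window).mean()
    out["volatility"] = out["log_return"].rolling(window).std()
    out["skew"] = out["log_return"].rolling(window).skew()
    out["kurtosis"] = out["log_return"].rolling(window).kurt()
    dow = df.index.dayofweek                       # requires a DatetimeIndex
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)   # cyclical time-based feature
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    return out

# usage (hypothetical file name): engineer_features(pd.read_csv("AAPL.csv", index_col=0, parse_dates=True))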
Acknowledgements
This dataset is an extension and transformation of the original StockNet dataset. We extend our sincere gratitude to the original authors for their contribution to the field.
Original Paper: "Stock Movement Prediction from Tweets and Historical Prices" by Yumo Xu and Shay B. Cohen (ACL 2018).
Original Data Repository: https://github.com/yumoxu/stocknet-dataset
Inspiration
This dataset opens the door to numerous exciting research questions:
-Can you build a single, powerful joint model using the joint_prediction_model_set to predict movements for all stocks simultaneously?
-How does a sophisticated joint model compare against a traditional panel data model trained on the panel_data_model_set?
-What is the lift in predictive power from using news-based graph embeddings versus using only technical indicators?
-Can you apply transfer learning or multi-task learning, using the feature-rich joint set to improve predictions for the panel set?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of the change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, in particular two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying the locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.
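As a highly simplified illustration of the first step described above (not the paper's efficient-score or sieve-bootstrap procedure), each network snapshot can be reduced to a univariate topological summary and the resulting series scanned for the strongest mean shift.

import networkx as nx
import numpy as np

def summarize(snapshots):
    """Reduce each snapshot to a univariate topological summary (here: average clustering)."""
    return np.array([nx.average_clustering(g) for g in snapshots])

def strongest_mean_shift(x):
    """Naive scan for the index maximising the standardized difference of segment means."""
    best_k, best_stat = None, -np.inf
    for k in range(2, len(x) - 2):
        left, right = x[:k], x[k:]
        se = np.sqrt(left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right))
        stat = abs(left.mean() - right.mean()) / (se + 1e-12)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat

# toy dynamic network: 20 sparse snapshots followed by 20 denser ones
snaps = [nx.gnp_random_graph(50, 0.05, seed=i) for i in range(20)] + \
        [nx.gnp_random_graph(50, 0.20, seed=i) for i in range(20)]
print(strongest_mean_shift(summarize(snaps)))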
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Data Bank (Berman et al., 2000) by using the "Molecular Function Browser" in the "Advanced Search Interface" ("Signaling (GO ID23052)", protein identity cut-off = 30%). The negative group of 2,077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012), using identity (degree of correspondence between two sequences) of less than 20%, a resolution of 1.6 Å, and an R-factor of 0.25. The full dataset contains 2,685 FASTA sequences of protein chains from the PDB: 608 are signaling proteins and 2,077 are non-signaling peptides. This kind of unbalanced data is not well suited as input for learning algorithms because the results would present a high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to obtain a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the minority class: new samples are created by interpolating between existing minority-class samples. After this pre-processing, the final dataset is composed of 1,824 positive samples (signaling protein chains) and 2,432 negative cases (non-signaling protein chains). The paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038
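A minimal sketch of the SMOTE balancing step with imbalanced-learn; the synthetic descriptor matrix X and labels y are stand-ins for the star-graph descriptors actually used in the paper.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the real protein descriptors: 2685 samples, roughly 23% positives.
X, y = make_classification(n_samples=2685, n_features=20, weights=[0.77, 0.23], random_state=0)
print("before:", Counter(y))

# Note: with the default sampling_strategy SMOTE fully balances the classes; the paper's final
# split (1,824 positive / 2,432 negative) implies a specific oversampling ratio instead.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))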
Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038.(http://www.sciencedirect.com/science/article/pii/S0022519315003999)
As of January 2024, #love was the most used hashtag on Instagram, being included in over two billion posts on the social media platform. #Instagood and #instagram were used over one billion times as of early 2024.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]).
DataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a gzip-compressed tar archive of 837.3 MB.
The archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve.
To facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files. These files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f. This file can also be found in data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data. For more information about concepts and format, please see [2].
To extract all EML files from the included Preston archive, first extract the hashes associated with EML files using:
cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep "hash://" | sort | uniq > eml-hashes.txt
extract data.tar.gz using:
~/preston-archive$ tar xzf data.tar.gz
then use Preston to extract each hash using something like:
~/preston-archive$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa
Alternatively, without using Preston, you can extract the data using the naming convention:
data/[x]/[y]/[z]/[hash]/data
where x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.
For example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].
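The naming convention above can also be applied programmatically; a small illustrative Python helper:

def hash_to_path(content_hash, base="data"):
    """Map a hash://sha256/<hex> identifier to its location in the extracted archive."""
    hex_digest = content_hash.rsplit("/", 1)[-1]
    return f"{base}/{hex_digest[0:2]}/{hex_digest[2:4]}/{hex_digest[4:6]}/{hex_digest}/data"

print(hash_to_path("hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa"))
# -> data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data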
The intended use of this archive is to facilitate meta-analysis of the DataONE dataset network.
[1] DataONE, https://www.dataone.org
[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543. DataONE was crawled via Preston with "preston update -u https://dataone.org".
[3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
[4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep "hash://" | sort | uniq | wc -l
[5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
This dataset contains the supplementary materials to our publication "Collaborative Problem Solving in Mixed Reality: A Study on Visual Graph Analysis", in which we report on a study we conducted. Please refer to the publication for more details; the abstract is included at the end of this description. The dataset contains:
- The collection of graphs with layout used in the study
- The final, randomized experiment files used in the study
- The source code of the study prototype
- The collected, anonymized data in tabular form
- The code for the statistical analysis
- The Supplemental Materials PDF
Paper abstract: Problem solving is a composite cognitive process, invoking a number of systems and subsystems, such as perception and memory. Individuals may form collectives to solve a given problem together, in collaboration, especially when complexity is thought to be high. To determine if and when collaborative problem solving is desired, we must quantify collaboration first. For this, we investigate the practical virtue of collaborative problem solving. Using visual graph analysis, we perform a study with 72 participants in two countries and three languages. We compare ad hoc pairs to individuals and nominal pairs, solving two different tasks on graphs in visuospatial mixed reality. The average collaborating pair does not outdo its nominal counterpart, but it does have a significant trade-off against the individual: an ad hoc pair uses 1.46 more time to achieve 4.6 higher accuracy. We also use the concept of task instance complexity to quantify differences in complexity. As task instance complexity increases, these differences largely scale, though with two notable exceptions. With this study we show the importance of using nominal groups as a benchmark in collaborative virtual environments research. We conclude that a mixed reality environment does not automatically imply superior collaboration.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The common practice is to model the Kidney Exchange Problem (KEP) on a directed graph G = (V,A), called the compatibility graph, where the set of vertices V corresponds to the set of incompatible pairs and an arc from vertex i to vertex j indicates compatibility of the donor in i with the patient in j. The instances have been created by the most commonly used generator, described in Saidman S, Roth A, Sönmez T, Ünver M, Delmonico F. Increasing the opportunity of live kidney donation by matching for two- and three-way exchanges. Transplantation 2006;81:773-82. The generator creates random graphs based on probabilities of blood type and of donor-patient tissue compatibility. Default values for the generator's parameters were used. For research where probabilities of failure of arcs and vertices are considered, this information is provided in an additional file for each instance. The probabilities of failure p_i of a vertex i, or p_ij of an arc (i,j), were generated randomly with uniform distribution in [0;1].
The dataset is split into two parts according to the size of the graphs. The folder small contains instances with n = 20, 30, 40, 50, 60, 70, 80, 90, and 100 vertices: there are 50 instances of each size for n = 20, 30, 40, 50, 60, 70, and 100 vertices, and 10 instances of the bigger sizes. The folder big has instances with n = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, and 5000 vertices, with 10 instances of each size.
Each instance has two files with data. 1) The first one is the compatibility graph of an instance of a given size. The name of these files is formed as n_seed.input.gz, where n = |V| is the number of incompatible pairs in the pool and seed is the seed used for the random function when generating the instance. The first line of the file contains the values n (number of vertices in the graph) and m (number of arcs in the graph). In the following m lines of the file, the existing arcs (i,j) are presented as: i j w_ij, where w_ij is the weight of the arc, which is always equal to 1.0 for all the instances in this dataset. 2) The second file contains the probabilities of failure. Its name is formed as n_seed.prob.gz. The lines of the file are formed as follows, consecutively for each vertex i in V: p_i p_ij1 … p_ijk
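A small Python sketch for reading one instance under the format described above; the example file names merely follow the stated n_seed.input.gz / n_seed.prob.gz convention and are hypothetical.

import gzip

def read_instance(path):
    """Read a compatibility graph: first line 'n m', then m lines 'i j w_ij'."""
    with gzip.open(path, "rt") as f:
        n, m = map(int, f.readline().split())
        arcs = []
        for _ in range(m):
            i, j, w = f.readline().split()
            arcs.append((int(i), int(j), float(w)))
    return n, arcs

def read_failure_probabilities(path):
    """Read, consecutively for each vertex i, the line 'p_i p_ij1 ... p_ijk'."""
    with gzip.open(path, "rt") as f:
        return [list(map(float, line.split())) for line in f if line.strip()]

# usage with hypothetical file names following the convention above:
# n, arcs = read_instance("20_1.input.gz")
# probs = read_failure_probabilities("20_1.prob.gz")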
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For K3 and Km-e graphs, a coloring of type (K3,Km-e;n) is an edge coloring of the complete graph Kn that contains no K3 subgraph in the first color (represented by non-edges of the stored graph) and no Km-e subgraph in the second color (represented by edges of the stored graph). Km-e denotes the complete graph Km with one edge removed. The Ramsey number R(K3,Km-e) is the smallest natural number n such that every edge coloring of the complete graph Kn contains a subgraph isomorphic to K3 in the first color (no edge in the graph) or isomorphic to Km-e in the second color (edge present in the graph). Colorings of type (K3,Km-e;n) exist for n < R(K3,Km-e). The dataset consists of:
a) 3 files containing all non-isomorphic graphs that are colorings of type (K3,K3-e;n) for 1<n<5,
b) 5 files containing all non-isomorphic graphs that are colorings of type (K3,K4-e;n) for 1<n<7,
c) 9 files containing all non-isomorphic graphs that are colorings of type (K3,K5-e;n) for 1<n<11,
d) 15 files containing all non-isomorphic graphs that are colorings of type (K3,K6-e;n) for 1<n<17.
All graphs have been saved in Graph6 format (https://users.cecs.anu.edu.au/~bdm/data/formats.html). The Nauty package by Brendan D. McKay was used to check the isomorphism of the graphs (http://users.cecs.anu.edu.au/~bdm/nauty/). We recommend the survey article by S. Radziszowski containing the most important results regarding Ramsey numbers: S. Radziszowski, Small Ramsey Numbers, Electron. J. Comb., Dynamic Survey DS1, revision #15, March 3, 2017 (https://doi.org/10.37236/21).
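A hedged sketch that loads one of the Graph6 files with networkx and verifies the first condition, i.e., that the first color (the complement of the stored graph) is triangle-free; checking the absence of Km-e in the second color would additionally require a subgraph-isomorphism test. The file name is hypothetical.

import networkx as nx

def first_color_triangle_free(g):
    """The first color is the complement of the stored graph; it must contain no K3."""
    return sum(nx.triangles(nx.complement(g)).values()) == 0

graphs = nx.read_graph6("colorings.g6")   # hypothetical file name; returns a graph or a list of graphs
if isinstance(graphs, nx.Graph):
    graphs = [graphs]
print(all(first_color_triangle_free(g) for g in graphs))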
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a compilation of processed data on citations and references for research papers, including their author, institution and open access info, for a selected sample of academics analysed using Microsoft Academic Graph (MAG) data and CORE. The data for this dataset was collected from December 2019 to January 2020. Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one csv file per country and per question (36 files in total). More details about the creation of this dataset are available in the public ON-MERRIT D3.1 deliverable report.
The dataset is a combination of two different data sources: one part is a dataset created by analysing promotion policies across the target countries, while the second part is a set of data points available to understand publishing behaviour. To facilitate the analysis, the dataset is organised in the following seven folders:
PRT
The dataset with the file name "PRT_policies.csv" contains the related information as this was extracted from promotion, review and tenure (PRT) policies.
Q1: What % of papers coming from a university are Open Access?
- Dataset Name format: oa_status_countryname_papers.csv
- Dataset Contents: Open Access (OA) status of all papers of all the universities listed in Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if there is at least one OA link available. OA links are collected using the CORE Discovery API.
- Important considerations about this dataset:
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to.
  - The service we used to recognise if a paper is OA, CORE Discovery, does not contain entries for all paperids in MAG. This implies that some of the records in the extracted dataset will not have either a true or false value for the is_OA field.
  - Only those records marked as true for the is_OA field can be said to be OA. Others with false or no value for the is_OA field have unknown status (i.e. not necessarily closed access).
Q2: How are papers, published by the selected universities, distributed across the three scientific disciplines of our choice?
- Dataset Name format: fsid_countryname_papers.csv
- Dataset Contents: For the given country, all papers for all the universities listed in THEWUR with the information on the fieldofstudy they belong to.
- Important considerations about this dataset:
  - MAG can associate a paper with multiple fieldofstudyids. If a paper belongs to more than one of our fieldofstudyids, separate records were created for the paper with each of those fieldofstudyids.
  - MAG assigns a fieldofstudyid to every paper with a score. We preserve only those records whose score is more than 0.5 for any fieldofstudyid it belongs to.
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.
Q3: What is the gender distribution in authorship of papers published by the universities?
- Dataset Name format: author_gender_countryname_papers.csv
- Dataset Contents: All papers with their author names for all the universities listed in THEWUR.
- Important considerations about this dataset:
  - When there are multiple collaborators (authors) for the same paper, this dataset makes sure that only the records for collaborators from within the selected universities are preserved.
  - An external script was executed to determine the gender of the authors. The script is available here.
Q4: Distribution of staff seniority (= number of years from their first publication until the last publication) in the given university.
- Dataset Name format: author_ids_countryname_papers.csv
- Dataset Contents: For a given country, all papers for authors with their publication year for all the universities listed in THEWUR.
- Important considerations about this dataset:
  - When there are multiple collaborators (authors) for the same paper, this dataset makes sure that only the records for collaborators from within the selected universities are preserved.
  - Calculating staff seniority can be achieved in various ways. The most straightforward option is to calculate it as academic_age = MAX(year) - MIN(year) for each authorid.
Q5: Citation counts (incoming) for OA vs Non-OA papers published by the university.
- Dataset Name format: cc_oa_countryname_papers.csv
- Dataset Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each of those papers, the count of incoming citations available in MAG.
- Important considerations about this dataset:
  - CORE Discovery was used to establish the OA status of papers.
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to.
  - Only those records marked as true for the is_OA field can be said to be OA. Others with false or no value for the is_OA field have unknown status (i.e. not necessarily closed access).
Q6: Count of OA vs Non-OA references (outgoing) for all papers published by universities.
- Dataset Name format: rc_oa_countryname_-papers.csv
- Dataset Contents: Counts of all OA and unknown papers referenced by all papers published by all the universities listed in THEWUR.
- Important considerations about this dataset:
  - CORE Discovery was used to establish the OA status of the papers being referenced.
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.
Additional files:
- fieldsofstudy_mag.csv: this file contains a dump of the fieldsofstudy table of MAG, mapping each of the ids to their actual field of study name.
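A hedged pandas sketch for the Q1 files: computing the share of OA papers per institution from an oa_status_countryname_papers.csv file. Only the is_OA field is documented above; the country file name and the "institution" column name are assumptions, and records without an explicit true value are treated as unknown rather than closed.

import pandas as pd

df = pd.read_csv("oa_status_austria_papers.csv")   # hypothetical country file following the Q1 naming scheme

# Treat only an explicit true value as OA; false/missing values remain "unknown status".
df["is_oa_true"] = df["is_OA"].astype(str).str.lower() == "true"

# "institution" is an assumed column name; adjust it to the actual header of the file.
oa_share = df.groupby("institution")["is_oa_true"].mean().sort_values(ascending=False)
print(oa_share.head(10))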