This dataset consists of cartographic data in digital line graph (DLG) form for the northeastern states (Connecticut, Maine, Massachusetts, New Hampshire, New York, Rhode Island, and Vermont). Information is presented on two planimetric base categories, political boundaries and administrative boundaries, each available in two formats: the topologically structured format and a simpler format optimized for graphic display. These DLG data can be used to plot base maps and for various kinds of spatial analysis. They may also be combined with other geographically referenced data, such as the Geographic Names Information System, to facilitate analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This example can be viewed by uploading S1 Data into the web-based tool (http://statistika.mfub.bg.ac.rs/interactive-graph/). (XML)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends

TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds to the temporal range 2010–2019, and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014.

This publication entry also includes an Excel template that produces the same visualizations without any need to interact with code, though minor modifications will be needed to accommodate year ranges other than those provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624

TSMx sample chart from the supplied Excel template.
Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).

TSMx R script:

# import packages
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(tidyr)
library(forcats)
library(Kendall)

options(warn = -1) # disable warnings

# read data (.csv file with "Year" and "Value" columns)
data <- read_csv("EVI.csv")

# prepare row/column names for output matrices
years <- data %>% pull("Year")
r.names <- years[-length(years)]
c.names <- years[-1]
years <- years[-length(years)]

# initialize output matrices
sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

# function to return remaining years given a start year
getRemain <- function(start.year) {
  years <- data %>% pull("Year")
  start.ind <- which(data[["Year"]] == start.year) + 1
  remain <- years[start.ind:length(years)]
  return(remain)
}

# function to subset data for a start/end year combination
splitData <- function(end.year, start.year) {
  keep <- which(data[["Year"]] >= start.year & data[["Year"]] <= end.year)
  batch <- data[keep, ]
  return(batch)
}

# function to fit linear regression and return slope direction
fitReg <- function(batch) {
  trend <- lm(Value ~ Year, data = batch)
  slope <- coefficients(trend)[[2]]
  return(sign(slope))
}

# function to fit linear regression and return slope magnitude
fitRegv2 <- function(batch) {
  trend <- lm(Value ~ Year, data = batch)
  slope <- coefficients(trend)[[2]]
  return(slope)
}

# function to implement Mann-Kendall (MK) trend test and return significance
# the test is implemented only for n >= 8
getMann <- function(batch) {
  if (nrow(batch) >= 8) {
    mk <- MannKendall(batch[["Value"]])
    pval <- mk[["sl"]]
  } else {
    pval <- NA
  }
  return(pval)
}

# function to return slope direction for all combinations given a start year
getSign <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  signs <- lapply(combs, fitReg)
  return(signs)
}

# function to return MK significance for all combinations given a start year
getPval <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  pvals <- lapply(combs, getMann)
  return(pvals)
}

# function to return slope magnitude for all combinations given a start year
getMagn <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  magns <- lapply(combs, fitRegv2)
  return(magns)
}

# retrieve slope direction, MK significance, and slope magnitude
signs <- lapply(years, getSign)
pvals <- lapply(years, getPval)
magns <- lapply(years, getMagn)

# fill in output matrices
dimension <- nrow(sign.matrix)
for (i in 1:dimension) {
  sign.matrix[i, i:dimension] <- unlist(signs[i])
  pval.matrix[i, i:dimension] <- unlist(pvals[i])
  slope.matrix[i, i:dimension] <- unlist(magns[i])
}
sign.matrix <-...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains bitcoin transfer transactions extracted from the Bitcoin Mainnet blockchain.
Part1 is available at https://zenodo.org/deposit/7157356
Part3 is available at https://zenodo.org/deposit/7158133
Part4 is available at https://zenodo.org/deposit/7158328
Details of the datasets are given below:
FILENAME FORMAT:
The filenames have the following format:
btc-tx-<start_block>-<end_block>-<part>.bz2
where <start_block> is the starting block number, <end_block> is the final block number, and <part> is the split part of the file.
For example, the file btc-tx-100000-149999-aa.bz2 and the rest of its parts, if any, contain transactions from
block 100000 to block 149999 inclusive.
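The naming scheme above can be parsed with a short sketch; the helper below is hypothetical (not part of the dataset tooling) and assumes the part suffix is lowercase letters as in the example.

```python
import re

# Hypothetical helper: parse btc-tx-<start_block>-<end_block>-<part>.bz2
FILENAME_RE = re.compile(r"btc-tx-(\d+)-(\d+)-([a-z]+)\.bz2")

def parse_filename(name):
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    start, end, part = m.groups()
    return int(start), int(end), part
```

For example, parse_filename("btc-tx-100000-149999-aa.bz2") returns (100000, 149999, "aa").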
The files are compressed with bzip2 and can be uncompressed with the bunzip2 command.
TRANSACTION FORMAT:
Each line in a file corresponds to a transaction. The transaction has the following format:
Type of transaction (i.e. BTC-IN or BTC-OUT).
Number of the block which contains the transaction.
Position of the transaction in the block (i.e. transaction number in the block).
Source bitcoin address/transaction of the transfer.
Destination bitcoin address/transaction of the transfer.
Amount of transfer.
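A transaction line listing the six fields above could be parsed roughly as follows. This is a sketch only: the field separator is assumed to be whitespace, and the amount is kept as a string because its unit and precision are not specified above.

```python
# Sketch: parse one transaction line, assuming whitespace-separated
# fields in the order listed above (an assumption, not a documented spec).
def parse_tx_line(line):
    tx_type, block, pos, source, dest, amount = line.split()
    return {
        "type": tx_type,      # BTC-IN or BTC-OUT
        "block": int(block),  # number of the containing block
        "index": int(pos),    # transaction number in the block
        "source": source,     # source address/transaction
        "dest": dest,         # destination address/transaction
        "amount": amount,     # transfer amount (unit not specified; kept raw)
    }
```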
BLOCK TIME FORMAT:
The block time file has the following format:
Number of the block.
Unix timestamp at which the block is mined as a hexadecimal number.
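Since the mining time is stored as a hexadecimal Unix timestamp, converting it back to a date is a one-liner; the hex value in the example below is illustrative, not taken from the dataset.

```python
from datetime import datetime, timezone

# Convert a hexadecimal Unix timestamp (as stored in the block time file)
# into a timezone-aware datetime.
def block_time(hex_ts):
    return datetime.fromtimestamp(int(hex_ts, 16), tz=timezone.utc)
```

For example, block_time("5E0BE100") gives 2020-01-01 00:00:00 UTC.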
IMPORTANT NOTE:
Public Bitcoin Mainnet blockchain data is open and can be obtained by connecting as a node on the blockchain or by using block explorer web sites such as https://btcscan.org . Downloaders and users of this dataset accept full responsibility for using the data in a manner compliant with the GDPR and any other applicable regulations. The data is provided as is, and we cannot be held responsible for its use.
NOTE:
If you use this dataset, please do not forget to include the DOI in the citation.
If you use our dataset in your research, please also cite our paper: https://link.springer.com/chapter/10.1007/978-3-030-94590-9_14
@incollection{kilicc2022analyzing,
  title={Analyzing Large-Scale Blockchain Transaction Graphs for Fraudulent Activities},
  author={K{\i}l{\i}{\c{c}}, Baran and {\"O}zturan, Can and {\c{S}}en, Alper},
  booktitle={Big Data and Artificial Intelligence in Digital Finance},
  pages={253--267},
  year={2022},
  publisher={Springer, Cham}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Patient-drug-disease (PDD) graph dataset, built from electronic medical records (EMRs) and biomedical knowledge graphs. The novel framework used to construct the PDD graph is described in the associated publication. PDD is an RDF graph consisting of PDD facts, where a PDD fact is represented by an RDF triple indicating that a patient takes a drug or is diagnosed with a disease, for instance (pdd:274671, pdd:diagnosed, sepsis).

Data files are in the .nt N-Triples format, a line-based syntax for an RDF graph, and can be opened with any text editor.

diagnose_icd_information.nt - contains RDF triples mapping patients to diagnoses. For example: (pdd:18740, pdd:diagnosed, icd99592), where pdd:18740 is a patient entity and icd99592 is the ICD-9 code for sepsis.

drug_patients.nt - contains RDF triples mapping patients to drugs. For example: (pdd:18740, pdd:prescribed, aspirin), where pdd:18740 is a patient entity and aspirin is the drug's name.

Background: Electronic medical records contain multi-format electronic medical data that comprise an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make sound medical decisions based on professional knowledge that accurately grasps the relationships between symptoms, diagnoses, and corresponding treatments. In the associated paper, we aim to capture these relationships by constructing a large, high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs. Specifically, we propose a novel framework to extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them to existing biomedical knowledge graphs, including the ICD-9 ontology and DrugBank.

The PDD graph presented in this paper is accessible on the Web via a SPARQL endpoint as well as in .nt format in this repository, and provides a pathway for medical discovery and applications such as effective treatment recommendations.

De-identification: MIMIC-III contains clinical information about patients. Although the protected health information was de-identified, researchers who seek to use more clinical data should complete an online training course and then apply for permission to download the complete MIMIC-III dataset: https://mimic.physionet.org/
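Because N-Triples is line-based, the .nt files can be scanned with a minimal sketch like the one below; for real work an RDF library such as rdflib is the safer choice, since this regex ignores literals with escapes and other corner cases.

```python
import re

# Minimal sketch: split one N-Triples statement into its three terms.
# Not a full N-Triples parser; assumes simple, well-formed lines.
TRIPLE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")

def parse_triple(line):
    m = TRIPLE_RE.match(line)
    if m is None:
        raise ValueError(f"not an N-Triples statement: {line!r}")
    return m.groups()  # (subject, predicate, object)
```

The URIs in the usage below are illustrative placeholders, not the dataset's actual namespaces.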
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this project, we aimed to map the design space of visualisations embedded in right-to-left (RTL) scripts, expanding our knowledge of visualisation design beyond the dominance of research based on left-to-right (LTR) scripts. Through this project, we identify common design practices regarding the chart structure, the text, and the source. We also identify ambiguity, particularly regarding axis position and direction, suggesting that the community may benefit from unified standards similar to those found in web design for RTL scripts. To achieve this goal, we curated a dataset covering 128 visualisations found in Arabic news media and coded these visualisations based on the chart composition (e.g., chart type, x-axis direction, y-axis position, legend position, interaction, embellishment type), text (e.g., availability of text, availability of caption, annotation type), and source (source position, attribution to designer, ownership of the visualisation design). Links are also provided to the articles and the visualisations. This dataset is limited to stand-alone visualisations, whether single-panelled or composed of small multiples. We did not consider infographics in this project, nor any visualisation without an identifiable chart type (e.g., bar chart, line chart). The attached documents also include some graphs from our analysis of the dataset provided, where we illustrate common design patterns and their popularity within our sample.
https://choosealicense.com/licenses/cdla-permissive-2.0/
SynthChartNet
SynthChartNet is a multimodal dataset designed for training the SmolDocling model on chart-based document understanding tasks. It consists of 1,981,157 synthetically generated samples, where each image depicts a chart (e.g., line chart, bar chart, pie chart, stacked bar chart), and the associated ground truth is given in OTSL format. Charts were rendered at 120 DPI using a diverse set of visualization libraries: Matplotlib, Seaborn, and Pyecharts, enabling… See the full description on the dataset page: https://huggingface.co/datasets/ds4sd/SynthChartNet.
Use the Chart Viewer template to display bar charts, line charts, pie charts, histograms, and scatterplots to complement a map. Include multiple charts to view with a map or side by side with other charts for comparison. Up to three charts can be viewed side by side or stacked, but you can access and view all the charts that are authored in the map.
Examples:
• Present a bar chart representing average property value by county for a given area.
• Compare charts based on multiple population statistics in your dataset.
• Display an interactive scatterplot based on two values in your dataset along with an essential set of map exploration tools.
Data requirements
The Chart Viewer template requires a map with at least one chart configured.
Key app capabilities
• Multiple layout options - Choose Stack to display charts stacked with the map, or choose Side by side to display charts side by side with the map.
• Manage chart - Reorder, rename, or turn charts on and off in the app.
• Multiselect chart - Compare two charts in the panel at the same time.
• Bookmarks - Allow users to zoom and pan to a collection of preset extents that are saved in the map.
• Home, Zoom controls, Legend, Layer List, Search
Supportability
This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains, up to isomorphism, all (15_4,20_3) and (15_5,25_3) configurations, all (16_6,32_3) configurations with nontrivial automorphisms, as well as all 4-regular graphs on 15 vertices, 6-regular graphs on 15 vertices, 3-regular graphs on 16 vertices, and 4-regular graphs on 17 vertices. The configurations uniquely give regular linear spaces with parameters (15|2^45,3^20), (15|2^30,3^25), and (16|2^24,3^32). All files are compressed with gzip.
The dataset supplements the publication "On the Regular Linear Spaces up to Order 16" by Anton Betten, Dieter Betten, Daniel Heinlein, and Patric R. J. Östergård.
In the files containing configurations, each line is a configuration with the syntax
<number of points> <number of blocks> <characteristic vectors of the blocks in hexadecimal> A<order of the automorphism group>
Example:
Assuming a total of 15 points labeled with {0,...,14}, the characteristic vector of a block {1,3,14} is
(0)100|0000|0000|1010
The first bit is padding as each hexadecimal number encodes four bits. Vertical bars designate groups of four bits. Consequently, the block is encoded as
400a
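The worked example can be checked with a short sketch that decodes a hexadecimal characteristic vector back into a point set; following the example above, the least-significant bit corresponds to point 0 and the topmost bit is padding.

```python
# Decode a block's hexadecimal characteristic vector into a set of points.
# Assumes (per the worked example) that bit p encodes point p, with the
# most-significant bit used as padding.
def decode_block(hex_vector, n_points):
    value = int(hex_vector, 16)
    return {p for p in range(n_points) if value >> p & 1}
```

With 15 points, decode_block("400a", 15) recovers the block {1, 3, 14}.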
The following example shows the first line of one of the files:
$ zcat conf_15_4_20_3.txt.gz | head -n1
15 20 1081 4101 2201 0c01 0026 004a 0092 4402 008c 0054 0a04 0038 2108 1110 0160 0620 08c0 5200 3400 6800 A1
For the files containing graphs, we apply the graph6 file format but we extend each line by the corresponding number of automorphisms as described for configurations above, without the letter A. Programs for manipulating graphs in the graph6 format can be found in the gtools package that comes with the graph isomorphism program nauty (https://pallini.di.uniroma1.it/). Details regarding the graph6 format can be found in the documentation of nauty (https://pallini.di.uniroma1.it/Guide.html).
For graphs with at most 62 vertices, which holds in all cases here, a line in graph6 format is the ASCII-converted equivalent of the number of vertices plus 63, followed by the bits of the upper triangle of the adjacency matrix read column-wise, padded with zeros to a multiple of six, split into six-bit groups, and encoded by adding 63 to each group.
Example:
Assume a graph with 5 vertices and edges: 02, 04, 13, 34 (the path 2-0-4-3-1), which has the adjacency matrix
00101
00010
10000
01001
10010
Hence, the upper triangle read column-wise is
0100101001
After padding we get
010010100100
and after grouping
010010|100100
Converting to decimal and adding 63 gives
63+16+2|63+32+4
that is
81|99
The number of vertices is 5, so we prepend 5+63=68:
68 81 99
The line in graph6 format is therefore
DQc
and our nonstandard appending of the order of the automorphism group gives
DQc 2
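The worked example above can be reproduced with a short sketch of the graph6 encoding for graphs with at most 62 vertices; this is an illustration, not nauty's implementation.

```python
# Encode a small graph (n <= 62 vertices) in graph6 format:
# chr(n + 63), then the upper triangle of the adjacency matrix read
# column-wise, zero-padded to a multiple of six bits, six bits per
# character, each character offset by 63.
def graph6_encode(n, edges):
    es = {frozenset(e) for e in edges}
    bits = []
    for j in range(1, n):           # columns of the upper triangle
        for i in range(j):          # rows above the diagonal
            bits.append(1 if frozenset((i, j)) in es else 0)
    while len(bits) % 6:            # pad to a multiple of six
        bits.append(0)
    out = chr(n + 63)
    for k in range(0, len(bits), 6):
        group = int("".join(map(str, bits[k:k + 6])), 2)
        out += chr(group + 63)
    return out
```

Encoding the 5-vertex path with edges 02, 04, 13, 34 yields "DQc", as in the example.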
The first line of one of the files is as follows:
$ zcat graph_15_4.txt.gz | head -n1
Ns_???BAwjDoTOY_M_? 2
The orders of the automorphism groups and the numbers of isomorphism classes are as follows. The (up to isomorphism) 114711393113 (16_6,32_3) regular linear spaces with no nontrivial automorphisms are not stored.
Order | (15_4,20_3) | (15_5,25_3) | (16_6,32_3) |
---|---|---|---|
1 | 251712191 | 1442354689 | 114711393113 |
2 | 94229 | 180367 | 1125379 |
3 | 1129 | 2178 | 17287 |
4 | 915 | 936 | 3054 |
5 | 29 | 33 | |
6 | 142 | 180 | 240 |
8 | 85 | 36 | 50 |
9 | 4 | ||
10 | 4 | 4 | |
12 | 10 | 13 | 30 |
15 | 1 | ||
16 | 7 | 3 | |
18 | 4 | 3 | 2 |
20 | 2 | 2 | |
24 | 10 | 5 | 2 |
30 | 1 | ||
32 | 1 | ||
36 | 4 | 2 | |
40 | 2 | 1 | |
48 | 4 | 1 | |
72 | 1 | ||
96 | 1 | ||
120 | 1 | ||
600 | 1 | ||
720 | 1 | ||
total | 251808770 | 1442538454 | 114712539165 |
Order | 4-regular graphs with 15 vertices | 6-regular graphs with 15 vertices | 3-regular graphs with 16 vertices | 4-regular graphs with 17 vertices |
---|---|---|---|---|
1 | 656794 | 1396131168 | 1547 | 76356249 |
2 | 119881 | 69928313 | 1261 | 8665624 |
3 | 17 | 630 | 2 | 127 |
4 | 21500 | 3848635 | 667 | 997704 |
5 | 14 | |||
6 | 409 | 55060 | 15 | 27213 |
8 | 4789 | 274294 | 330 | 131662 |
10 | 10 | 35 | ||
12 | 352 | 21334 | 11 | 12577 |
14 | 4 | |||
16 | 1020 | 23435 | 147 | 19786 |
18 | 1 | 10 | 2 | |
20 | 7 | 12 | ||
24 | 210 | 5596 | 11 | 4344 |
28 | 18 | |||
30 | 4 | 7 | ||
32 | 243 | 2463 | 51 | 3320 |
34 | 3 | |||
36 | 1 | 128 | 53 | |
48 | 106 | 1453 | 33 | 1500 |
56 | 1 | 15 | ||
60 | 2 | 2 | ||
64 | 54 | 285 | 16 | 639 |
68 | 1 | |||
72 | 6 | 165 | 2 | 96 |
96 | 41 | 309 | 24 | 504 |
112 | 7 | |||
120 | 5 | 692 | ||
128 | 10 | 48 | 4 | 132 |
140 | 1 | |||
144 | 10 | 74 | 3 | 82 |
168 | 1 | 1 | ||
192 | 14 | 77 | 20 | 193 |
216 | 2 | 3 | ||
224 | 2 | 6 | ||
240 | 18 | 1 | 2 | 497 |
256 | 1 | 6 | 1 | 24 |
280 | 1 | |||
288 | 5 | 36 | 9 | 53 |
320 | 4 | |||
384 | 6 | 26 | 11 | 58 |
432 | 9 | 3 | 2 | |
448 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown on the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown on the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of two parts: (1) Patterns of nutrient release from carbon sources of wetland plants. After the experiment began, water samples were collected at the same intervals; the original and average concentrations of TOC and TN in each sample were measured and recorded, and line charts were drawn. (2) Data on the influence of carbon source materials on the nitrogen removal performance of Argento, Canna, and corncob. From December 8 to April 27, 2019, water samples from each treatment were collected at the same times; the original concentrations, average concentrations, carbon source utilization rates, and nitrogen removal efficiencies of TOC, NO3--N, NH4+-N, and TN were measured and recorded, and line charts were drawn.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OAGT is a paper topic dataset consisting of 6,942,930 records which comprise various scientific publication attributes like abstracts, titles, keywords, publication years, venues, etc. The last two fields of each record are the topic id from a taxonomy of 27 topics created from the entire collection and the 20 most significant topic words. Each dataset record (sample) is stored as a JSON line in the text file.
The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.
This data (OAGT Paper Topic Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Erion Çano, Benjamin Roth: Topic Segmentation of Research Article Collections. ArXiv 2022, CoRR abs/2205.11249, https://doi.org/10.48550/arXiv.2205.11249
A Snellen chart is an eye chart that can be used to measure visual acuity. The Snellen chart is printed with eleven lines of block letters. The first line consists of one very large letter, which may be one of several letters, for example E, H, or N. Subsequent rows have increasing numbers of letters that decrease in size. A person taking the test covers one eye from 6 metres/20 feet away and reads aloud the letters of each row, beginning at the top. The smallest row that can be read accurately indicates the visual acuity of that eye. In the NTR, the Snellen test was administered at the MRI scanner.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template is presented and made freely available that allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed-designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match with individual needs. A variety of example applications of syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS a valuable tool to move towards more transparency in data visualization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs
Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command (see also the neo4j documentation: https://neo4j.com/docs/):
/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph
Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")
:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log
:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationship - placeholder for any structural relationship between two :Entity nodes
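The interplay of :CORR and :DF above can be illustrated in plain Python (this is an in-memory sketch, not the Cypher used to build the actual graph): for each entity, order the events correlated to it by timestamp, and link each event to its direct successor.

```python
from collections import defaultdict

# Sketch: derive directly-follows (:DF) pairs per entity from events
# correlated (:CORR) to that entity. Each event is represented here as
# a tuple (event_id, timestamp, set_of_entity_ids).
def directly_follows(events):
    per_entity = defaultdict(list)
    for eid, ts, entities in events:
        for ent in entities:
            per_entity[ent].append((ts, eid))
    df = set()
    for ent, evs in per_entity.items():
        evs.sort()  # order this entity's events by timestamp
        for (_, a), (_, b) in zip(evs, evs[1:]):
            df.add((a, b, ent))  # a is directly followed by b for ent
    return df
```

Note that, as in the schema above, both events of a pair are correlated to the same entity, and the resulting relation is acyclic when timestamps are strictly increasing.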
The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552
Data Contents
-------------
neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1
This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints, and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data: (1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based flag and the Goods Receipt flag set to true). (2) 3-way matching, invoice before goods receipt: Purchase items that do require a goods receipt message, while they do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flag set to true). For such purchase items, invoices can be entered before the goods are received, but they are blocked until the goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches the invoice and the value at creation of the item. (3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until the PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based flag and the Goods Receipt flag set to false). (4) Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here the GR flag is set to true but the GR IV flag is set to false, and we also know by the item type (consignment) that we do not expect an invoice against this item.
Unfortunately, the complexity of the data goes further than just this division into four categories. For each purchase item there can be many goods receipt messages and corresponding invoices, which are subsequently paid. Consider, for example, the process of paying rent: there is a purchase document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices, each with a value equal to 1/12 of the total amount. For logistical services there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable) and the invoices have to match for the process to be compliant. Of course, the log is anonymized, but some semantics are left in the data. For example: the resources are split between batch users and normal users, indicated by their names. The batch users are automated processes executed by different systems; the normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e. the addition of multiple invoices for a single item should still lead to the original item value (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them into business insights about real customers and real purchase documents.
The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded:
- concept:name: a combination of the purchase document ID and the item ID
- Purchasing Document: the purchasing document ID
- Item: the item ID
- Item Type: the type of the item
- GR-Based Inv. Verif.: flag indicating if GR-based invoicing is required (see above)
- Goods Receipt: flag indicating if 3-way matching is required (see above)
- Source: the source system of this item
- Doc. Category name: the name of the category of the purchasing document
- Company: the subsidiary of the company from where the purchase originated
- Spend classification text: a text explaining the class of purchase item
- Spend area text: a text explaining the area for the purchase item
- Sub spend area text: another text explaining the area for the purchase item
- Vendor: the vendor to which the purchase document was sent
- Name: the name of the vendor
- Document Type: the document type
- Item Category: the category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment)
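Since concept:name is described as a combination of the purchase document ID and the item ID, distinct cases can be counted by building that combined key. A minimal sketch; the underscore separator and the sample IDs are assumptions for illustration, not taken from the log:

```python
def case_id(purchasing_document: str, item: str) -> str:
    # concept:name combines the purchase document ID and the item ID;
    # the underscore separator here is an assumption for illustration.
    return f"{purchasing_document}_{item}"

# Counting distinct cases from a stream of (document, item) event pairs:
events = [("4507000001", "00010"), ("4507000001", "00010"), ("4507000001", "00020")]
cases = {case_id(doc, item) for doc, item in events}
print(len(cases))  # 2 distinct cases for 3 events
```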
The data contains the following entities and their events:
- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased
Data Size
---------
BPIC19, nodes: 1926651, relationships: 15082099
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years, up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, driven by increased demand during the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.
Storage capacity is also growing. Only a small percentage of this newly created data is kept, though: just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
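The compound annual growth rate referenced above follows the standard formula CAGR = (end/start)^(1/years) - 1. A small sketch with purely hypothetical values, since the actual zettabyte figures are redacted above:

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

# Hypothetical example: capacity doubling over a 5-year forecast period.
print(round(cagr(1.0, 2.0, 5) * 100, 1))  # ~14.9 percent per year
```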
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/P0RROU
The Value Line Investment Survey is one of the oldest continuously running investment advisory publications. Since 1955, the Survey has been published in multiple formats including print, loose-leaf, microfilm and microfiche. Data from 1997 to present is now available online. The Survey tracks 1700 stocks across 92 industry groups. It provides reported and projected measures of firm performance, proprietary rankings and analysis for each stock on a quarterly basis.
DATA AVAILABLE FOR YEARS: 1980-1989
This dataset, a subset of the Survey covering the years 1980-1989, has been digitized from the microfiche collection available at the Dewey Library (FICHE HG 4501.V26). It is only available to MIT students and faculty for academic research. Published weekly, each edition of the Survey has the following three parts:
Summary & Index: includes an alphabetical listing of all industries with their relative ranking and the page number for detailed industry analysis. It also includes an alphabetical listing of all stocks in the publication with references to their location in Part 3, Ratings & Reports.
Selection & Opinion: contains the latest economic and stock market commentary and advice, along with one or more pages of research on interesting stocks or industries and a variety of pertinent economic and stock market statistics. It also includes three model stock portfolios.
Ratings & Reports: the core of the Value Line Investment Survey. Preceded by an industry report, each one-page stock report within that industry includes Timeliness, Safety and Technical rankings, 3- to 5-year analyst forecasts for stock prices, income and balance sheet items, up to 17 years of historical data, and Value Line analysts' commentaries. The report also contains stock price charts, quarterly sales, earnings, and dividend information.
Publication Schedule: Each edition of the Survey covers around 130 stocks in seven to eight industries on a preset sequential schedule, so that all 1700 stocks are analyzed once every 13 weeks, i.e. each quarter. All editions are numbered 1-13 within each quarter. For example, in 1980, reports for Chrysler appear in edition 1 of each quarter on the following dates: January 4, 1980 (page 132); April 4, 1980 (page 133); July 4, 1980 (page 133); October 1, 1980 (page 133). Reports for Coca-Cola were published in edition 10 of each quarter on: March 7, 1980 (page 1514); June 6, 1980 (page 1518); Sept. 5, 1980 (page 1517); Dec. 5, 1980 (page 1548). Any significant news affecting a stock between quarters is covered in the supplementary reports that appear at the end of Part 3, Ratings & Reports.
File format: Digitized files within this dataset are in PDF format and are arranged by publication date within each compressed annual folder.
How to Consult the Value Line Investment Survey: To find reports on a particular stock, consult the alphabetical listing of stocks in the Summary & Index part of the relevant weekly edition. Look for the page number just to the left of the company name, then use the table below to identify the edition where that page number appears. All editions within a given quarter are numbered 1-13 and follow equally sized page ranges for stock reports. The table provides page ranges for stock reports within editions 1-13 of 1980 Q1; it can be used to identify edition and page numbers for any quarter within a given year.
Ratings & Reports
Edition  Pub. Date  Pages
1        04-Jan-80  100-242
2        11-Jan-80  250-392
3        18-Jan-80  400-542
4        25-Jan-80  550-692
5        01-Feb-80  700-842
6        08-Feb-80  850-992
7        15-Feb-80  1000-1142
8        22-Feb-80  1150-1292
9        29-Feb-80  1300-1442
10       07-Mar-80  1450-1592
11       14-Mar-80  1600-1742
12       21-Mar-80  1750-1908
13       28-Mar-80  2000-2142
Another way to navigate to the Ratings & Reports part of an edition is to look around page 50 within the PDF document.
Note that the page numbers of the PDF will not match those within the publication.
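The lookup described above can be automated. A sketch that encodes the 1980 Q1 page-range table verbatim and returns the edition number for a given Ratings & Reports page:

```python
# Page ranges for editions 1-13 of 1980 Q1, copied from the table above.
EDITION_PAGES = [
    (1, 100, 242), (2, 250, 392), (3, 400, 542), (4, 550, 692),
    (5, 700, 842), (6, 850, 992), (7, 1000, 1142), (8, 1150, 1292),
    (9, 1300, 1442), (10, 1450, 1592), (11, 1600, 1742),
    (12, 1750, 1908), (13, 2000, 2142),
]

def edition_for_page(page: int) -> int:
    """Return the edition (1-13) whose stock-report page range contains `page`."""
    for edition, first, last in EDITION_PAGES:
        if first <= page <= last:
            return edition
    raise ValueError(f"page {page} is outside all edition ranges")

print(edition_for_page(132))   # 1  (Chrysler, January 4, 1980)
print(edition_for_page(1514))  # 10 (Coca-Cola, March 7, 1980)
```

An explicit table lookup is used rather than an arithmetic formula, because the ranges are not perfectly uniform (edition 12 ends at page 1908 and edition 13 starts at 2000).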
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set is a collection of environmental records associated with individual events. It has been generated using the serdif-api wrapper (https://github.com/navarral/serdif-api) when sending a CSV file with example events for the Republic of Ireland. The serdif-api sends a semantic query that (i) selects the environmental data sets within the region of the event, (ii) filters by the specific period of interest from the event, and (iii) aggregates the data sets using the minimum, maximum, average or sum for each of the available variables for a specific time unit. The aggregation method and the time unit can be passed to the serdif-api through the Command Line Interface (CLI) (see example in https://github.com/navarral/serdif-api). The resulting data set format can also be specified as a data table (CSV) or as a graph (RDF) for analysis and publication as FAIR data. The research-ready data is retrieved as a zip file that contains:
- data as csv: environmental data associated to particular events as a data table
- data as rdf: environmental data associated to particular events as a graph
- metadata for publication as rdf: a metadata record with generalized information about the data that does not contain personal data, and is therefore publishable
- metadata for research as rdf: metadata records with detailed information about the data, such as individual dates, regions, data sets used and data lineage, which could lead to data privacy issues if published without approval from the Data Protection Officer (DPO) and data controller
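A minimal sketch of opening the returned zip and reading the CSV member with the Python standard library. The member filename used here is an assumption for illustration; the actual names inside the archive may differ:

```python
import csv
import io
import zipfile

def read_event_env_csv(zip_path: str, member: str = "dataAsCSV.csv"):
    """Yield rows (as dicts) from a CSV member inside the retrieved zip.
    The default member name is hypothetical, not from the serdif docs."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            reader = csv.DictReader(io.TextIOWrapper(fh, encoding="utf-8"))
            yield from reader
```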
This dataset contains ether transfer as well as popular ERC20 token transfer transactions extracted from the Ethereum Mainnet blockchain.
Only send-ether, contract function call, and contract deployment transactions are present in the dataset. Miner reward transactions are not currently included.
Details of the datasets are given below:
FILENAME FORMAT:
The filenames have the following format:
eth-tx-<first-block>-<last-block>.txt.bz2
where <first-block> and <last-block> are the first and last block numbers covered by the file.
For example file eth-tx-1000000-1099999.txt.bz2 contains transactions from
block 1000000 to block 1099999 inclusive.
The files are compressed with bzip2. They can be uncompressed using command bunzip2.
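The compressed files can also be read directly, without first uncompressing them to disk. A sketch using Python's built-in bz2 module:

```python
import bz2

def read_transactions(path: str):
    """Yield one transaction line at a time from an eth-tx-*.txt.bz2 file,
    decompressing on the fly."""
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```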
TRANSACTION FORMAT:
Each line in a file corresponds to a transaction. The transaction has the following format:
units. ERC20 token transfers (transfer and transferFrom function calls in ERC20
contracts) are indicated by the token symbol. For example, GUSD is the Gemini USD
stablecoin. The JSON file erc20tokens.json given below contains the details of these ERC20 tokens.
decoder-error.txt FILE:
This file contains, one per line, the transactions (block no, tx no, tx hash) that produced
an error while decoding calldata. These transactions are not present in the data files.
erc20tokens.json FILE:
This file contains the list of popular ERC20 token contracts whose transfer/transferFrom
transactions appear in the data files.
-------------------------------------------------------------------------------------------
[
{
"address": "0xdac17f958d2ee523a2206206994597c13d831ec7",
"decdigits": 6,
"symbol": "USDT",
"name": "Tether-USD"
},
{
"address": "0xB8c77482e45F1F44dE1745F52C74426C631bDD52",
"decdigits": 18,
"symbol": "BNB",
"name": "Binance"
},
{
"address": "0x2af5d2ad76741191d15dfe7bf6ac92d4bd912ca3",
"decdigits": 18,
"symbol": "LEO",
"name": "Bitfinex-LEO"
},
{
"address": "0x514910771af9ca656af840dff83e8264ecf986ca",
"decdigits": 18,
"symbol": "LNK",
"name": "Chainlink"
},
{
"address": "0x6f259637dcd74c767781e37bc6133cd6a68aa161",
"decdigits": 18,
"symbol": "HT",
"name": "HuobiToken"
},
{
"address": "0xf1290473e210b2108a85237fbcd7b6eb42cc654f",
"decdigits": 18,
"symbol": "HEDG",
"name": "HedgeTrade"
},
{
"address": "0x9f8f72aa9304c8b593d555f12ef6589cc3a579a2",
"decdigits": 18,
"symbol": "MKR",
"name": "Maker"
},
{
"address": "0xa0b73e1ff0b80914ab6fe0444e65848c4c34450b",
"decdigits": 8,
"symbol": "CRO",
"name": "Crypto.com"
},
{
"address": "0xd850942ef8811f2a866692a623011bde52a462c1",
"decdigits": 18,
"symbol": "VEN",
"name": "VeChain"
},
{
"address": "0x0d8775f648430679a709e98d2b0cb6250d2887ef",
"decdigits": 18,
"symbol": "BAT",
"name": "Basic-Attention"
},
{
"address": "0xc9859fccc876e6b4b3c749c5d29ea04f48acb74f",
"decdigits": 0,
"symbol": "INO",
"name": "INO-Coin"
},
{
"address": "0x8e870d67f660d95d5be530380d0ec0bd388289e1",
"decdigits": 18,
"symbol": "PAX",
"name": "Paxos-Standard"
},
{
"address": "0x17aa18a4b64a55abed7fa543f2ba4e91f2dce482",
"decdigits": 18,
"symbol": "INB",
"name": "Insight-Chain"
},
{
"address": "0xc011a72400e58ecd99ee497cf89e3775d4bd732f",
"decdigits": 18,
"symbol": "SNX",
"name": "Synthetix-Network"
},
{
"address": "0x1985365e9f78359a9B6AD760e32412f4a445E862",
"decdigits": 18,
"symbol": "REP",
"name": "Reputation"
},
{
"address": "0x653430560be843c4a3d143d0110e896c2ab8ac0d",
"decdigits": 16,
"symbol": "MOF",
"name": "Molecular-Future"
},
{
"address": "0x0000000000085d4780B73119b644AE5ecd22b376",
"decdigits": 18,
"symbol": "TUSD",
"name": "True-USD"
},
{
"address": "0xe41d2489571d322189246dafa5ebde1f4699f498",
"decdigits": 18,
"symbol": "ZRX",
"name": "ZRX"
},
{
"address": "0x8ce9137d39326ad0cd6491fb5cc0cba0e089b6a9",
"decdigits": 18,
"symbol": "SXP",
"name": "Swipe"
},
{
"address": "0x75231f58b43240c9718dd58b4967c5114342a86c",
"decdigits": 18,
"symbol": "OKB",
"name": "Okex"
},
{
"address": "0xa974c709cfb4566686553a20790685a47aceaa33",
"decdigits": 18,
"symbol": "XIN",
"name": "Mixin"
},
{
"address": "0xd26114cd6EE289AccF82350c8d8487fedB8A0C07",
"decdigits": 18,
"symbol": "OMG",
"name": "OmiseGO"
},
{
"address": "0x89d24a6b4ccb1b6faa2625fe562bdd9a23260359",
"decdigits": 18,
"symbol": "SAI",
"name": "Sai Stablecoin v1.0"
},
{
"address": "0x6c6ee5e31d828de241282b9606c8e98ea48526e2",
"decdigits": 18,
"symbol": "HOT",
"name": "HoloToken"
},
{
"address": "0x6b175474e89094c44da98b954eedeac495271d0f",
"decdigits": 18,
"symbol": "DAI",
"name": "Dai Stablecoin"
},
{
"address": "0xdb25f211ab05b1c97d595516f45794528a807ad8",
"decdigits": 2,
"symbol": "EURS",
"name": "Statis-EURS"
},
{
"address": "0xa66daa57432024023db65477ba87d4e7f5f95213",
"decdigits": 18,
"symbol": "HPT",
"name": "HuobiPoolToken"
},
{
"address": "0x4fabb145d64652a948d72533023f6e7a623c7c53",
"decdigits": 18,
"symbol": "BUSD",
"name": "Binance-USD"
},
{
"address": "0x056fd409e1d7a124bd7017459dfea2f387b6d5cd",
"decdigits": 2,
"symbol": "GUSD",
"name": "Gemini-USD"
},
{
"address": "0x2c537e5624e4af88a7ae4060c022609376c8d0eb",
"decdigits": 6,
"symbol": "TRYB",
"name": "BiLira"
},
{
"address": "0x4922a015c4407f87432b179bb209e125432e4a2a",
"decdigits": 6,
"symbol": "XAUT",
"name": "Tether-Gold"
},
{
"address": "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
"decdigits": 6,
"symbol": "USDC",
"name": "USD-Coin"
},
{
"address": "0xa5b55e6448197db434b92a0595389562513336ff",
"decdigits": 16,
"symbol": "SUSD",
"name": "Santender"
},
{
"address": "0xffe8196bc259e8dedc544d935786aa4709ec3e64",
"decdigits": 18,
"symbol": "HDG",
"name": "HedgeTrade"
},
{
"address": "0x4a16baf414b8e637ed12019fad5dd705735db2e0",
"decdigits": 2,
"symbol": "QCAD",
"name": "QCAD"
}
]
-------------------------------------------------------------------------------------------
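The decdigits field above gives the number of decimal places by which raw on-chain integer amounts must be scaled to obtain human-readable token units. A sketch converting raw amounts with Python's Decimal; the two entries used are copied from the erc20tokens.json listing above:

```python
import json
from decimal import Decimal

# Two entries copied from erc20tokens.json above, for illustration.
TOKENS = json.loads("""[
  {"address": "0xdac17f958d2ee523a2206206994597c13d831ec7",
   "decdigits": 6, "symbol": "USDT", "name": "Tether-USD"},
  {"address": "0x056fd409e1d7a124bd7017459dfea2f387b6d5cd",
   "decdigits": 2, "symbol": "GUSD", "name": "Gemini-USD"}
]""")
DECDIGITS = {t["symbol"]: t["decdigits"] for t in TOKENS}

def to_token_units(raw_amount: int, symbol: str) -> Decimal:
    """Scale a raw integer transfer amount by 10**decdigits."""
    return Decimal(raw_amount) / (Decimal(10) ** DECDIGITS[symbol])

print(to_token_units(1_500_000, "USDT"))  # 1.5
print(to_token_units(250, "GUSD"))        # 2.5
```

Decimal is used instead of float so that monetary amounts are represented exactly.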