74 datasets found
1. CBCD: A Chinese Bar Chart Dataset for Data Extraction

    • scidb.cn
    Updated Nov 14, 2025
    Cite
Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan (2025). CBCD: A Chinese Bar Chart Dataset for Data Extraction [Dataset]. http://doi.org/10.57760/sciencedb.j00240.00052
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

In the field of chart datasets, most existing resources are in English, and there are almost no open-source Chinese chart datasets, which limits research and applications related to Chinese charts. This dataset draws on the construction method of the DVQA dataset to create a chart dataset focused on the Chinese environment.

To ensure the authenticity and practicality of the dataset, we first referred to the authoritative website of the National Bureau of Statistics and selected 24 data-label categories widely used in practical applications, totaling 262 specific labels. These categories cover important areas such as socio-economics, demographics, and industrial development. To further enhance the diversity and practicality of the dataset, we also set 10 different numerical dimensions. These dimensions provide not only a rich range of values but also multiple value types, simulating the data distributions and variations that may be encountered in real application scenarios.

The dataset covers the various types of Chinese bar charts that may be encountered in practice. Specifically, it includes not only conventional vertical and horizontal bar charts but also more challenging stacked bar charts, to test the performance of extraction methods on charts of different complexities. In addition, diverse attribute labels are set for each chart type, including but not limited to whether data labels are present and whether the text is rotated by 45° or 90°. These details make the dataset more faithful to real-world scenarios while placing higher demands on data-extraction methods. Beyond the charts themselves, the dataset also provides the corresponding data table and title text for each chart, which is crucial for understanding chart content and verifying the accuracy of extracted results.

Chart images were generated with Matplotlib, the most popular and widely used data-visualization library for Python; its rich features, flexible configuration options, and excellent compatibility have made it the preferred tool of data scientists and researchers for visualization tasks. With Matplotlib, every detail of a chart can be precisely controlled, from the drawing of data points to the annotation of axes, from the addition of legends to the setting of titles, ensuring that the generated images meet the research needs and are highly readable and visually attractive.

The dataset consists of 58,712 pairs of Chinese bar charts and corresponding data tables, divided into training, validation, and testing sets in a 7:2:1 ratio.
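The description does not reproduce the generation code; the minimal sketch below shows the kind of Matplotlib script that could render one such chart (the font, labels, values, and output filename are illustrative assumptions):

import matplotlib
import matplotlib.pyplot as plt

# Assumption: a CJK-capable font (e.g., SimHei) is installed; substitute any
# Chinese font available on your system.
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False  # keep minus signs renderable

# Hypothetical labels and values standing in for a statistics-bureau series.
labels = ['农业', '工业', '建筑业', '服务业']
values = [1234, 5678, 2345, 6789]

fig, ax = plt.subplots()
bars = ax.bar(labels, values)
ax.set_title('示例柱状图')                  # illustrative title
ax.set_ylabel('产值（亿元）')               # illustrative axis label
ax.tick_params(axis='x', labelrotation=45)  # one attribute variant: 45° tick text
ax.bar_label(bars)                          # optional data labels on the bars
fig.savefig('example_bar_chart.png', dpi=150, bbox_inches='tight')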

  2. Goodreads Radial Bar Chart Values

    • kaggle.com
    zip
    Updated Jul 29, 2023
    Cite
    Justinyouth (2023). Goodreads Radial Bar Chart Values [Dataset]. https://www.kaggle.com/datasets/justinyouth/goodreadsdataset/code
    Explore at:
zip (207 bytes)
    Dataset updated
    Jul 29, 2023
    Authors
    Justinyouth
    Description

    Datasets for beginners.

    The radial chart dataset for Goodreads comprises two main variables: "Value" and "Angle." Each data point represents a specific observation with corresponding values for "Value" and "Angle."

    The "Value" variable denotes the magnitude or quantity associated with each data point. In this dataset, the values range from 0 to 270, indicating diverse levels or measurements of a certain aspect related to Goodreads.

    The "Angle" variable represents the angular position of each data point on the radial chart. The angular position is crucial for plotting the data points correctly on the circular layout of the chart.

    This dataset aims to visualize and explore the distribution or relationships of the "Value" variable in a radial chart format. Radial charts, also known as circular or polar charts, are effective tools for displaying data in a circular layout, enabling quick comprehension of patterns, trends, or anomalies.

    By leveraging the radial chart visualization, users can gain insights into the distribution of values, identify potential outliers, and observe any cyclical patterns or clusters within the dataset. Additionally, the radial chart provides an intuitive representation of the dataset, allowing stakeholders to grasp the overall structure and characteristics of the data at a glance.

    The dataset's radial chart visualization can be particularly valuable for Goodreads, a popular platform for book enthusiasts, as it offers a unique perspective on various aspects of the platform's data. Further analysis and exploration of the dataset can help users make data-driven decisions, identify areas for improvement, and uncover hidden trends that may influence user engagement and experience on the platform.
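As a minimal sketch of plotting the two columns on a polar axis with Matplotlib (the CSV filename and the assumption that "Angle" is given in degrees are mine, not stated by the dataset):

import math

import matplotlib.pyplot as plt
import pandas as pd

# Filename is an assumption; the dataset exposes "Value" and "Angle" columns.
df = pd.read_csv('goodreads_radial.csv')

# Assumption: "Angle" is in degrees; polar axes expect radians.
theta = [math.radians(a) for a in df['Angle']]

ax = plt.subplot(projection='polar')
ax.bar(theta, df['Value'], width=0.1)  # narrow bars; tune width to taste
plt.show()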

3. Chart Viewer

    • city-of-lawrenceville-arcgis-hub-lville.hub.arcgis.com
    Updated Sep 22, 2021
    Cite
    esri_en (2021). Chart Viewer [Dataset]. https://city-of-lawrenceville-arcgis-hub-lville.hub.arcgis.com/items/be4582b38d764de0a970b986c824acde
    Explore at:
    Dataset updated
    Sep 22, 2021
    Dataset authored and provided by
    esri_en
    Description

Use the Chart Viewer template to display bar charts, line charts, pie charts, histograms, and scatterplots to complement a map. Include multiple charts to view with a map or side by side with other charts for comparison. Up to three charts can be viewed side by side or stacked, but you can access and view all the charts that are authored in the map.

Examples:
• Present a bar chart representing average property value by county for a given area.
• Compare charts based on multiple population statistics in your dataset.
• Display an interactive scatterplot based on two values in your dataset along with an essential set of map exploration tools.

Data requirements: The Chart Viewer template requires a map with at least one chart configured.

Key app capabilities:
• Multiple layout options - Choose Stack to display charts stacked with the map, or choose Side by side to display charts side by side with the map.
• Manage chart - Reorder, rename, or turn charts on and off in the app.
• Multiselect chart - Compare two charts in the panel at the same time.
• Bookmarks - Allow users to zoom and pan to a collection of preset extents that are saved in the map.
• Home, Zoom controls, Legend, Layer List, Search

Supportability: This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.

  4. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column 'Replicate' indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column 'Condition' indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column 'Value' contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in 'File Format', select .csv). This .csv file is the input file to import in R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

# 7. Display the graph in a separate window. Dot colors indicate replicates.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

5. Data from Tree Censuses and Inventories in Panama

    • smithsonian.figshare.com
    zip
    Updated Apr 18, 2024
    + more versions
    Cite
Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao (2024). [Dataset:] Data from Tree Censuses and Inventories in Panama [Dataset]. http://doi.org/10.5479/data.stri.2016.0622
    Explore at:
zip
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Smithsonian Tropical Research Institute
    Authors
Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Panama
    Description

Abstract: These are results from a network of 65 tree census plots in Panama. At each, every individual stem in a rectangular area of specified size is given a unique number and identified to species, then stem diameter measured in one or more censuses. Data from these numerous plots and inventories were collected following the same methods as, and species identity harmonized with, the 50-ha long-term tree census at Barro Colorado Island. The precise location of every site, its elevation, and estimated rainfall (for many sites) are also included. These data were gathered over many years, starting in 1994 and continuing to the present, by principal investigators R. Condit, R. Pérez, S. Lao, and S. Aguilar. Funding has been provided by many organizations.

Description:

marenaRecent.full.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This and all other tables labelled 'full' have one record per individual tree found in that census. Detailed documentation of the 'full' tables is given in RoutputFull.pdf (component 10 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. These are the best data to use if only a single plot census is needed.

marena2cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for 44 plots with two censuses: 'marena2cns.full1.rdata' for the first census and 'marena2cns.full2.rdata' for the second. These 44 plots are the subset of the 60 found in marenaRecent.full (component 1) that have been censused two or more times. These are the best data to use if two plot censuses are needed.

marena3cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for nine plots with three censuses: 'marena3cns.full1.rdata' for the first census through 'marena3cns.full3.rdata' for the third. These nine plots are the subset of the 44 found in marena2cns.full (component 2) that have been censused three or more times. These are the best data to use if three plot censuses are needed.

marena4cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for six plots with four censuses: 'marena4cns.full1.rdata' for the first census through 'marena4cns.full4.rdata' for the fourth. These six plots are the subset of the nine found in marena3cns.full (component 3) that have been censused four or more times. These are the best data to use if four plot censuses are needed.

marenaRecent.stem.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This one file, 'marenaRecent.stem1.rdata', has data from the latest census at 60 different plots. The table has one record per individual stem, necessary because some individual trees have more than one stem. Detailed documentation of the 'stem' tables is given in RoutputStem.pdf (component 11 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). These are the best data to use if only a single plot census is needed and individual stems are desired.

marena2cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for 44 plots with two censuses: 'marena2cns.stem1.rdata' for the first census and 'marena2cns.stem2.rdata' for the second. These 44 plots are the subset of the 60 found in marenaRecent.stem (component 5) that have been censused two or more times. These are the best data to use if two plot censuses are needed and individual stems are desired.

marena3cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for nine plots with three censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third. These nine plots are the subset of the 44 found in marena2cns.stem (component 6) that have been censused three or more times. These are the best data to use if three plot censuses are needed and individual stems are desired.

marena4cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for six plots with four censuses: 'marena4cns.stem1.rdata' for the first census through 'marena4cns.stem4.rdata' for the fourth. These six plots are the subset of the nine found in marena3cns.stem (component 7) that have been censused four or more times. These are the best data to use if four plot censuses are needed and individual stems are desired.

bci.spptable.rdata: A list of the 1414 species found across all tree plots and inventories in Panama, in R format. The column 'sp' in this table is a code identifying the species in the full census tables (marena.full and marena.stem, components 1-4 and 5-8 above).

RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (components 1-4 above).

RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (components 5-8 above).

PanamaPlot.txt: Locations of all tree plots and inventories in Panama.
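These .rdata tables load natively in R via load(); from Python, one option is the third-party pyreadr package. A minimal sketch under that assumption (pyreadr is my suggestion, not something the dataset authors mention):

import pyreadr

# Read one R Analytical Table: latest census, one record per tree.
result = pyreadr.read_r('marenaRecent.full1.rdata')  # dict-like: object name -> DataFrame
trees = next(iter(result.values()))

# Per the description, 'plot' distinguishes the many census locations.
print(trees.shape)
print(trees['plot'].drop_duplicates().head())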

6. web_graph

    • tensorflow.org
    Updated Nov 23, 2022
    Cite
    (2022). web_graph [Dataset]. http://identifiers.org/arxiv:2112.02194
    Explore at:
    Dataset updated
    Nov 23, 2022
    Description

    This dataset contains a sparse graph representing web link structure for a small subset of the Web.

It's a processed version of a single crawl performed by CommonCrawl in 2021, where we strip everything and keep only the link -> outlinks structure. The final dataset is basically in int -> List[int] format, with each integer id representing a url.

Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in sparsity pattern and locale. We took the following processing steps, in order:

• We started with the WAT files from the June 2021 crawl.
• Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
• To study locale-specific graphs, we further filter based on 2 top-level domains: 'de' and 'in', each producing a graph with an order of magnitude fewer nodes.
• These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of K ∈ [10, 50] inlinks and outlinks. Note that we only do this processing once, so this is still an approximation, i.e., the resulting graph might have nodes with fewer than K links.
• Using both locale and count filters, we finalize 6 versions of the WebGraph dataset, summarized in the following table.
Version | Top-level domain | Min count | Num nodes | Num edges
sparse | - | 10 | 365.4M | 30B
dense | - | 50 | 136.5M | 22B
de-sparse | de | 10 | 19.7M | 1.19B
de-dense | de | 50 | 5.7M | 0.82B
in-sparse | in | 10 | 1.5M | 0.14B
in-dense | in | 50 | 0.5M | 0.12B

    All versions of the dataset have following features:

    • "row_tag": a unique identifier of the row (source link).
    • "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
    • "gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.

    To use this dataset:

import tensorflow_datasets as tfds

# Load the train split and inspect a few examples.
ds = tfds.load('web_graph', split='train')
for ex in ds.take(4):
    print(ex)
    

See the guide for more information on tensorflow_datasets.

  7. Variance Analysis Project

    • kaggle.com
    zip
    Updated Jul 9, 2024
    Cite
    Sanjana Murthy (2024). Variance Analysis Project [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/variance-analysis-in-excel
    Explore at:
zip (40666 bytes)
    Dataset updated
    Jul 9, 2024
    Authors
    Sanjana Murthy
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

About the dataset:

Domain: Finance
Project: Variance Analysis
Dataset: Budget vs Actuals
Dataset Type: Excel Data
Dataset Size: 482 records

KPIs:
1. Total Income
2. Total Expenses
3. Total Savings
4. Budget vs Actual Income
5. Actual Expenses Breakdown

Process:
1. Understanding the problem
2. Data Collection
3. Exploring and analyzing the data
4. Interpreting the results

The workbook includes a dynamic dashboard, data validation, INDEX/MATCH, SUMIFS, conditional formatting, IF conditions, a column chart, and a pie chart.
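For readers reproducing the analysis outside Excel, a minimal pandas sketch of a budget-vs-actual variance calculation (the file and column names are assumptions; the actual workbook layout may differ):

import pandas as pd

# Assumed layout: one row per line item with 'Category', 'Budget', and
# 'Actual' columns.
df = pd.read_excel('budget_vs_actuals.xlsx')

df['Variance'] = df['Actual'] - df['Budget']
df['Variance %'] = 100 * df['Variance'] / df['Budget']

# A SUMIFS-style rollup: total variance per category.
print(df.groupby('Category')['Variance'].sum())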

8. Graph topological features extracted from expression profiles of neuroblastoma patients

    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C (2020). Graph topological features extracted from expression profiles of neuroblastoma patients [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3357673
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Luxembourg Institute of Health
    Nanyang Technological University
    Authors
    Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains the data described in the paper titled "A deep neural network approach to predicting clinical outcomes of neuroblastoma patients." by Tranchevent, Azuaje and Rajapakse. More precisely, this dataset contains the topological features extracted from graphs built from publicly available expression data (see details below). This dataset does not contain the original expression data, which are available elsewhere. We thank the scientists who did generate and share these data (please see below the relevant links and publications).

    Content

    File names start with the name of the publicly available dataset they are built on (among "Fischer", "Maris" and "Versteeg"). This name is followed by a tag representing whether they contain raw data ("raw", which means, in this case, the raw topological features) or TF formatted data ("TF", which stands for TensorFlow). This tag is then followed by a unique identifier representing a unique configuration. The configuration file "Global_configuration.tsv" contains details about these configurations such as which topological features are present and which clinical outcome is considered.

The code associated with the same manuscript that uses these data is at https://gitlab.com/biomodlih/SingalunDeep. The procedure by which the raw data are transformed into the TensorFlow-ready data is described in the paper.

    File format

    All files are TSV files that correspond to matrices with samples as rows and features as columns (or clinical data as columns for clinical data files). The data files contain various sets of topological features that were extracted from the sample graphs (or Patient Similarity Networks - PSN). The clinical files contain relevant clinical outcomes.

    The raw data files only contain the topological data. For instance, the file "Fischer_raw_2d0000_data_tsv" contains 24 values for each sample corresponding to the 12 centralities computed for both the microarray (Fischer-M) and RNA-seq (Fischer-R) datasets. The TensorFlow ready files do not contain the sample identifiers in the first column. However, they contain two extra columns at the end. The first extra column is the sample weights (for the classifiers and because we very often have a dominant class). The second extra column is the class labels (binary), based on the clinical outcome of interest.
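As a sketch of consuming one of the TensorFlow-ready TSV files under the layout just described (the filename is illustrative, modelled on the naming scheme above; a header row is assumed):

import pandas as pd

# TF-ready files: no sample-identifier column; the last two columns are the
# sample weight and the binary class label (per the description above).
df = pd.read_csv('Fischer_TF_2d0000_data.tsv', sep='\t')

features = df.iloc[:, :-2]  # topological features
weights = df.iloc[:, -2]    # sample weights (dominant-class correction)
labels = df.iloc[:, -1]     # binary clinical-outcome labels

print(features.shape, labels.value_counts().to_dict())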

    Dataset details

The Fischer dataset is used to train, evaluate, and validate the models, so it is split into train / eval / valid files, which contain respectively 249, 125, and 124 rows (samples) of the original 498 samples. In contrast, the other two datasets (Maris and Versteeg) are smaller and are only used for validation (and therefore have no training or evaluation file).

The Fischer dataset also has more data files because various configurations were tested (see manuscript). In contrast, the validation using the Maris and Versteeg datasets is only done for a single configuration, and there are therefore fewer files.

    For Fischer, a few configurations are listed in the global configuration file but there is no corresponding raw data. This is because these items are derived from concatenations of the original raw data (see global configuration file and manuscript for details).

    References

This dataset is associated with Tranchevent L., Azuaje F., Rajapakse J.C., A deep neural network approach to predicting clinical outcomes of neuroblastoma patients.

    If you use these data in your research, please do not forget to also cite the researchers who have generated the original expression datasets.

    Fischer dataset:

    Zhang W. et al., Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16(1) (2015). doi:10.1186/s13059-015-0694-1

    Wang C. et al., The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat. Biotechnol. 32(9), 926–932. doi:10.1038/nbt.3001

    Versteeg dataset:

    Molenaar J.J. et al., Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483(7391), 589–593. doi:10.1038/nature10910

    Maris dataset:

    Wang Q. et al., Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res. 66(12), 6050–6062. doi:10.1158/0008-5472.CAN-05-4618

9. Barro Colorado Forest Census Plot Data (Version 2012)

    • smithsonian.figshare.com
    • search.dataone.org
    pdf
    Updated Apr 18, 2024
    Cite
Richard Condit; Suzanne Lao; Rolando Pérez; Steven B. Dolins; Robin Foster; Stephen Hubbell (2024). [Dataset:] Barro Colorado Forest Census Plot Data (Version 2012) [Dataset]. http://doi.org/10.5479/data.bci.20130603
    Explore at:
pdf
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Smithsonian Tropical Research Institute
    Authors
Richard Condit; Suzanne Lao; Rolando Pérez; Steven B. Dolins; Robin Foster; Stephen Hubbell
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Barro Colorado Island
    Description

Abstract: The 50-hectare plot at Barro Colorado Island, Panama, is a 1000-meter by 500-meter rectangle of forest inside of which all woody trees and shrubs with stems at least 1 cm in stem diameter have been censused. Every individual tree in the 50 hectares was permanently numbered with an aluminum tag in 1982, and every individual has been revisited six times since (in 1985, 1990, 1995, 2000, 2005, and 2010). In each census, every tree was measured, mapped, and identified to species. Details of the census method are presented in Condit (Tropical Forest Census Plots: Methods and Results from Barro Colorado Island, Panama and a Comparison with Other Plots; Springer-Verlag, 1998), and a description of the seven-census results in Condit, Chisholm, and Hubbell (Thirty Years of Forest Census at Barro Colorado and the Importance of Immigration in Maintaining Diversity; PLoS ONE 7:e49826, 2012).

Description:

CITATION TO DATABASE: Condit, R., Lao, S., Pérez, R., Dolins, S.B., Foster, R.B., Hubbell, S.P. 2012. Barro Colorado Forest Census Plot Data, 2012 Version. DOI http://dx.doi.org/10.5479/data.bci.20130603

CO-AUTHORS: Stephen Hubbell and Richard Condit have been principal investigators of the project for over 30 years. They are fully responsible for the field methods and data quality. As such, both request that data users contact them and invite them to be co-authors on publications relying on the data. More recent versions of the data, often with important updates, can be requested directly from R. Condit (conditr@gmail.com).

ACKNOWLEDGMENTS: The following should be acknowledged in publications for contributions to the 50-ha plot project: R. Foster as plot founder and the first botanist able to identify so many trees in a diverse forest; R. Pérez and S. Aguilar for species identification; S. Lao for data management; S. Dolins for database design; plus hundreds of field workers for the census work, now over 2 million tree measurements; the National Science Foundation, Smithsonian Tropical Research Institute, and MacArthur Foundation for the bulk of the financial support.

File 1. RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (File 5).

File 2. RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (File 7).

File 3. ViewFullTable.zip: A zip archive with a single ASCII text file named ViewFullTable.txt holding a table with all census data from the BCI 50-ha plot. Each row is a single measurement of a single stem, with columns indicating the census, date, species name, plus tree and stem identifiers; all seven censuses are included. A full description of all columns in the table can be found at http://dx.doi.org/10.5479/data.bci.20130604 (ViewFullTable, pp. 21-22 of the pdf).

File 4. ViewTax.txt: An ASCII text table with information on all tree species recorded in the 50-ha plot. There are columns with taxonomic names (family, genus, species, and subspecies), plus the taxonomic authority. The column 'Mnemonic' gives a shortened code identifying each species, a code used in the R tables (Files 5, 7). The column 'IDLevel' indicates the depth to which the species is identified: if IDLevel='species', it is fully identified, but if IDLevel='genus', the genus is known but not the species. IDLevel can also be 'family', or 'none' in case the species is not even known to family.

File 5. bci.full.Rdata31Aug2012.zip: A zip archive holding seven R Analytical Tables, versions of the BCI 50-ha plot census data in R format, designed for data analysis. There is one file for each of the seven censuses: 'bci.full1.rdata' for the first census through 'bci.full7.rdata' for the seventh. Each file is a table having one record per individual tree, and each includes a record for every tree found over the entire seven censuses (i.e., whether or not the tree was observed alive in the given census, there is a record). Detailed documentation of these tables is given in RoutputFull.pdf (File 1).

File 6. bci.spptable.rdata: A list of the 1064 species found across all tree plots and inventories in Panama, in R format. This is a superset of the species found in the BCI censuses: every BCI species is included, plus additional species never observed at BCI. The column 'sp' in this table is a code identifying the species in the R census tables (Files 5, 7), matching 'Mnemonic' in ViewFullTable (File 3).

File 7. bci.stem.Rdata31Aug2012.zip: A zip archive holding seven R Analytical Tables, versions of the BCI 50-ha plot census data in R format, designed for data analysis. There is one file for each of the seven censuses: 'bci.stem1.rdata' for the first census through 'bci.stem7.rdata' for the seventh. Each file is a table having one record per individual stem, necessary because some individual trees have more than one stem. Each includes a record for every stem found over the entire seven censuses (i.e., whether or not the stem was observed alive in the given census, there is a record). Detailed documentation of these tables is given in RoutputStem.pdf (File 2).

File 8. TSMAttributes.txt: An ASCII text table giving full descriptions of measurement codes, also referred to as TSMCodes. These short codes are used in the column 'code' in the R tables and in the column 'ListOfTSM' in ViewFullTable.txt, in both cases with individual codes separated by commas.

File 9. bci_31August2012_mysql.zip: A zip archive holding one file, 'bci.sql', which is a mysqldump of the complete MySQL database (version 5.0.95, http://www.mysql.com) created 31 August 2012. The database includes data collected from seven censuses of the BCI 50-ha plot, censuses of many additional plots elsewhere in Panama, plus transects where only species identifications were collected and trees were neither tagged nor measured. Detailed documentation of all tables within the database can be found at http://dx.doi.org/10.5479/data.bci.20130604. This version of the data is intended for experienced SQL users; for most, the R Analytical Tables in Rtables.zip are more useful.
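For readers working outside SQL and R, a sketch of loading File 3 with pandas (the tab delimiter is an assumption; the linked column documentation gives the authoritative schema):

import zipfile

import pandas as pd

# File 3: ViewFullTable.txt, one row per stem measurement across all censuses.
with zipfile.ZipFile('ViewFullTable.zip') as zf:
    with zf.open('ViewFullTable.txt') as f:
        full = pd.read_csv(f, sep='\t', low_memory=False)  # delimiter assumed

print(full.columns.tolist())
print(len(full), 'stem measurements')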

10. A study on real graphs of fake news spreading on Twitter

    • data.niaid.nih.gov
    Updated Aug 20, 2021
    Cite
    Amirhosein Bodaghi (2021). A study on real graphs of fake news spreading on Twitter [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3711599
    Explore at:
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Federal University of Rio de Janeiro
    Authors
    Amirhosein Bodaghi
    Description

    *** Fake News on Twitter ***

These five datasets are the results of an empirical study on the spreading process of newly emerged fake news on Twitter. In particular, we focused on fake news stories that gave rise to a simultaneous spread of the truth against them. The story of each fake news item is as follows:

    1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.

    2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."

    3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.

    4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.

    5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.

The data collection was done in two stages, each of which provided a new dataset: 1) attaining the Dataset of Diffusion (DD), which includes information on fake-news/truth tweets and retweets; 2) querying the neighbors of tweet spreaders, which provides the Dataset of Graph (DG).

    DD

DD for each fake news story is an Excel file, named FNx_DD where x is the number of the fake news item, with the following structure:

Each row belongs to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about the tweet/retweet. From left to right, the columns are:

    User ID (user who has posted the current tweet/retweet)

    The description sentence in the profile of the user who has published the tweet/retweet

The number of tweets/retweets the user had published at the time of posting the current tweet/retweet

    Date and time of creation of the account by which the current tweet/retweet has been posted

    Language of the tweet/retweet

    Number of followers

    Number of followings (friends)

    Date and time of posting the current tweet/retweet

Number of likes (favorites) the current tweet had acquired before crawling it

Number of times the current tweet had been retweeted before crawling it

Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)

The source (device/OS) from which the current tweet/retweet was posted

    Tweet/Retweet ID

    Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)

    Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)

    Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)

Frequency of tweet occurrences, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)

    State of the tweet which can be one of the following forms (achieved by an agreement between the annotators):

    r : The tweet/retweet is a fake news post

    a : The tweet/retweet is a truth post

q : The tweet/retweet is a question about the fake news; it neither confirms nor denies it

n : The tweet/retweet is not related to the fake news (it matches the queries related to the rumor but does not refer to the given fake news)

    DG

DG for each fake news item contains two files:

A file in graph format (.graph), which includes the information of the graph, such as who is linked to whom (named FNx_DG.graph, where x is the number of the fake news item).

A file in JSON Lines format (.jsonl), which includes the real user IDs of the nodes in the graph file (named FNx_Labels.jsonl, where x is the number of the fake news item).

In the graph file, the label of each node is the order of its entrance into the graph. For example, if the node with user ID 12345637 is the first node entered into the graph file, then its label in the graph is 0 and its real ID (12345637) is at row number 1 of the jsonl file (row number 0 holds the column labels); the IDs of the other nodes occupy the following rows, one user ID per row. Therefore, to find the user ID of the node labeled 200 in the graph, look at row number 201 of the jsonl file.
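A minimal sketch of recovering the node-label-to-user-ID mapping from a labels file (the FN1 filename follows the naming scheme above; one ID per row after the header row is assumed):

import json

# Build node label -> user ID from FNx_Labels.jsonl.
# Assumption: row 0 holds column labels; node k's user ID is on row k + 1.
with open('FN1_Labels.jsonl') as f:
    rows = [json.loads(line) for line in f]

user_ids = rows[1:]    # drop the header row
print(user_ids[200])   # user ID of the node labeled 200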

The user IDs of spreaders in DG (those who have a post in DD) are available in DD, so extra information about them and their tweets/retweets can be retrieved there. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.

11. Dataset from: Browsing is a strong filter for savanna tree seedlings in their first growing season

    • data.niaid.nih.gov
    Updated Oct 1, 2021
    Cite
Archibald, Sally; Wayne Twine; Craddock Mthabini; Nicola Stevens (2021). Dataset from: Browsing is a strong filter for savanna tree seedlings in their first growing season [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4972083
    Explore at:
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    Centre for African Ecology, School of Animal Plant and Environmental Sciences, University of Witwatersrand, Johannesburg, South Africa AND Environmental Change Institute, School of Geography and the Environment, University of Oxford, Oxford OX1 3QY, United Kingdom
    School of Animal Plant and Environmental Sciences, University of Witwatersrand, Johannesburg, South Africa
    Centre for African Ecology, School of Animal Plant and Environmental Sciences, University of Witwatersrand, Johannesburg, South Africa
    Authors
    Archibald, Sally; Wayne Twine; Craddock Mthabini; Nicola Stevens
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data presented here were used to produce the following paper:

    Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.

    The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588

    For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za

    Description of file(s):

File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")

    The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.

    The data consist of one .csv file with the following column names:

treatment - Clipping treatment (1-5 months clip, plus control unclipped)
plot_rep - One of three randomised plots per treatment
matrix_no - Where in the plot the individual was placed
species_code - First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species - Full species name
sample_period - Classification of sampling period into time since clip
status - Alive or Dead
standing.height - Vertical height above ground (in mm)
height.mm - Length of the longest branch (in mm)
total.branch.length - Total length of all the branches (in mm)
stemdiam.mm - Basal stem diameter (in mm)
maxSpineLength.mm - Length of the longest spine
postclipStemNo - Number of resprouting stems (only recorded AFTER clipping)
date.clipped - Date clipped
date.measured - Date measured
date.germinated - Date germinated
Age.of.plant - Date measured minus date germinated
newtreat - Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)

File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")

    The data consist of one .csv file with the following column names:

treatment - Clipping treatment (1-5 months clip, plus control unclipped)
plot_rep - One of three randomised plots per treatment
matrix_no - Where in the plot the individual was placed
species_code - First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species - Full species name
sample_period - Classification of sampling period into time since clip
status - Alive or Dead
standing.height - Vertical height above ground (in mm)
height.mm - Length of the longest branch (in mm)
total.branch.length - Total length of all the branches (in mm)
stemdiam.mm - Basal stem diameter (in mm)
maxSpineLength.mm - Length of the longest spine
postclipStemNo - Number of resprouting stems (only recorded AFTER clipping)
date.clipped - Date clipped
date.measured - Date measured
date.germinated - Date germinated
Age.of.plant - Date measured minus date germinated
newtreat - Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
genus - Genus
MAR - Mean Annual Rainfall for that species' distribution (mm)
rainclass - High/medium/low

File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

Age.of.plant - Age in days
species_code - Species code
pred_SD_mm - Predicted stem diameter in mm
pred_SD_up - Top 75th quantile of stem diameter in mm
pred_SD_low - Bottom 25th quantile of stem diameter in mm
treatdate - Date when clipped
pred_surv - Predicted survival probability
pred_surv_low - Predicted 25th quantile survival probability
pred_surv_high - Predicted 75th quantile survival probability
Bite.probability - Daily probability of being eaten
max_bite_diam_duiker_mm - Maximum bite diameter of a duiker for this species
duiker_sd - Standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm - Maximum bite diameter of a kudu for this species
kudu_sd - Standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm - Mean bite diameter of a duiker for this species
duiker_mean_sd - Standard deviation of the mean bite diameter for a duiker
mean_bite_diameter_kudu_mm - Mean bite diameter of a kudu for this species
kudu_mean_sd - Standard deviation of the mean bite diameter for a kudu
genus - Genus
rainclass - Low/med/high

File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

shtspec - Species name
species_code - Species code
genus - Genus
rainclass - Low/medium/high
seed mass - Mass of seed (g per 1000 seeds)
Surv_intercept - Coefficient of the model predicting survival from age of clip for this species
Surv_slope - Coefficient of the model predicting survival from age of clip for this species
GR_intercept - Coefficient of the model predicting stem diameter from seedling age for this species
GR_slope - Coefficient of the model predicting stem diameter from seedling age for this species
max_bite_diam_duiker_mm - Maximum bite diameter of a duiker for this species
duiker_sd - Standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm - Maximum bite diameter of a kudu for this species
kudu_sd - Standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm - Mean bite diameter of a duiker for this species
duiker_mean_sd - Standard deviation of the mean bite diameter for a duiker
mean_bite_diameter_kudu_mm - Mean bite diameter of a kudu for this species
kudu_mean_sd - Standard deviation of the mean bite diameter for a kudu
AgeAtEscape_duiker[t] - Age of plant when its stem diameter is larger than a mean duiker bite
AgeAtEscape_duiker_min[t] - Age of plant when its stem diameter is larger than a minimum duiker bite
AgeAtEscape_duiker_max[t] - Age of plant when its stem diameter is larger than a maximum duiker bite
AgeAtEscape_kudu[t] - Age of plant when its stem diameter is larger than a mean kudu bite
AgeAtEscape_kudu_min[t] - Age of plant when its stem diameter is larger than a minimum kudu bite
AgeAtEscape_kudu_max[t] - Age of plant when its stem diameter is larger than a maximum kudu bite

  12. Spotify Charts (All Audio Data)

    • kaggle.com
    zip
    Updated Apr 15, 2024
    + more versions
    Cite
    Sunny Kakar (2024). Spotify Charts (All Audio Data) [Dataset]. https://www.kaggle.com/datasets/sunnykakar/spotify-charts-all-audio-data
    Explore at:
zip (3050767414 bytes)
    Dataset updated
    Apr 15, 2024
    Authors
    Sunny Kakar
    License

Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Content

This is a complete dataset of all the "Top 200" and "Viral 50" charts published globally by Spotify. Spotify publishes a new chart every 2-3 days. This is its entire collection since January 1, 2019. This dataset is a continuation of the Kaggle dataset Spotify Charts, but contains 29 columns for each row, populated using the Spotify API.

    Note

    The value of streams is NULL when the chart column is "viral50".
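A sketch of how one might account for that convention when loading the data with pandas (the filename and the 'top200' value are assumptions mirroring the chart names above):

import pandas as pd

df = pd.read_csv('spotify_charts.csv')  # filename assumed

# Per the note above, 'streams' is NULL whenever chart == 'viral50'.
top200 = df[df['chart'] == 'top200']    # 'top200' value is an assumption
viral50 = df[df['chart'] == 'viral50']

print(top200['streams'].notna().mean(), viral50['streams'].isna().mean())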

    Acknowledgment

    Base Dataset: Spotify Charts

    Photo by Alexander Shatov on Unsplash

13. Tough Tables: Carefully Evaluating Entity Linking for Tabular Data

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 14, 2023
    Cite
    Cutrona, Vincenzo; Bianchi, Federico; Jiménez-Ruiz, Ernesto; Palmonari, Matteo (2023). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3840646
    Explore at:
    Dataset updated
    Jan 14, 2023
    Dataset provided by
    University of Milano - Bicocca
    Bocconi University
    City, University of London
    Authors
    Cutrona, Vincenzo; Bianchi, Federico; Jiménez-Ruiz, Ernesto; Palmonari, Matteo
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tough Tables (2T) is a dataset designed to evaluate table annotation approaches in solving the CEA and CTA tasks. The dataset is compliant with the data format used in SemTab 2019, and it can be used as an additional dataset without any modification. The target knowledge graph is DBpedia 2016-10. Check out the 2T GitHub repository for more details about the dataset generation.

    New in v2.0: We release the updated version of 2T_WD! The target knowledge graph is Wikidata (online instance) and the dataset complies with the SemTab 2021 data format.

    This work is based on the following paper:

    Cutrona, V., Bianchi, F., Jimenez-Ruiz, E. and Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. ISWC 2020, LNCS 12507, pp. 1–16.

Note on License: This dataset includes data from the following sources. Refer to each source for license details:
• Wikipedia https://www.wikipedia.org/
• DBpedia https://dbpedia.org/
• Wikidata https://www.wikidata.org/
• SemTab 2019 https://doi.org/10.5281/zenodo.3518539
• GeoDatos https://www.geodatos.net
• The Pudding https://pudding.cool/
• Offices.net https://offices.net/
• DATA.GOV https://www.data.gov/

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Changelog:

    v2.0

    New GT for 2T_WD

    A few entities have been removed from the CEA GT, because they are no longer represented in WD (e.g., dbr:Devonté points to wd:Q21155080, which does not exist)

Table codes and values differ from the previous version because of the random noise.

    Updated ancestor/descendant hierarchies to evaluate CTA.

    v1.0

    New Wikidata version (2T_WD)

    Fix header for tables CTRL_DBP_MUS_rock_bands_labels.csv and CTRL_DBP_MUS_rock_bands_labels_NOISE2.csv (column 2 was reported with id 1 in target - NOTE: the affected column has been removed from the SemTab2020 evaluation)

    Remove duplicated entries in tables

    Remove rows with wrong values (e.g., the Kazakhstan entity has an empty name "''")

    Many rows and noised columns are shuffled/changed due to the random noise generator algorithm

    Remove row "Florida","Floorida","New York, NY" from TOUGH_WEB_MISSP_1000_us_cities.csv (and all its NOISE1 variants)

    Fix header of tables:

    CTRL_WIKI_POL_List_of_current_monarchs_of_sovereign_states.csv

    CTRL_WIKI_POL_List_of_current_monarchs_of_sovereign_states_NOISE2.csv

    TOUGH_T2D_BUS_29414811_2_4773219892816395776_videogames_developers.csv

    TOUGH_T2D_BUS_29414811_2_4773219892816395776_videogames_developers_NOISE2.csv

    v0.1-pre

    First submission. It contains only tables, without GT and Targets.

14. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Explore at:
xlsx
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
A. a sequential student ID
B. an ID that defines a random group label and the notation
C. the used notation: User Story or Use Case
D. the case they were assigned to: IFA, Sim, or Hos
E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
F. a categorical representation of the grade as L/M/H, where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
G. the total number of classes in the student's conceptual model
H. the total number of relationships in the student's conceptual model
I. the total number of classes in the expert's conceptual model
J. the total number of relationships in the expert's conceptual model
K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
P. the researchers' judgement of how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present

Tagging scheme:

Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;

Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");

System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent the legacy system or the system under design (portal, simulator) are legitimate;

Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;

Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

All the calculations and information provided in the following sheets originate from that raw data.

Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process-derivation-rigor category, and per exam-grade category.

Sheet 3 (Size-Ratio): The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes; however, we also provide the size ratio for the number of relationships between student and expert model.

Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
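Since both ratios are fully specified above, they are easy to recompute outside Excel; a minimal Python sketch of the two formulas (the function names are mine):

def correctness(al: int, om: int, so: int, wr: int) -> float:
    # Share of student-model classes fully aligned with the expert model.
    return al / (al + om + so + wr)

def completeness(al: int, om: int, wr: int) -> float:
    # Share of expert-model classes represented (correctly or not) by the student.
    return (al + wr) / (al + wr + om)

# Example: 12 aligned, 3 omitted, 2 system-oriented, 3 wrongly represented.
print(correctness(12, 3, 2, 3))  # 12 / 20 = 0.6
print(completeness(12, 3, 3))    # (12 + 3) / 18 ≈ 0.833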

    For Sheet 4 as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:
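
    The workbook delegates the effect size to the online calculator above; an equivalent computation can be sketched in Python (group_a and group_b are illustrative score vectors, not the study's data):

        import numpy as np
        from scipy import stats

        def hedges_g(a, b):
            a, b = np.asarray(a, float), np.asarray(b, float)
            n1, n2 = len(a), len(b)
            pooled = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                             / (n1 + n2 - 2))
            d = (a.mean() - b.mean()) / pooled        # Cohen's d
            return d * (1 - 3 / (4 * (n1 + n2) - 9))  # small-sample correction

        group_a = [0.81, 0.74, 0.90, 0.66]  # e.g. completeness under one notation
        group_b = [0.62, 0.70, 0.55, 0.68]  # e.g. completeness under the other
        t, p = stats.ttest_ind(group_a, group_b)
        print(t, p, hedges_g(group_a, group_b))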

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by exam grade, converted to the categorical values High, Medium, and Low.

  15. Summary for Policymakers of the Working Group I Contribution to the IPCC...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 1, 2021
    Cite
    (2021). Summary for Policymakers of the Working Group I Contribution to the IPCC Sixth Assessment Report - data for Figure SPM.4 (v20210809) [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=scenarios
    Explore at:
    Dataset updated
    Jul 1, 2021
    Description

    Data for Figure SPM.4 from the Summary for Policymakers (SPM) of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6). Figure SPM.4 panel a shows global emissions projections for CO2 and a set of key non-CO2 climate drivers, for the core set of five IPCC AR6 scenarios. Figure SPM.4 panel b shows attributed warming in 2081-2100 relative to 1850-1900 for total anthropogenic forcing, CO2, other greenhouse gases, and other anthropogenic forcings, for five Shared Socio-economic Pathway (SSP) scenarios.

    How to cite this dataset: When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates: IPCC, 2021: Summary for Policymakers. In: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 3−32, doi:10.1017/9781009157896.001.

    Figure subpanels: The figure has two panels, with data provided for both panels in subdirectories named panel_a and panel_b.

    List of data provided: This dataset contains:
    - Projected emissions from 2015 to 2100 for the five scenarios of the AR6 WGI core scenario set (SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5)
    - Projected warming for all anthropogenic forcers, CO2 only, non-CO2 greenhouse gases (GHGs) only, and other anthropogenic components, for 2081-2100 relative to 1850-1900, for the same five scenarios
    The five illustrative SSP (Shared Socio-economic Pathway) scenarios are described in Box SPM.1 of the Summary for Policymakers and Section 1.6.1.1 of Chapter 1.

    Data provided in relation to figure:
    Panel a: the first column contains the years; the remaining columns contain the data per scenario and per climate forcer for the line graphs.
    - Carbon_dioxide_Gt_CO2_yr.csv relates to the carbon dioxide emissions panel
    - Methane_Mt_CO2_yr.csv relates to the methane emissions panel
    - Nitrous_oxide_Mt N2O_yr.csv relates to the nitrous oxide emissions panel
    - Sulfur_dioxide_Mt SO2_yr.csv relates to the sulfur dioxide emissions panel
    Panel b:
    - ts_warming_ranges_1850-1900_base_panel_b.csv: rows 2-5 relate to the first bar chart (cyan), rows 6-9 to the second (blue), rows 10-13 to the third (orange), rows 14-17 to the fourth (red), and rows 18-21 to the fifth (brown)

    Sources of additional information: The following weblinks are provided in the Related Documents section of this catalogue record:
    - Link to the report webpage, which includes the report component containing the figure (Summary for Policymakers) and the Supplementary Material for Chapter 1, which contains details on the input data used in Table 1.SM.1 (Cross-Chapter Box 1.4, Figure 2).
    - Link to the related publication for input data used in panel a.
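
    A minimal sketch of reading one of the panel-a files with pandas; the filename comes from the list above, while the internal layout (years in the first column, one column per scenario) is an assumption:

        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("panel_a/Carbon_dioxide_Gt_CO2_yr.csv")
        years = df.iloc[:, 0]
        for scenario in df.columns[1:]:
            plt.plot(years, df[scenario], label=scenario)
        plt.ylabel("Gt CO2 / yr")
        plt.legend()
        plt.show()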

  16. Graph-Based Social Media Data on Mental Health Topics

    • data.mendeley.com
    Updated Nov 4, 2024
    + more versions
    Cite
    Samuel Ady Sanjaya (2024). Graph-Based Social Media Data on Mental Health Topics [Dataset]. http://doi.org/10.17632/z45txpdp7f.2
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Samuel Ady Sanjaya
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is structured as a graph, where nodes represent users and edges capture their interactions, including tweets, retweets, replies, and mentions. Each node provides detailed user attributes, such as unique ID, follower and following counts, and verification status, offering insights into each user's identity, role, and influence in the mental health discourse. The edges illustrate user interactions, highlighting engagement patterns and types of content that drive responses, such as tweet impressions. This interconnected structure enables sentiment analysis and public reaction studies, allowing researchers to explore engagement trends and identify the mental health topics that resonate most with users.

    The dataset consists of three files:
    1. Edges Data: contains graph data essential for social network analysis, including fields for UserID (Source), UserID (Destination), Post/Tweet ID, and Date of Relationship. This file enables analysis of user connections without including tweet content, maintaining compliance with Twitter/X's data-sharing policies.
    2. Nodes Data: offers user-specific details relevant to network analysis, including UserID, Account Creation Date, Follower and Following counts, Verified Status, and Date Joined Twitter. This file allows researchers to examine user behavior (e.g., identifying influential users or spam-like accounts) without direct reference to tweet content.
    3. Twitter/X Content Data: contains only the raw tweet text as a single-column dataset, without associated user identifiers or metadata. By isolating the text, we ensure alignment with anonymization standards observed in similar published datasets, safeguarding user privacy in compliance with Twitter/X's data guidelines. This content is crucial for addressing the research focus on mental health discourse in social media. (References to prior Data in Brief publications involving Twitter/X data informed the dataset's structure.)
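
    A hedged sketch of loading the two network files into a directed graph with networkx; the field names follow the description above, while the filenames and the exact attribute column names are assumptions:

        import pandas as pd
        import networkx as nx

        edges = pd.read_csv("edges.csv")  # hypothetical filename for Edges Data
        nodes = pd.read_csv("nodes.csv")  # hypothetical filename for Nodes Data

        G = nx.from_pandas_edgelist(
            edges,
            source="UserID (Source)",
            target="UserID (Destination)",
            edge_attr=["Post/Tweet ID", "Date of Relationship"],
            create_using=nx.DiGraph,
        )
        for _, row in nodes.iterrows():
            G.add_node(row["UserID"],
                       followers=row["Follower"],          # assumed column name
                       verified=row["Verified Status"])

        # Users drawing the most replies/mentions, by in-degree:
        print(sorted(G.in_degree, key=lambda kv: kv[1], reverse=True)[:10])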

  17. Import Excel to Power BI

    • kaggle.com
    zip
    Updated May 15, 2022
    Cite
    Ntemis Tontikopoulos (2022). Import Excel to Power BI [Dataset]. https://www.kaggle.com/datasets/ntemistonti/excel-to-power-bi/versions/1
    Explore at:
    zip(614154 bytes)Available download formats
    Dataset updated
    May 15, 2022
    Authors
    Ntemis Tontikopoulos
    Description

    HOW TO:
    - Create a hierarchy using the category, subcategory & product fields (columns “Product Category”, “Product SubCategory” & “Product Name”).
    - Group the values of the column "Region" into 2 groups, alphabetically, based on the name of each region.

    1. Display a table which shows, for each value of the product hierarchy you created above, the total amount of sales ("Sales") and profitability ("Profit").
    2. The same information as the previous point (1), in a bar chart illustration.
    3. Display columns with the total sales amount ("Sales") for each value of the alphabetical grouping of the Region field you created. The color of each column should be derived from the corresponding total shipping cost ("Shipping Cost"). In the tooltip of the illustration, all numeric values should have a currency format.
    4. The same diagram as above (3), with the addition of a visual-level data filter that displays only the data subset related to sales with positive values for the field "Profit".
    5. The same diagram as above (3), with the addition of a visual-level data filter that displays only the data subset related to sales with negative values for the field "Profit".
    6. Map showing the total amount of sales (size of each point) as well as the total profitability (color of each point). Change the dimensions of the image.
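
    As a cross-check of points 1-3, a rough pandas equivalent (this is not Power BI; the filename and the two-way alphabetical split rule are assumptions):

        import pandas as pd

        df = pd.read_excel("sales.xlsx")  # hypothetical filename

        # Point 1: Sales and Profit per level of the product hierarchy.
        hierarchy = df.groupby(["Product Category", "Product SubCategory",
                                "Product Name"])[["Sales", "Profit"]].sum()

        # Split regions alphabetically into two groups, as in the second bullet.
        regions = sorted(df["Region"].unique())
        group_map = {r: ("Group A" if i < len(regions) / 2 else "Group B")
                     for i, r in enumerate(regions)}
        by_group = (df.assign(RegionGroup=df["Region"].map(group_map))
                      .groupby("RegionGroup")[["Sales", "Shipping Cost"]].sum())
        print(hierarchy.head(), by_group, sep="\n")
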
  18. Super Market dataset

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    Chiamaka Ndubuisi (2025). Super Market dataset [Dataset]. https://www.kaggle.com/datasets/chiamakandubuisi/super-market-dataset
    Explore at:
    zip(215497 bytes)Available download formats
    Dataset updated
    Nov 4, 2025
    Authors
    Chiamaka Ndubuisi
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Problem Statements for Data Visualization – Supermarket Sales Dataset

    1. Sales Performance Across Branches: Management wants to understand how sales performance varies across supermarket branches in Lagos, Abuja, Ogun, and Port Harcourt to identify the best-performing locations and areas that need improvement.
       Suggested visualizations: bar chart comparing total sales and profit by branch; map chart showing sales by city; KPI cards for Total Sales, Profit, and Average Transaction Value per branch.
    2. Customer Purchase Behavior: The marketing team needs insights into how different customer types (Member vs Normal) and genders influence purchase trends and average spending.
       Suggested visualizations: pie chart for customer type distribution; bar chart for average spend by gender; segmented comparison of total sales by customer type.
    3. Product Line Performance: The business wants to know which product categories drive the highest revenue, quantity sold, and customer satisfaction to optimize stock levels and marketing focus.
       Suggested visualizations: bar chart showing total sales by product line; column chart comparing average rating per product line; profit margin chart by product line.
    4. Sales Trends Over Time: The management team wants to monitor sales trends over time to identify peak periods, track seasonal variations, and plan future promotions accordingly.
       Suggested visualizations: line chart showing monthly or weekly sales trend; seasonal decomposition (sales by month); trendline showing revenue growth.
    5. Payment Method Analysis: The finance department needs to evaluate payment method usage (Cash, E-wallet, Credit Card) across cities to improve payment convenience and reduce transaction delays.
       Suggested visualizations: donut or bar chart showing share of payment methods; city-level breakdown of preferred payment type; correlation between payment method and average transaction value.
    6. Customer Satisfaction Insights: The customer experience team wants to explore how customer ratings relate to sales amount, product type, and branch performance to identify drivers of customer satisfaction.
       Suggested visualizations: scatter plot of rating vs total purchase amount; heat map of average rating by branch and product line; KPI card showing average customer rating.
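
    As one possible starting point for problem 1, a minimal matplotlib sketch; the column names ("Branch", "Total", "gross income") are guesses at the underlying file's schema:

        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("supermarket_sales.csv")  # hypothetical filename
        per_branch = df.groupby("Branch")[["Total", "gross income"]].sum()
        per_branch.plot.bar()
        plt.ylabel("Amount")
        plt.title("Total sales and profit by branch")
        plt.show()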

  19. 1000 Empirical Time series

    • figshare.com
    • bridges.monash.edu
    • +1more
    png
    Updated May 30, 2023
    + more versions
    Cite
    Ben Fulcher (2023). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
    Explore at:
    pngAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Ben Fulcher
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.

    The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat, for use in Matlab using v1.06 of hctsa. The same data are also provided in .csv format: hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and the corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv); the data of individual time series (one line per time series, as described in hctsa_timeseries-info.csv) are in hctsa_timeseries-data.csv. These .csv files were produced by running >> OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.

    The input file, INP_Empirical1000.mat, is for use with hctsa and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as >> TS_Init('INP_Empirical1000.mat');

    Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.

    See links in references for more comprehensive documentation on performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
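
    For users working outside Matlab, a minimal pandas sketch of loading the .csv exports named above (we assume hctsa_datamatrix.csv is written without a header row):

        import pandas as pd

        X = pd.read_csv("hctsa_datamatrix.csv", header=None)  # time series x features
        ts_info = pd.read_csv("hctsa_timeseries-info.csv")    # row metadata
        features = pd.read_csv("hctsa_features.csv")          # column metadata

        print(X.shape, len(ts_info), len(features))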

  20. Data from: KGCW 2023 Challenge @ ESWC 2023

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated May 17, 2023
    + more versions
    Cite
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689309
    Explore at:
    Dataset updated
    May 17, 2023
    Dataset provided by
    Universidad Politécnica de Madrid
    KU Leuven
    IDLab - Ghent University - imec
    STI Innsbruck
    Authors
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge Graph Construction Workshop 2023: challenge

    Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. However, beyond execution time, metrics such as CPU or memory usage are usually not considered when comparing knowledge graph construction systems. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU, memory usage, or a combination of these.

    Task description

    The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics necessary for this challenge, such as execution time, CPU, and memory usage, as CSV files. Moreover, information about the hardware used during the execution of the pipeline is available as well, to allow a fair comparison of different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool has already been tested with existing systems, relational databases such as MySQL and PostgreSQL, and triplestores such as Apache Jena Fuseki and OpenLink Virtuoso, which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool, or if our tool imposes technical requirements you cannot meet, please contact us directly.

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

    Number and type of joins: scaling the number of joins and the join type (1-1, N-1, 1-N, N-M).

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights in the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps for each parameter:

    The provided CSV files and SQL schema are loaded into a MySQL relational database.

    Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

    The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

    The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

    The pipeline is executed 5 times, and the median execution time of each step is calculated and reported. The run with the median execution time of each step is then reported in the baseline results with all its measured metrics. The query timeout is set to 1 hour and the knowledge graph construction timeout to 24 hours. The execution is performed with the provided tool; you can adapt the execution plans of this example pipeline to your own needs.
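
    A minimal sketch of how such per-step medians could be aggregated from the five runs; the filenames and column names are assumptions, not the challenge tool's actual output schema:

        import pandas as pd

        runs = pd.concat(pd.read_csv(f"metrics_run{i}.csv") for i in range(1, 6))
        report = runs.groupby("step")[["execution_time_s", "cpu_time_s",
                                       "max_memory_mb"]].median()
        print(report)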

    Each parameter has its own directory in the ground truth dataset with the following files:

    Input dataset as CSV.

    Mapping file as RML.

    Queries as SPARQL.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    SPARQL queries to retrieve the results for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

    Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.
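
    A rough illustration of collecting the three metrics around a single step with psutil (not the challenge's own collector; the command is a placeholder):

        import time
        import psutil

        proc = psutil.Popen(["python", "construct_kg.py"])  # placeholder step
        start = time.monotonic()
        min_mem = max_mem = proc.memory_info().rss
        cpu = 0.0
        while proc.poll() is None:
            try:
                mem = proc.memory_info().rss
                t = proc.cpu_times()
                cpu = t.user + t.system
                min_mem, max_mem = min(min_mem, mem), max(max_mem, mem)
            except psutil.NoSuchProcess:
                break
            time.sleep(0.1)
        wall = time.monotonic() - start
        print(f"wall={wall:.1f}s cpu={cpu:.1f}s mem=[{min_mem}..{max_mem}] bytes")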

    Expected output

    Duplicate values

        Scale          Number of Triples
        0 percent      2000000 triples
        25 percent     1500020 triples
        50 percent     1000020 triples
        75 percent     500020 triples
        100 percent    20 triples

    Empty values

        Scale          Number of Triples
        0 percent      2000000 triples
        25 percent     1500000 triples
        50 percent     1000000 triples
        75 percent     500000 triples
        100 percent    0 triples

    Mappings

        Scale          Number of Triples
        1TM + 15POM    1500000 triples
        3TM + 5POM     1500000 triples
        5TM + 3POM     1500000 triples
        15TM + 1POM    1500000 triples

    Properties

        Scale                 Number of Triples
        1M rows 1 column      1000000 triples
        1M rows 10 columns    10000000 triples
        1M rows 20 columns    20000000 triples
        1M rows 30 columns    30000000 triples

    Records

        Scale                 Number of Triples
        10K rows 20 columns   200000 triples
        100K rows 20 columns  2000000 triples
        1M rows 20 columns    20000000 triples
        10M rows 20 columns   200000000 triples

    Joins

    1-1 joins

        Scale          Number of Triples
        0 percent      0 triples
        25 percent     125000 triples
        50 percent     250000 triples
        75 percent     375000 triples
        100 percent    500000 triples

    1-N joins

        Scale               Number of Triples
        1-10 0 percent      0 triples
        1-10 25 percent     125000 triples
        1-10 50 percent     250000 triples
        1-10 75 percent     375000 triples
        1-10 100 percent    500000 triples
        1-5 50 percent      250000 triples
        1-10 50 percent     250000 triples
        1-15 50 percent     250005 triples
        1-20 50 percent     250000 triples

    N-1 joins

        Scale               Number of Triples
        10-1 0 percent      0 triples
        10-1 25 percent     125000 triples
        10-1 50 percent     250000 triples
        10-1 75 percent     375000 triples
        10-1 100 percent    500000 triples
        5-1 50 percent      250000 triples
        10-1 50 percent     250000 triples
        15-1 50 percent     250005 triples
        20-1 50 percent     250000 triples

    N-M joins

        Scale               Number of Triples
        5-5 50 percent      1374085 triples
        10-5 50 percent     1375185 triples
        5-10 50 percent     1375290 triples
        5-5 25 percent      718785 triples
        5-5 50 percent      1374085 triples
        5-5 75 percent      1968100 triples
        5-5 100 percent     2500000 triples
        5-10 25 percent     719310 triples
        5-10 50 percent     1375290 triples
        5-10 75 percent     1967660 triples
        5-10 100 percent    2500000 triples
        10-5 25 percent     719370 triples
        10-5 50 percent     1375185 triples
        10-5 75 percent     1968235 triples
        10-5 100 percent    2500000 triples

    GTFS Madrid Bench

    Generated Knowledge Graph

        Scale    Number of Triples
        1        395953 triples
        10       3959530 triples
        100      39595300 triples
        1000     395953000 triples

    Queries

        Query    Scale 1         Scale 10         Scale 100             Scale 1000
        Q1       58540 results   585400 results   No results available  No results available
        Q2       636 results     11998 results    125565 results        1261368 results
        Q3       421 results     4207 results     42067 results         420667 results
        Q4       13 results      130 results      1300 results          13000 results
        Q5       35 results      350 results      3500 results          35000 results