100+ datasets found
  1. Supplementary material from "Visual comparison of two data sets: Do people...

    • figshare.com
    xlsx
    Updated Mar 14, 2017
    Cite
    Robin Kramer; Caitlin Telfer; Alice Towler (2017). Supplementary material from "Visual comparison of two data sets: Do people use the means and the variability?" [Dataset]. http://doi.org/10.6084/m9.figshare.4751095.v1
    Available download formats
    xlsx
    Dataset updated
    Mar 14, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Robin Kramer; Caitlin Telfer; Alice Towler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.
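
    The two quantities varied in Experiment 2, the mean difference and the pooled standard deviation, can be computed as in the sketch below. It uses the textbook pooled-SD formula and made-up scores; the function names are illustrative, and the standardized difference (Cohen's d) is one common way to combine the two quantities, not necessarily the measure the authors used.

```python
import math
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two samples (each needs at least 2 values)."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    weighted = (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
    return math.sqrt(weighted / (n_a + n_b - 2))


def standardized_difference(a, b):
    """Mean difference expressed in pooled-SD units (Cohen's d)."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


# Made-up scores, for illustration only.
juice = [12, 14, 15, 17, 18]
water = [10, 11, 13, 14, 16]
print(mean(juice) - mean(water), pooled_sd(juice, water))
```

    The same mean difference looks more or less convincing depending on the pooled SD, which is the variability information the study asks whether viewers take into account.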

  2. Dataset for: Comparison of Two Correlated ROC Surfaces at a Given Pair of...

    • wiley.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Leonidas Bantis; Ziding Feng (2023). Dataset for: Comparison of Two Correlated ROC Surfaces at a Given Pair of True Classification Rates [Dataset]. http://doi.org/10.6084/m9.figshare.6527219.v1
    Available download formats
    xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Wiley
    Authors
    Leonidas Bantis; Ziding Feng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The receiver operating characteristic (ROC) curve is typically employed when one wants to evaluate the discriminatory capability of a continuous or ordinal biomarker in the case where two groups are to be distinguished, commonly the ‘healthy’ and the ‘diseased’. There are cases for which the disease status has three categories. Such cases employ the ROC surface, which is a natural generalization of the ROC curve for three classes. In this paper, we explore new methodologies for comparing two continuous biomarkers that refer to a trichotomous disease status, when both markers are applied to the same patients. Comparisons based on the volume under the surface have been proposed, but that measure is often not clinically relevant. Here, we focus on comparing two correlated ROC surfaces at given pairs of true classification rates, which are more relevant to patients and physicians. We propose delta-based parametric techniques, power transformations to normality, and bootstrap-based smooth nonparametric techniques to investigate the performance of an appropriate test. We evaluate our approaches through an extensive simulation study and apply them to a real data set from prostate cancer screening.

  3. Statistical Comparison of Two ROC Curves

    • figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Yaacov Petscher (2023). Statistical Comparison of Two ROC Curves [Dataset]. http://doi.org/10.6084/m9.figshare.860448.v1
    Available download formats
    xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yaacov Petscher
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Excel file performs a statistical test of whether two ROC curves differ from each other based on the area under the curve (AUC). You'll need the coefficient from the table presented in the following article to enter the correct AUC value for the comparison: Hanley JA, McNeil BJ (1983) A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148:839-843.
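
    As a rough illustration of the test from the cited Hanley and McNeil (1983) article, the sketch below computes the z statistic for two AUCs estimated on the same cases. The spreadsheet's actual layout is not described here, so the function name and the example inputs are assumptions; `r` stands for the correlation coefficient looked up in the article's table.

```python
import math


def compare_correlated_aucs(auc1, se1, auc2, se2, r):
    """Return (z, two-sided p) for the difference between two correlated AUCs.

    r is the correlation between the two AUC estimates; in Hanley and McNeil's
    method it is looked up in the table from their 1983 paper.
    """
    se_diff = math.sqrt(se1**2 + se2**2 - 2.0 * r * se1 * se2)
    z = (auc1 - auc2) / se_diff
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Illustrative inputs only; they are not taken from the dataset.
z, p = compare_correlated_aucs(auc1=0.85, se1=0.04, auc2=0.78, se2=0.05, r=0.5)
print(f"z = {z:.3f}, p = {p:.4f}")
```

    A positive correlation shrinks the standard error of the difference, which is why the coefficient from the table matters when both curves come from the same cases.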

  4. Two Distributions Comparison

    • kaggle.com
    zip
    Updated Nov 1, 2020
    Cite
    David Rodríguez Segado (2020). Two Distributions Comparison [Dataset]. https://www.kaggle.com/datasets/baetulo/two-distributions-comparison/code
    Available download formats
    zip (15517 bytes)
    Dataset updated
    Nov 1, 2020
    Authors
    David Rodríguez Segado
    Description

    Context

    There are two distributions to compare. How would you do it? Do they come from the same distribution?

    Content

    In this exercise, we would like you to find a quantitative approach to compare the two distributions shown in the figure below. The red curve has 500 samples and the turquoise curve has 10000 samples.

    The question we would like to answer: Are the samples in one distribution larger than in the other distribution? Here are some hints:

    1. Which metric would you use to compare the two distributions?

    2. Is there a graphical way to compare the distributions?

    3. How would you design a statistical test for this problem? How would you overcome the problem of the different sample size?
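
    One possible starting point for the hints above, sketched with the standard library only: a two-sample Kolmogorov-Smirnov statistic (the largest gap between the empirical CDFs) and a "probability of superiority" estimate of how often a value from one sample exceeds a value from the other. Both are normalized by sample size, so the 500 versus 10000 imbalance does not bias them. The data below are synthetic stand-ins, not the Kaggle samples, and the function names are illustrative.

```python
import random
from bisect import bisect_left, bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def prob_superiority(a, b):
    """Estimate P(A > B) + 0.5 * P(A == B) for random draws A from a, B from b."""
    b_sorted = sorted(b)
    wins = 0.0
    for x in a:
        below = bisect_left(b_sorted, x)          # values of b strictly below x
        ties = bisect_right(b_sorted, x) - below  # values of b equal to x
        wins += below + 0.5 * ties
    return wins / (len(a) * len(b))


# Synthetic stand-ins for the two samples (500 and 10000 values).
rng = random.Random(0)
small = [rng.gauss(0.3, 1.0) for _ in range(500)]
large = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(ks_statistic(small, large), prob_superiority(small, large))
```

    A probability of superiority near 0.5 suggests neither sample tends to be larger. A permutation test that reshuffles the pooled values is one way to attach a p-value to either statistic.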

  5. Data from: Dataset for the comparison of two Computational Thinking (CT)...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Dec 1, 2022
    Cite
    Laila El-Hamamsy; María Zapata-Cáceres; Pedro Marcelino; Jessica Dehler Zufferey; Barbara Bruno; Estefanía Martín Barroso; Marcos Román-González (2022). Dataset for the comparison of two Computational Thinking (CT) test for upper primary school (grades 3-4) : the Beginners' CT test (BCTt) and the competent CT test (cCTt) [Dataset]. http://doi.org/10.5281/zenodo.5885034
    Available download formats
    csv, bin
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laila El-Hamamsy; María Zapata-Cáceres; Pedro Marcelino; Jessica Dehler Zufferey; Barbara Bruno; Estefanía Martín Barroso; Marcos Román-González
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains quantitative student data acquired during the administration of two validated Computational Thinking (CT) assessments for upper primary school (grades 3 and 4): the Beginners' CT test (BCTt) [1] and the competent CT test (cCTt) [2].

    To compare the psychometric properties of both instruments a comparative analysis was conducted with data acquired in schools in Portugal from the same school districts. More specifically, we analyse the results of:

    - the BCTt test administered in March 2020 to 374 students in grades 3-4,

    - the cCTt test administered in April 2021 to 201 different students in grades 3-4.

    These students had no prior experience in Computational Thinking, as this was not part of the national curriculum at the times of administration.

    The detailed psychometric comparison is published in Frontiers in Psychology - Educational Psychology [3] and provides indications regarding the use of both instruments for grades 3-4.

    A README is included and provides additional information regarding:

    - the requirements for re-use.

    - the specific content of the 2 csv files

    The BCTt is available upon request to maria.zapata@urjc.es and the cCTt items are available in [2] with an editable version being available upon request to laila.elhamamsy@epfl.ch.

    In case of other inquiries, please contact: laila.elhamamsy@epfl.ch, maria.zapata@urjc.es or pedro.marcelino@treetree2.org

    References

    [1] M. Zapata-Cáceres, E. Martín-Barroso and M. Román-González, "Computational Thinking Test for Beginners: Design and Content Validation," 2020 IEEE Global Engineering Education Conference (EDUCON), 2020, pp. 1905-1914, doi: 10.1109/EDUCON45650.2020.9125368.

    [2] El-Hamamsy, L., Zapata-Cáceres, M., Barroso, E. M., Mondada, F., Zufferey, J. D., & Bruno, B. (2022). The Competent Computational Thinking Test: Development and Validation of an Unplugged Computational Thinking Test for Upper Primary School. Journal of Educational Computing Research, 60(7), 1818–1866. https://doi.org/10.1177/07356331221081753

    [3] Laila El-Hamamsy* , María Zapata-Cáceres, Pedro Marcelino, Jessica Dehler Zufferey, Barbara Bruno, Estefanía Martín-Barroso and Marcos Román-González (2022). Comparing the psychometric properties of two primary school Computational Thinking (CT) assessments for grades 3 and 4: the Beginners' CT test (BCTt) and the competent CT test (cCTt). Front. Psychol. doi:10.3389/fpsyg.2022.1082659

  6. Statistical Analysis of Individual Participant Data Meta-Analyses: A...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    tiff
    Updated Jun 8, 2023
    Cite
    Gavin B. Stewart; Douglas G. Altman; Lisa M. Askie; Lelia Duley; Mark C. Simmonds; Lesley A. Stewart (2023). Statistical Analysis of Individual Participant Data Meta-Analyses: A Comparison of Methods and Recommendations for Practice [Dataset]. http://doi.org/10.1371/journal.pone.0046042
    Available download formats
    tiff
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Gavin B. Stewart; Douglas G. Altman; Lisa M. Askie; Lelia Duley; Mark C. Simmonds; Lesley A. Stewart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Individual participant data (IPD) meta-analyses that obtain “raw” data from studies rather than summary data typically adopt a “two-stage” approach to analysis whereby IPD within trials generate summary measures, which are combined using standard meta-analytical methods. Recently, a range of “one-stage” approaches which combine all individual participant data in a single meta-analysis have been suggested as providing a more powerful and flexible approach. However, they are more complex to implement and require statistical support. This study uses a dataset to compare “two-stage” and “one-stage” models of varying complexity, to ascertain whether results obtained from the approaches differ in a clinically meaningful way.

    Methods and Findings: We included data from 24 randomised controlled trials, evaluating antiplatelet agents, for the prevention of pre-eclampsia in pregnancy. We performed two-stage and one-stage IPD meta-analyses to estimate overall treatment effect and to explore potential treatment interactions whereby particular types of women and their babies might benefit differentially from receiving antiplatelets. Two-stage and one-stage approaches gave similar results, showing a benefit of using anti-platelets (Relative risk 0.90, 95% CI 0.84 to 0.97). Neither approach suggested that any particular type of women benefited more or less from antiplatelets. There were no material differences in results between different types of one-stage model.

    Conclusions: For these data, two-stage and one-stage approaches to analysis produce similar results. Although one-stage models offer a flexible environment for exploring model structure and are useful where across study patterns relating to types of participant, intervention and outcome mask similar relationships within trials, the additional insights provided by their usage may not outweigh the costs of statistical support for routine application in syntheses of randomised controlled trials. Researchers considering undertaking an IPD meta-analysis should not necessarily be deterred by a perceived need for sophisticated statistical methods when combining information from large randomised trials.
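
    To make the “two-stage” idea concrete, here is a minimal sketch: stage 1 reduces each trial to a log relative risk with a standard error, and stage 2 combines them with inverse-variance weights. The fixed-effect weighting, the function names, and the trial counts are all assumptions for illustration; the study's own models are not specified here and may differ.

```python
import math


def log_relative_risk(events_t, n_t, events_c, n_c):
    """Stage 1: log relative risk and its standard error for one trial."""
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return log_rr, se


def pool_fixed_effect(estimates):
    """Stage 2: inverse-variance weighted average of (log_rr, se) pairs."""
    weights = [1 / se**2 for _, se in estimates]
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical trial counts: (events treated, n treated, events control, n control).
trials = [(30, 500, 36, 500), (12, 200, 15, 210), (45, 900, 50, 880)]
pooled, se = pool_fixed_effect([log_relative_risk(*t) for t in trials])
low, high = math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
print(f"RR = {math.exp(pooled):.2f} (95% CI {low:.2f} to {high:.2f})")
```

    A one-stage analysis would instead fit a single model to all participants at once, which is where the extra statistical support mentioned in the abstract comes in.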

  7. Persian Answer Selection Datasets

    • kaggle.com
    zip
    Updated Jul 10, 2022
    Cite
    Negin Abadani (2022). Persian Answer Selection Datasets [Dataset]. https://www.kaggle.com/datasets/neginabadani/persian-answer-selection-datasets
    Available download formats
    zip (21613915 bytes)
    Dataset updated
    Jul 10, 2022
    Authors
    Negin Abadani
    Description

    Persian Answer Selection Datasets

    Recent developments in Answer Selection (QA) have improved state-of-the-art results, and various datasets have been released for this task. Due to the lack of Persian factoid open-domain datasets, less research has been done on the latter language, making comparisons difficult. We present four Persian Answer Selection datasets, three of which have been generated by translating three benchmark datasets (WikiQA, TrecQA Raw, TrecQA Clean) and one has been generated using a native Question Answering dataset for Persian QA, namely PersianQuAD. In addition, we have trained two baseline models, i.e., BERT and RoBERTa, on each dataset.

    Dataset

    The statistics of the four datasets are shown below:

    | Dataset      | Split | No. of questions | No. of candidate answers | Average No. of candidate answers |
    |--------------|-------|------------------|--------------------------|----------------------------------|
    | TrecQA-Raw   | Train | 1229             | 53417                    | 43.46                            |
    | TrecQA-Raw   | Dev   | 81               | 1148                     | 14.97                            |
    | TrecQA-Raw   | Test  | 95               | 1517                     | 15.97                            |
    | TrecQA-Clean | Train | 1229             | 53147                    | 43.46                            |
    | TrecQA-Clean | Dev   | 65               | 1117                     | 17.18                            |
    | TrecQA-Clean | Test  | 68               | 1442                     | 21.21                            |
    | WikiQA       | Train | 2118             | 20360                    | 9.61                             |
    | WikiQA       | Dev   | 296              | 2733                     | 9.23                             |
    | WikiQA       | Test  | 633              | 6165                     | 9.74                             |
    | PersianQuAD  | Train | 14078            | 110321                   | 7.84                             |
    | PersianQuAD  | Dev   | 3499             | 27759                    | 7.93                             |
    | PersianQuAD  | Test  | 996              | 7753                     | 7.78                             |

    Evaluation

    In order to evaluate the quality of our dataset and compare it with the benchmark English datasets, two QA models have been trained. The pre-trained BERT (ParsBERT for the Persian datasets) and RoBERTa models have been fine-tuned on all datasets. We evaluate each model using two widely used automatic evaluation metrics, MAP and MRR.

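    | Dataset                 | Model   | MAP   | MRR   |
    |-------------------------|---------|-------|-------|
    | TrecQA-Raw              | BERT    | 0.913 | 0.960 |
    | TrecQA-Raw              | RoBERTa | 0.927 | 0.970 |
    | TrecQA-Raw-Translated   | BERT    | 0.855 | 0.921 |
    | TrecQA-Raw-Translated   | RoBERTa | 0.829 | 0.887 |
    | TrecQA-Clean            | BERT    | 0.886 | 0.947 |
    | TrecQA-Clean            | RoBERTa | 0.905 | 0.961 |
    | TrecQA-Clean-Translated | BERT    | 0.815 | 0.895 |
    | TrecQA-Clean-Translated | RoBERTa | 0.774 | 0.854 |
    | WikiQA                  | BERT    | 0.797 | 0.809 |
    | WikiQA                  | RoBERTa | 0.811 | 0.823 |
    | WikiQA-Translated       | BERT    | 0.729 | 0.744 |
    | WikiQA-Translated       | RoBERTa | 0.670 | 0.686 |
    | PersianQuAD             | BERT    | 0.846 | 0.867 |
    | PersianQuAD             | RoBERTa | 0.805 | 0.827 |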
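
    For readers unfamiliar with the two metrics reported above, the sketch below shows how MAP (mean average precision) and MRR (mean reciprocal rank) are typically computed from ranked candidate answers. It is not the authors' evaluation script. The function names and toy rankings are illustrative, and scoring a question with no correct candidate as 0 is an assumption (some evaluations drop such questions instead).

```python
def average_precision(labels):
    """labels: 0/1 relevance flags for one question's candidates, best-ranked first."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first correct answer, or 0.0 if none is correct."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def map_and_mrr(ranked_labels_per_question):
    """Average both per-question scores over all questions."""
    n = len(ranked_labels_per_question)
    mean_ap = sum(average_precision(q) for q in ranked_labels_per_question) / n
    mrr = sum(reciprocal_rank(q) for q in ranked_labels_per_question) / n
    return mean_ap, mrr


# Toy rankings for two questions (1 = correct answer, 0 = incorrect).
print(map_and_mrr([[0, 1, 1], [1, 0, 0]]))
```

    MRR only rewards the position of the first correct answer, while MAP accounts for every correct answer in the ranking, which is why MRR is never lower than MAP in the table.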

    Citation

    Plain

    Will be added soon...

    Bibtex

    Will be added soon...

  8. Statistics of cricket dataset.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Sep 20, 2024
    Cite
    Shihab Ahmed; Moythry Manir Samia; Maksuda Haider Sayma; Md. Mohsin Kabir; M. F. Mridha (2024). Statistics of cricket dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0308050.t002
    Available download formats
    xls
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Shihab Ahmed; Moythry Manir Samia; Maksuda Haider Sayma; Md. Mohsin Kabir; M. F. Mridha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the surge in reviews and comments on newspapers and social media has made sentiment analysis a focal point of interest for researchers. Sentiment analysis is also gaining popularity in the Bengali language. However, Aspect-Based Sentiment Analysis is considered a difficult task in the Bengali language due to the shortage of perfectly labeled datasets and the complex variations in the Bengali language. This study used two open-source benchmark datasets of the Bengali language, Cricket, and Restaurant, for our Aspect-Based Sentiment Analysis task. The original work was based on the Random Forest, Support Vector Machine, K-Nearest Neighbors, and Convolutional Neural Network models. In this work, we used the Bidirectional Encoder Representations from Transformers, the Robustly Optimized BERT Approach, and our proposed hybrid transformative Random Forest and Bidirectional Encoder Representations from Transformers (tRF-BERT) models to compare the results with the existing work. After comparing the results, we can clearly see that all the models used in our work achieved better results than any of the previous works on the same dataset. Amongst them, our proposed transformative Random Forest and Bidirectional Encoder Representations from Transformers achieved the highest F1 score and accuracy. The accuracy and F1 score of aspect detection for the Cricket dataset were 0.89 and 0.85, respectively, and for the Restaurant dataset were 0.92 and 0.89 respectively.

  9. COVID-19 and Influenza | New York Datasets

    • kaggle.com
    zip
    Updated May 9, 2020
    Cite
    Angel Henriquez (2020). COVID-19 and Influenza | New York Datasets [Dataset]. https://www.kaggle.com/datasets/angelhenriquez1/covid19-influenza-newyorkdatasets/discussion
    Available download formats
    zip (648794 bytes)
    Dataset updated
    May 9, 2020
    Authors
    Angel Henriquez
    Description

    Context

    New York has reported the most cases of any state in the U.S. There have also been claims that the flu has caused a greater, largely unnoticed impact. My dataset lets us check whether this is true according to the most recent data.

    Content

    This COVID-19 data is from Kaggle whereas the New York influenza data comes from the U.S. government health data website. I merged the two datasets by county and FIPS code and listed the most recent reports of 2020 COVID-19 cases and deaths alongside the 2019 known influenza cases for comparison.
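
    The merge described above amounts to an inner join on the county and FIPS code. A minimal standard-library sketch follows; the column names (`fips`, `county`, `covid_cases_2020`, `covid_deaths_2020`, `flu_cases_2019`) and the rows are hypothetical and may not match the actual files.

```python
def merge_by_county(covid_rows, flu_rows):
    """Inner-join two lists of dict rows on the (fips, county) pair."""
    flu_index = {(row["fips"], row["county"]): row for row in flu_rows}
    merged = []
    for row in covid_rows:
        key = (row["fips"], row["county"])
        if key in flu_index:
            # Keep the key columns once and add the flu count alongside.
            merged.append({**row, "flu_cases_2019": flu_index[key]["flu_cases_2019"]})
    return merged


# Made-up rows; the real files and column names may differ.
covid = [
    {"fips": "36001", "county": "Albany", "covid_cases_2020": 100, "covid_deaths_2020": 5},
    {"fips": "36005", "county": "Bronx", "covid_cases_2020": 900, "covid_deaths_2020": 60},
]
flu = [{"fips": "36001", "county": "Albany", "flu_cases_2019": 250}]
print(merge_by_county(covid, flu))
```

    Keeping FIPS codes as strings avoids losing leading zeros, and joining on both columns guards against counties whose names are spelled differently in the two sources.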

    Acknowledgements

    I am thankful to Kaggle and the U.S. government for making the data that made this possible openly available.

    Inspiration

    This data can be extended to answer the common misconceptions of the scale of the COVID-19 and common flu. My inspiration stems from supporting conclusions with data rather than simply intuition.

    I would like my data to help answer how we can make U.S. citizens realize what diseases are most impactful.

  10. Data from: Temporal and Spatio-Temporal High-Resolution Satellite Data for...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 20, 2025
    Cite
    U.S. Geological Survey (2025). Temporal and Spatio-Temporal High-Resolution Satellite Data for the Validation of a Landsat Time-Series of Fractional Component Cover Across Western United States (U.S.) Rangelands [Dataset]. https://catalog.data.gov/dataset/temporal-and-spatio-temporal-high-resolution-satellite-data-for-the-validation-of-a-landsa
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Western United States
    Description

    Western U.S. rangelands have been quantified as six fractional cover (0-100%) components over the Landsat archive (1985-2018) at 30-m resolution, termed the “Back-in-Time” (BIT) dataset. Robust validation through space and time is needed to quantify product accuracy. We leverage field data observed concurrently with HRS imagery over multiple years and locations in the Western U.S. to dramatically expand the spatial extent and sample size of validation analysis relative to a direct comparison to field observations and to previous work. We compare HRS and BIT data in the corresponding space and time. Our objectives were to evaluate the temporal and spatio-temporal relationships between HRS and BIT data, and to compare their response to spatio-temporal variation in climate. We hypothesize that strong temporal and spatio-temporal relationships will exist between HRS and BIT data and that they will exhibit similar climate response. We evaluated a total of 42 HRS sites across the western U.S. with 32 sites in Wyoming, and 5 sites each in Nevada and Montana. HRS sites span a broad range of vegetation, biophysical, climatic, and disturbance regimes. Our HRS sites were strategically located to collectively capture the range of biophysical conditions within a region. Field data were used to train 2-m predictions of fractional component cover at each HRS site and year. The 2-m predictions were degraded to 30-m, and some were used to train regional Landsat-scale, 30-m, “base” maps of fractional component cover representing circa 2016 conditions. A Landsat-imagery time-series spanning 1985-2018, excluding 2012, was analyzed for change through time. Pixels and times identified as changed from the base were trained using the base fractional component cover from the pixels identified as unchanged. Changed pixels were labeled with the updated predictions, while the base was maintained in the unchanged pixels. 
The resulting BIT suite includes the fractional cover of the six components described above for 1985-2018. We compare the two datasets, HRS and BIT, in space and time. Two tabular data presented here correspond to a temporal and spatio-temporal validation of the BIT data. First, the temporal data are HRS and BIT component cover and climate variable means by site by year. Second, the spatio-temporal data are HRS and BIT component cover and associated climate variables at individual pixels in a site-year.

  11. Data from: Data Used to Compare Photo-Interpreted and IfSAR-Derived Maps of...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Data Used to Compare Photo-Interpreted and IfSAR-Derived Maps of Polar Bear Denning Habitat for the 1002 Area of the Arctic National Wildlife Refuge, Alaska, 2006-2016 [Dataset]. https://catalog.data.gov/dataset/data-used-to-compare-photo-interpreted-and-ifsar-derived-maps-of-polar-bear-denning-h-2006
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Arctic, Arctic National Wildlife Refuge, Alaska
    Description

    These are geospatial data that characterize the distribution of polar bear denning habitat in the 1002 Area of the Arctic National Wildlife Refuge, Alaska. They were generated to compare the efficacy of two different techniques for identifying areas with suitable den habitat: (1) from a previously published study (Durner et al., 2006) that used manual interpretation of aerial photos and (2) from computer interrogation of interferometric synthetic aperture radar (IfSAR) digital terrain models. Two datasets are included in this data package; both are vector geospatial datasets of putative denning habitat (one dataset each for the manual photo interpretation data and the computer-interpreted IfSAR data). Also included are vector data used for sampling and metadata describing the IfSAR-derived digital terrain model (DTM) tiles used to generate the shapefiles. The IfSAR DTM are available for purchase through Intermap Technologies, Inc. All vector data are provided in both ESRI shapefile and Keyhole Markup Language (KML) formats.

  12. Public Health Indicators in Chicago

    • kaggle.com
    zip
    Updated Jan 24, 2023
    Cite
    The Devastator (2023). Public Health Indicators in Chicago [Dataset]. https://www.kaggle.com/datasets/thedevastator/public-health-indicators-in-chicago
    Available download formats
    zip (5864 bytes)
    Dataset updated
    Jan 24, 2023
    Authors
    The Devastator
    Area covered
    Chicago
    Description

    Public Health Indicators in Chicago

    Natality, Mortality, Infectious Disease, Lead Poisoning and Economic Status

    By City of Chicago [source]

    About this dataset

    This public health dataset contains a comprehensive selection of indicators related to natality, mortality, infectious disease, lead poisoning, and economic status from Chicago community areas. It is an invaluable resource for those interested in understanding the current state of public health within each area in order to identify any deficiencies or areas of improvement needed.

    The data includes 27 indicators such as birth and death rates, prenatal care beginning in first trimester percentages, preterm birth rates, breast cancer incidences per hundred thousand female population, all-sites cancer rates per hundred thousand population and more. For each indicator provided it details the geographical region so that analyses can be made regarding trends on a local level. Furthermore this dataset allows various stakeholders to measure performance along these indicators or even compare different community areas side-by-side.

    This dataset provides a valuable tool for those striving toward better public health outcomes for the citizens of Chicago's communities by allowing greater insight into trends specific to geographic regions that could potentially lead to further research and implementation practices based on empirical evidence gathered from this comprehensive yet digestible selection of indicators

    How to use the dataset

    In order to use this dataset effectively to assess the public health of a given area or areas in the city:

    - Understand which data is available: The list of data included in this dataset can be found above. It is important to know all the indicators that are included, as well as their definitions, so that accurate conclusions can be drawn when using the data for research or analysis.
    - Identify areas of interest: Once you are familiar with what type of data is present, it can help to identify which community areas you would like to study more closely or compare with one another.
    - Choose your variables: Once you have identified your areas, decide which variables are most relevant to your study and formulate specific research questions about these variables based on what you are trying to learn from this data set.
    - Analyze the data: Once your variables have been selected and clarified, move on to analyzing the corresponding values across different community areas using statistical tests such as t-tests or correlations. This will help answer questions like “Are there significant differences between two outputs?” and let you compare how different Chicago community areas stack up against each other on the public health statistics tracked by this dataset.
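
    As one example of the t-test step, the sketch below computes Welch's t statistic (which does not assume equal variances) for an indicator in two groups of community areas. The birth-rate values and the grouping are hypothetical and the function name is illustrative; the resulting t and degrees of freedom would still need to be compared against a t distribution to obtain a p-value.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    var_a, var_b = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a**2 / (len(a) - 1) + var_b**2 / (len(b) - 1)
    )
    return t, df


# Hypothetical birth rates for two groups of community areas.
group_a = [15.1, 16.4, 14.8, 17.0, 15.9]
group_b = [12.2, 13.5, 12.9, 14.1, 13.0]
t, df = welch_t(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

    If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.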

    Research Ideas

    • Creating interactive maps that show data on public health indicators by Chicago community area to allow users to explore the data more easily.
    • Designing a machine learning model to predict future variations in public health indicators by Chicago community area such as birth rate, preterm births, and childhood lead poisoning levels.
    • Developing an app that enables users to search for public health information in their own community areas and compare with other areas within the city or across different cities in the US

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv

    | Column name            | Description                                                     |
    |:-----------------------|:----------------------------------------------------------------|
    | Community Area         | Unique identifier for each community area in Chicago. (Integer) |
    | Community Area Name    | Name of the community area in Chicago. (String)                 |
    | Birth Rate             | Number of live births per 1,000 population. (Float)             |
    | General Fertility Rate | Number of live births per 1,000 women aged 15-44. (Float)       |
    ...

  13. Replication Data for: Exploring Disagreement in Indicators of State...

    • datasetcatalog.nlm.nih.gov
    • dataverse.harvard.edu
    • +2more
    Updated May 30, 2018
    Cite
    Crabtree, Charles (2018). Replication Data for: Exploring Disagreement in Indicators of State Repression [Dataset]. http://doi.org/10.7910/DVN/V5LB9K
    Dataset updated
    May 30, 2018
    Authors
    Crabtree, Charles
    Description

    Until recently, researchers who wanted to examine the determinants of state respect for most specific negative rights needed to rely on data from the CIRI or the Political Terror Scale (PTS). The new V-DEM dataset offers scholars a potential alternative to the individual human rights variables from CIRI. We analyze a set of key Cingranelli-Richards (CIRI) Human Rights Data Project and Varieties of Democracy (V-DEM) negative rights indicators, finding unusual and unexpectedly large patterns of disagreement between the two sets. First, we discuss the new V-DEM dataset by comparing it to the disaggregated CIRI indicators, discussing the history of each project, and describing its empirical domain. Second, we identify a set of disaggregated human rights measures that are similar across the two datasets and discuss each project's measurement approach. Third, we examine how these measures compare to each other empirically, showing that they diverge considerably across both time and space. These findings point to several important directions for future work, such as how conceptual approaches and measurement strategies affect rights scores. For the time being, our findings suggest that researchers should think carefully about using the measures as substitutes.

  14. Social Media Posts - Fortune 1000 Companies

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Jarred Gaudineer (2025). Social Media Posts - Fortune 1000 Companies [Dataset]. https://www.kaggle.com/datasets/jarredgaudineer/social-media-posts-fortune-1000-companies
    Explore at:
    Available download formats: zip (4523525162 bytes)
    Dataset updated
    Apr 11, 2025
    Authors
    Jarred Gaudineer
    Description

    Update 17Mar2025: I'm working scraped BlueSky posts into this dataset as well. If there are enough of them, I will start phasing out the scraping of X posts. I'm not convinced that X posts represent public sentiment at this time.

    About Dataset

    Context This is a dataset of X and Reddit posts and comments mentioning Fortune 1000 companies. It contains several hundred thousand posts and comments extracted using the X and Reddit APIs.

    Content It contains the following fields:

    id: Unique ID assigned to each post and comment.

    text: The text of the post or comment.

    author: Unique identifier of the author of the post.

    created at: Date on which the post or comment was made.

    likes: post likes

    retweets (X only): retweets

    replies (X only): post replies

    views: post views

    engagement_rate: represents the relative engagement of the post or comment.

    subreddit (Reddit only): Identifies the subreddit from which the post or comment came.

    score (Reddit only): Total of upvotes and downvotes.

    upvote_ratio (Reddit only): Upvote to downvote ratio

    num_comments (Reddit only): Number of post comments.

    Methods Scraping runs 24/7. Data is compiled into the dataset once per business day. Posts are scraped from Reddit and X, but only Reddit is scraped for comments. Comments are only scraped if they are on a post that mentions a Fortune 1000 company, and only if they also mention a Fortune 1000 company.

    Each business day, raw data is compiled into a dataset file. Those are the files posted here, labelled with the date they were compiled. At compilation, data is deduplicated, and all posts and comments older than 60 days are deleted. Hence, if you compare two dataset files posted here, there will be data overlap. If you would like data from a date range wider than 60 days, you will need to deduplicate between files.
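
Since consecutive files overlap, combining them means keeping a single row per post or comment. The following is a minimal sketch of that step, assuming the files are CSVs that carry the `id` field listed above. The sample contents and the choice to let the newer file win on conflict are illustrative assumptions, not part of the dataset description.

```python
import csv
import io

def merge_deduplicated(csv_texts, key="id"):
    """Merge rows from several CSV documents, keeping one row per key.

    Later documents overwrite earlier ones, on the assumption that a
    newer file holds fresher engagement counts for the same post.
    """
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            merged[row[key]] = row
    return list(merged.values())

# Hypothetical file contents, for illustration only.
day_1 = "id,text,likes\na1,hello,3\na2,world,5\n"
day_2 = "id,text,likes\na2,world,9\na3,again,1\n"

rows = merge_deduplicated([day_1, day_2])
```

With real files, each element of `csv_texts` would be the text read from one dated file, passed in chronological order.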

    Citation All content belongs to the original authors. I neither own nor claim any part of this dataset. All posts contained in this dataset were public at the time of capture. Please contact me to have any content removed.

    You are free to use this dataset for any legal, noncommercial purpose. It is not necessary to cite this dataset, but if you wish to, you can cite:

    Gaudineer, J. L., 2025. Social Media Posts- Fortune 1000 Companies.

  15. Discovering Hidden Trends in Global Video Games

    • kaggle.com
    zip
    Updated Dec 3, 2022
    Cite
    The Devastator (2022). Discovering Hidden Trends in Global Video Games [Dataset]. https://www.kaggle.com/datasets/thedevastator/discovering-hidden-trends-in-global-video-games
    Explore at:
    Available download formats: zip (57229 bytes)
    Dataset updated
    Dec 3, 2022
    Authors
    The Devastator
    Description

    Discovering Hidden Trends in Global Video Games Sales

    Platforms, Genres, and Profitable Regions

    By Andy Bramwell [source]

    About this dataset

    This dataset contains sales data for video games from all around the world, across different platforms, genres and regions. From the latest RPG releases to racing games, it provides insight into what constitutes a hit game in today’s gaming industry. Armed with this data, developers can better understand which types of gameplay and mechanics resonate with players. Covering a range of game titles, genres and platforms, the dataset offers insight into how video games achieve global success and into the changing trends of gaming culture.

    How to use the dataset

    This dataset can be used to uncover hidden trends in Global Video Games Sales. To make the most of this data, it is important to understand the different columns and their respective values.

    The 'Rank' column identifies each game's ranking according to its global sales (highest to lowest). This can help you identify which games are most popular globally. The 'Game Title' column contains the name of each video game, which allows you to easily discern one entry from another. The 'Platform' column lists the type of platform on which each game was released, e.g., PlayStation 4 or Xbox One, so that you can make comparisons between platforms as well as specific games for each platform. The 'Year' column provides an additional way of making year-on-year comparisons and tracking changes over time in global video game sales.
    In addition, this dataset contains metadata such as genre ('Genre'), publisher ('Publisher'), and review score ('Review') that add context when considering a particular title's performance in terms of global sales rankings. For example, it might be more compelling to compare two similar genres than two disparate ones when analyzing how successful a select set of titles has been at generating revenue compared with others released globally in the same period. Lastly but no less important are the five variables dedicated to geographic breakdowns: North America ('North America'), Europe ('Europe'), Japan ('Japan'), Rest of World ('Rest of World'), and Global ('Global'). These show how individual regions contribute to a given title's overall sales figures; comparing them regionally or collectively supports inferences about consumer preferences and supplier priorities.

    Overall this powerful dataset allows researchers and marketers alike a deep dive into market performance for those persistent questions about demand patterns across demographics around the world!

    Research Ideas

    • Analyzing the effects of genre and platform on a game's success - By comparing different genres and platforms, one can get a better understanding of what type of games have the highest sales in different regions across the globe. This could help developers decide which type of gaming content to create in order to maximize their profits.
    • Tracking changes in global video game trends over time - This dataset could be used to analyze how elements such as genre or platform affect success across years, giving developers an inside look at what kinds of video games are favored at any given moment across the world.
    • Identifying highly successful games and their key elements - Developers could look at this data to find common factors, such as publisher or platform, shared by successful titles, and so uncover characteristics that lead to a high rate of return when creating video games or other forms of media entertainment.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: Video Games Sales.csv

    | Column name | Description |
    |:--|:--|
    | Rank | The ranking of the game in terms of global sales. (Integer) |
    | Game Title | The title of the game. (String) |
    | Platform | The platform the game was released on. (String) |
    ...

  16. 🎮 Top PC Games: Metacritic vs Steam Popularity

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    AlyAhmedTS13 (2025). 🎮 Top PC Games: Metacritic vs Steam Popularity [Dataset]. https://www.kaggle.com/datasets/alyahmedts13/top-pc-games-metacritic-vs-steam-popularity
    Explore at:
    Available download formats: zip (3881552 bytes)
    Dataset updated
    Nov 6, 2025
    Authors
    AlyAhmedTS13
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🕹️ About Dataset

    🎯 Context

    Is Metacritic still the ultimate standard for PC game ratings?
    How do the highest-rated games compare to what’s actually popular on Steam?

    These were the questions that inspired me to create this dataset.

    I scraped data from Metacritic to gather the top-rated PC games of all time, and used the SteamSpy API to collect data on around 10,000 of the most popular Steam games.

    Now, you can explore and analyze how critical acclaim differs from player popularity.
    Which genres dominate both lists? Do high ratings guarantee popularity? Let’s find out.

    📦 Content

    This dataset contains two separate CSV files:

    🟩 metacritic_Toppc_games.csv

    Top-rated PC games of all time from Metacritic.
    Each game includes the following attributes:

    | Column | Description |
    |:--|:--|
    | Name | Official title of the game on Metacritic. |
    | Release_Date | Original release date of the game. |
    | Rating | Age or content rating (e.g., M, T, E10+). |
    | Description | Short description or summary of the game. |
    | Score | Metacritic critic score (0–100). |

    🔵 steam_spy_data.csv

    Contains around 10,000 popular PC games from Steam, collected using the SteamSpy API.
    Each game includes the following attributes:

    | Column | Description |
    |:--|:--|
    | appid | Unique Steam App ID assigned to each game. |
    | name | Official game title on Steam. |
    | developer | Developer or studio responsible for the game. |
    | publisher | Company that published the game on Steam. |
    | score_rank | SteamSpy’s ranking score (if available). |
    | positive | Number of positive user reviews. |
    | negative | Number of negative user reviews. |
    | userscore | SteamSpy user score value (0 if unavailable). |
    | owners | Estimated ownership range (e.g., “1,000,000 .. 2,000,000”). |
    | average_forever | Average total playtime (in minutes). |
    | average_2weeks | Average playtime (in minutes) during the last two weeks. |
    | median_forever | Median total playtime (in minutes). |
    | median_2weeks | Median playtime (in minutes) during the last two weeks. |
    | price | Current game price in cents (e.g., 999 = $9.99). |
    | initialprice | Original game price before discounts (in cents). |
    | discount | Current discount percentage (0–100). |
    | ccu | Current number of concurrent players. |

    🙏 Acknowledgements

    • The Metacritic data was scraped directly from the official Metacritic PC games page.
    • The Steam data was gathered using the SteamSpy API.
      All data is publicly available for research and educational purposes.

    💡 Inspiration

    You can use this dataset to:
    - Compare critic scores vs. player popularity
    - Analyze pricing vs. ratings or playtime
    - Explore the relationship between release year and game success
    - Identify genre trends across Metacritic and Steam

    Cleaned Versions

    Two cleaned datasets are now available:
    - metacritic_Toppc_games_clean.csv
    - steam_spy_data_clean.csv

    These files have been processed using SQL logic to ensure consistency, usability, and readiness for exploratory data analysis (EDA).

    🎮 Metacritic Top PC Games Cleaning Steps

    • Release Date Standardization

      • Converted release_Date from string format (%b %d, %Y) to SQL DATE type.
    • Score Column Cleanup

      • Removed the 'Metascore' label from the score column.
      • Converted score to INT and replaced nulls with 0.
    • Rating Imputation

      • Replaced null values in rating with 'N/A'.
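
The dataset author describes these steps as done with SQL logic. As a rough Python equivalent (not the author's code), the sketch below applies the same three rules to one raw row. The function name, the sample values, and the assumption that the raw score arrives as a string such as "Metascore 91" are all hypothetical.

```python
from datetime import datetime, date

def clean_metacritic_row(release_date, score, rating):
    """Apply the three cleaning steps to one raw row (illustrative only)."""
    # "%b %d, %Y" matches strings such as "Nov 10, 2020".
    parsed = datetime.strptime(release_date, "%b %d, %Y").date()
    # Drop a 'Metascore' label if present, then cast; missing scores become 0.
    digits = (score or "").replace("Metascore", "").strip()
    clean_score = int(digits) if digits else 0
    # Missing ratings become the literal string 'N/A'.
    clean_rating = rating if rating else "N/A"
    return parsed, clean_score, clean_rating

# Hypothetical raw values, for illustration only.
example = clean_metacritic_row("Nov 10, 2020", "Metascore 91", None)
```

Note that replacing a missing score with 0 makes it indistinguishable from a genuine score of 0, so averages computed on the cleaned column should account for that.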

    🕹️ Steam Popular Games Cleaning Steps

    • Null Handling

      • Replaced missing values in developer and publisher with 'N/A'.
    • Column Pruning

      • Dropped score_rank and userscore due to excessive missing values.
    • Price Conversion

      • Converted price and initialprice from cents to USD using TRUNCATE(price/100, 2).
      • Renamed columns for clarity:
        • price → price_USD
        • initialprice → initialprice_USD
        • discount → discount_%
    • **Gamep...
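
The price conversion step, written in the listing as `TRUNCATE(price/100, 2)`, can be mirrored in Python as follows. This is a sketch only: the function name is hypothetical, and `Decimal` is used here so that the result is truncated rather than rounded.

```python
from decimal import Decimal, ROUND_DOWN

def cents_to_usd(cents):
    """Convert an integer price in cents to dollars, truncated to 2 decimals."""
    dollars = Decimal(cents) / Decimal(100)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_DOWN)

# 999 cents is the example given in the column description.
price_usd = cents_to_usd(999)
```

For whole-cent inputs truncation and rounding agree; the distinction only matters if fractional cents ever appear in the source data.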

  17. Search Engines Comparison and Websites Performance

    • kaggle.com
    • data.niaid.nih.gov
    zip
    Updated Jun 30, 2023
    Cite
    Georgios Ntimo (2023). Search Engines Comparison and Websites Performance [Dataset]. https://www.kaggle.com/datasets/georgiosntimo/search-engines-comparison-and-websites-performance
    Explore at:
    Available download formats: zip (34133 bytes)
    Dataset updated
    Jun 30, 2023
    Authors
    Georgios Ntimo
    Description

    The dataset consists of 200 search results extracted from the Google and Bing search engines (100 from each). The search terms are the 10 most searched keywords of 2021 according to Google Trends. The remaining sheets record the performance of the websites on three technical evaluation aspects: SEO, Speed and Security. The performance data was produced with the CheckBot crawling tool. The dataset can help information retrieval scientists compare the two engines in terms of result position/ranking and website performance on these factors.

    For more information about the reasoning behind the structure of the dataset, please contact the Information Management Lab of the University of West Attica.

    Contact Persons: Vasilis Ntararas (lb17032@uniwa.gr) , Georgios Ntimo (lb17100@uniwa.gr) and Ioannis C. Drivas (idrivas@uniwa.gr)

  18. Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual...

    • physionet.org
    Updated Feb 3, 2025
    Cite
    Xinyue Hu; Lin Gu; Qiyuan An; Mengliang Zhang; liangchen liu; Kazuma Kobayashi; Tatsuya Harada; Ronald Summers; Yingying Zhu (2025). Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images [Dataset]. http://doi.org/10.13026/e6dd-cn74
    Explore at:
    Dataset updated
    Feb 3, 2025
    Authors
    Xinyue Hu; Lin Gu; Qiyuan An; Mengliang Zhang; liangchen liu; Kazuma Kobayashi; Tatsuya Harada; Ronald Summers; Yingying Zhu
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    The task of Difference Visual Question Answering involves answering questions about the difference between a pair of main and reference images. This process is consistent with the radiologist's diagnosis practice that compares the current image with the reference before concluding the report. We've assembled a new dataset, called Medical-Diff-VQA, for this purpose. Unlike previous medical VQA datasets, ours is the first one designed specifically for the Difference Visual Question Answering task, with questions crafted to suit the Assessment-Diagnosis-Intervention-Evaluation treatment procedure employed by medical professionals. The Medical-Diff-VQA dataset, a derivative of the MIMIC-CXR dataset, consists of questions categorized into seven categories: abnormality (145,421), location (84,193), type (27,478), level (67,296), view (56,265), presence (155,726), and difference (164,324). The 'difference' questions are specifically for comparing two images. In total, the Medical-Diff-VQA dataset contains 700,703 question-answer pairs derived from 164,324 pairs of main and reference images.
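
As a quick consistency check on the figures quoted above, the seven per-category counts can be summed and compared with the stated total. The dictionary below simply restates the numbers from the description; the variable names are illustrative.

```python
# Question counts per category, as quoted in the dataset description.
category_counts = {
    "abnormality": 145_421,
    "location": 84_193,
    "type": 27_478,
    "level": 67_296,
    "view": 56_265,
    "presence": 155_726,
    "difference": 164_324,
}

total_pairs = sum(category_counts.values())
```

The sum comes to 700,703, matching the total number of question-answer pairs, and the 'difference' count equals the number of image pairs.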

  19. CT Land Cover Viewer

    • data.ct.gov
    csv, xlsx, xml
    Updated Jun 8, 2023
    Cite
    UConn (2023). CT Land Cover Viewer [Dataset]. https://data.ct.gov/Environment-and-Natural-Resources/CT-Land-Cover-Viewer/b7c8-vb26
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Jun 8, 2023
    Dataset authored and provided by
    UConn
    Area covered
    Connecticut
    Description
    This viewer is available on CT ECO from UConn CLEAR. The viewer contains Connecticut's statewide land cover in an easy-to-explore tool.

    The Connecticut Land Cover Viewer contains seven dates of land cover between 1985 and 2015, including "change to" and "change from" layers. These layers are in the bottom half of the layer list and start with a year, such as 1985 Land Cover.

    For each major land cover class (forest, ag field, developed, turf & grass) there are summary stats by town shown as layers in the viewer with color ramps. The darker the color, the higher the presence. Summary stats include change by town as well, where more change is shown with darker colors. These layers are in the top half of the viewer.

    The viewer also contains forest fragmentation layers from 1985 and 2015.

    More about land cover on the CLEAR website, including Number and Charts data visualizations.

    Use
    To use the viewer, use the Layer List (upper right) to turn on and off layers (remember to turn OFF the ones above on the list or they will hide layers below) to compare and explore the area. The swipe tool (lower left) is a fun way to compare two datasets. Be sure at least two items are checked on in the layer list and use the swipe tool to compare. Refer to Viewer Help for more details and tips.

    Tips
    - compare dates of land cover by turning them on and off in the layer list, or using the swipe tool
    - for any year of land cover (the layers are called 1985 Land Cover, 1990 Land Cover, etc.), click on the little arrow to the left in the table of contents to see layers that show just main land cover classes. This is a good way to explore just forest land cover or just developed land cover - you get the point!
    - land cover from satellite imagery at this resolution can look "fuzzy" compared to high resolution datasets. It is coarse but is an excellent way, and perhaps the only way, to look at change over 30 years.
    - visit the Land Cover FAQs for more information.

  20. Customer Satisfaction Scores and Behavior Data

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    Salahuddin Ahmed (2025). Customer Satisfaction Scores and Behavior Data [Dataset]. https://www.kaggle.com/datasets/salahuddinahmedshuvo/customer-satisfaction-scores-and-behavior-data/discussion
    Explore at:
    Available download formats: zip (2456 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    Salahuddin Ahmed
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains customer satisfaction scores collected from a survey, alongside key demographic and behavioral data. It includes variables such as customer age, gender, location, purchase history, support contact status, loyalty level, and satisfaction factors. The dataset is designed to help analyze customer satisfaction, identify trends, and develop insights that can drive business decisions.

    File Information: File Name: customer_satisfaction_data.csv (or your specific file name)

    File Type: CSV (or the actual file format you are using)

    Number of Rows: 120

    Number of Columns: 10

    Column Names:

    Customer_ID – Unique identifier for each customer (e.g., 81-237-4704)

    Group – The group to which the customer belongs (A or B)

    Satisfaction_Score – Customer's satisfaction score on a scale of 1-10

    Age – Age of the customer

    Gender – Gender of the customer (Male, Female)

    Location – Customer's location (e.g., Phoenix.AZ, Los Angeles.CA)

    Purchase_History – Whether the customer has made a purchase (Yes or No)

    Support_Contacted – Whether the customer has contacted support (Yes or No)

    Loyalty_Level – Customer's loyalty level (Low, Medium, High)

    Satisfaction_Factor – Primary factor contributing to customer satisfaction (e.g., Price, Product Quality)

    Statistical Analyses:

    Descriptive Statistics:

    Calculate mean, median, mode, standard deviation, and range for key numerical variables (e.g., Satisfaction Score, Age).

    Summarize categorical variables (e.g., Gender, Loyalty Level, Purchase History) with frequency distributions and percentages.
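
The descriptive statistics listed above can be computed with the Python standard library alone. This is a minimal sketch; the two lists are made-up stand-ins for the `Satisfaction_Score` and `Loyalty_Level` columns, not values from the dataset.

```python
from collections import Counter
from statistics import mean, median, mode, stdev

# Hypothetical values standing in for the Satisfaction_Score column.
scores = [7, 8, 6, 9, 7, 5, 6, 5, 7, 4]
numeric_summary = {
    "mean": mean(scores),
    "median": median(scores),
    "mode": mode(scores),
    "stdev": stdev(scores),  # sample standard deviation (n - 1 denominator)
    "range": max(scores) - min(scores),
}

# Hypothetical values standing in for the Loyalty_Level column.
loyalty = ["Low", "High", "Medium", "High", "Low",
           "High", "Medium", "High", "Low", "High"]
counts = Counter(loyalty)
percentages = {level: 100 * n / len(loyalty) for level, n in counts.items()}
```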

    Two-Sample t-Test (Independent t-test):

    Compare the mean satisfaction scores between two independent groups (e.g., Group A vs. Group B) to determine if there is a significant difference in their average satisfaction scores.
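
As an illustration of the test statistic, the sketch below computes Welch's version of the two-sample t-test, which does not assume equal variances in the two groups. That choice is an assumption made here; the description does not say which variant is intended. The sample scores are invented, and turning the statistic into a p-value would additionally require the t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical satisfaction scores for Group A and Group B.
group_a = [7, 8, 6, 9, 7]
group_b = [5, 6, 5, 7, 4]
t_stat, df = welch_t(group_a, group_b)
```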

    Paired t-Test:

    If there are two related measurements (e.g., satisfaction scores before and after a certain event), you can compare the means using a paired t-test.

    One-Way ANOVA (Analysis of Variance):

    Test if there are significant differences in mean satisfaction scores across more than two groups (e.g., comparing the mean satisfaction score across different Loyalty Levels).

    Chi-Square Test for Independence:

    Examine the relationship between two categorical variables (e.g., Gender vs. Purchase History or Loyalty Level vs. Support Contacted) to determine if there’s a significant association.
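
The chi-square statistic compares observed cell counts with the counts expected if the two variables were independent. The following is a small sketch of that calculation; the 2x2 table is hypothetical (rows could be Gender, columns Purchase_History), and the function returns only the statistic and degrees of freedom, not a p-value.

```python
def chi_square_statistic(table):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts summing to 120, the stated number of rows.
observed = [[30, 30], [20, 40]]
chi2, dof = chi_square_statistic(observed)
```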

    Mann-Whitney U Test:

    For non-normally distributed data, use this test to compare satisfaction scores between two independent groups (e.g., Group A vs. Group B) to see if their distributions differ significantly.
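
The U statistic behind this test is built from the ranks of the pooled observations. Below is a sketch that assigns average ranks to ties and returns U for both groups. The helper names and sample scores are illustrative, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b) for two independent samples."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical satisfaction scores for Group A and Group B.
u_a, u_b = mann_whitney_u([7, 8, 6, 9, 7], [5, 6, 5, 7, 4])
```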

    Kruskal-Wallis Test:

    Similar to ANOVA, but used for non-normally distributed data. This test can compare the median satisfaction scores across multiple groups (e.g., comparing satisfaction scores across Loyalty Levels or Satisfaction Factors).

    Spearman’s Rank Correlation:

    Test for a monotonic relationship between two ordinal or continuous variables (e.g., Age vs. Satisfaction Score or Satisfaction Score vs. Loyalty Level).

    Regression Analysis:

    Linear Regression: Model the relationship between a continuous dependent variable (e.g., Satisfaction Score) and independent variables (e.g., Age, Gender, Loyalty Level).

    Logistic Regression: If analyzing binary outcomes (e.g., Purchase History or Support Contacted), you could model the probability of an outcome based on predictors.

    Factor Analysis:

    To identify underlying patterns or groups in customer behavior or satisfaction factors, you can apply Factor Analysis to reduce the dimensionality of the dataset and group similar variables.

    Cluster Analysis:

    Use K-Means Clustering or Hierarchical Clustering to group customers based on similarity in their satisfaction scores and other features (e.g., Loyalty Level, Purchase History).

    Confidence Intervals:

    Calculate confidence intervals for the mean of satisfaction scores or any other metric to estimate the range in which the true population mean might lie.
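
A confidence interval for a mean can be sketched as the sample mean plus or minus a critical value times the standard error. The version below uses the normal approximation, which is an assumption made for simplicity; for small samples a t critical value would give a wider, more appropriate interval. The scores are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width

# Hypothetical values standing in for the Satisfaction_Score column.
scores = [7, 8, 6, 9, 7, 5, 6, 5, 7, 4]
low, high = mean_confidence_interval(scores)
```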
