100+ datasets found
  1. Data from: Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
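
    The procedure described above maps naturally onto code. Below is a hypothetical Python sketch of a REVS-style workflow using statsmodels; it is an illustration of the idea, not the authors' implementation, and the adjusted R² support score is a stand-in for whatever support measure REVS actually uses.

    ```python
    from itertools import combinations

    import pandas as pd
    import statsmodels.api as sm


    def revs_sketch(X: pd.DataFrame, y: pd.Series, max_subset_size: int = 4):
        """Rank predictors by all-subsets support, then build a nested model series."""
        support = {col: 0.0 for col in X.columns}

        # All-subsets regression up to a size cap; each predictor accumulates the
        # adjusted R^2 of every subset model it appears in as a rough support score.
        for k in range(1, max_subset_size + 1):
            for subset in combinations(X.columns, k):
                fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
                for col in subset:
                    support[col] += max(fit.rsquared_adj, 0.0)

        ranked = sorted(support, key=support.get, reverse=True)

        # Nested series: model i contains the i most-supported predictors, so only
        # len(ranked) candidate models need post-hoc comparison (e.g. by AIC).
        series = []
        for i in range(1, len(ranked) + 1):
            fit = sm.OLS(y, sm.add_constant(X[ranked[:i]])).fit()
            series.append({"predictors": ranked[:i], "aic": fit.aic, "r2": fit.rsquared})
        return series
    ```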

  2. Large Language Models Comparison Dataset

    • kaggle.com
    zip
    Updated Feb 24, 2025
    Cite
    Samay Ashar (2025). Large Language Models Comparison Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/large-language-models-comparison-dataset
    Available download formats: zip (5894 bytes)
    Dataset updated
    Feb 24, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comparison of various Large Language Models (LLMs) based on their performance, cost, and efficiency. It includes important details like speed, latency, benchmarks, and pricing, helping users understand how different models stack up against each other.

    Key Details:

    • File Name: llm_comparison_dataset.csv
    • Size: 14.57 kB
    • Total Columns: 15
    • License: CC0 (Public Domain)

    What’s Inside?

    Here are some of the key metrics included in the dataset:

    1. Context Window: Maximum number of tokens the model can process at once.
    2. Speed (tokens/sec): How fast the model generates responses.
    3. Latency (sec): Time delay before the model responds.
    4. Benchmark Scores: Performance ratings from MMLU (academic tasks) and Chatbot Arena (real-world chatbot performance).
    5. Open-Source: Indicates if the model is publicly available or proprietary.
    6. Price per Million Tokens: The cost of using the model for one million tokens.
    7. Training Dataset Size: Amount of data used to train the model.
    8. Compute Power: Resources needed to run the model.
    9. Energy Efficiency: How much power the model consumes.

    This dataset is useful for researchers, developers, and AI enthusiasts who want to compare LLMs and choose the best one based on their needs.
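
    A minimal pandas sketch for a first look at the file; the file name comes from the Key Details above, but the column names used below are assumptions based on the metric list, so check the CSV header first.

    ```python
    import pandas as pd

    df = pd.read_csv("llm_comparison_dataset.csv")
    print(df.columns.tolist())  # confirm the actual 15 column names first

    # Rank models by an assumed price column, cheapest first.
    ranked = df.sort_values("Price per Million Tokens")  # hypothetical column name
    print(ranked.head())

    # Keep only models flagged as open-source (hypothetical column;
    # the flag may be encoded as True/False or Yes/No, so adjust as needed).
    open_models = df[df["Open-Source"] == True]
    print(len(open_models))
    ```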

    📌If you find this dataset useful, do give an upvote :)

  3. The banksia plot: a method for visually comparing point estimates and confidence intervals across datasets

    • researchdata.edu.au
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Apr 16, 2024
    Cite
    Simon Turner; Joanne McKenzie; Emily Karahalios; Elizabeth Korevaar (2024). The banksia plot: a method for visually comparing point estimates and confidence intervals across datasets [Dataset]. http://doi.org/10.26180/25286407.V2
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    Monash University
    Authors
    Simon Turner; Joanne McKenzie; Emily Karahalios; Elizabeth Korevaar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Companion data for the creation of a banksia plot:

    Background:

    In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.

    Methods:

    The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
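
    A hypothetical Python helper illustrating the centring and scaling just described (not the authors' Stata/R code): the reference interval is shifted so its point estimate sits at zero and scaled to unit width, and the matching comparator estimate and interval are adjusted by the same shift and scale.

    ```python
    def banksia_rescale(ref_est, ref_lo, ref_hi, comp_est, comp_lo, comp_hi):
        shift = ref_est
        scale = ref_hi - ref_lo  # reference CI width becomes exactly 1

        def rescale(x):
            return (x - shift) / scale

        reference = (rescale(ref_est), rescale(ref_lo), rescale(ref_hi))
        comparator = (rescale(comp_est), rescale(comp_lo), rescale(comp_hi))
        return reference, comparator


    # Example: reference estimate 2.0 with CI (1.5, 2.5); comparator 2.3 with CI (1.6, 3.0).
    print(banksia_rescale(2.0, 1.5, 2.5, 2.3, 1.6, 3.0))
    ```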

    Results:

    In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.

    Conclusions:

    The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.

    This collection of files allows the user to create the images used in the companion paper and amend the code to create their own banksia plots using either Stata version 17 or R version 4.3.1.

  4. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Available download formats: zip (9619636 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    This dataset is a collection of posts and comments made on Reddit's /r/datasets board, covering the subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, in order to preserve users' anonymity and to prevent targeted harassment.


    How to use the dataset

    In order to use this dataset, you will need to have a text editor such as Microsoft Word or LibreOffice installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.

    Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
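
    For anything beyond a quick look, loading the CSVs with pandas is usually easier than a text editor. A minimal sketch, assuming the column names listed in the Columns section below:

    ```python
    import pandas as pd

    posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")
    comments = pd.read_csv("the-reddit-dataset-dataset-comments.csv")

    # Average post score overall and by the domain of the linked resource.
    print(posts["score"].mean())
    print(posts.groupby("domain")["score"].mean().sort_values(ascending=False).head())

    # Rough look at the sentiment/score relationship suggested above
    # (the comments file carries a sentiment column per the Columns section).
    print(comments.groupby("sentiment")["score"].describe())
    ```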

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of posts and comments on Reddit's /r/datasets board

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

  5. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated Jun 11, 2023
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s003
    Available download formats: xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set's most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set, and it provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
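
    As a loose illustration of the evaluation loop described above (mask known values, impute them, then score the recovery), here is a hedged Python sketch. The paper evaluates methods such as LLS, RF, and BPCA; those are not available in scikit-learn, so KNN and iterative imputers stand in purely to show the mechanics.

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, KNNImputer

    rng = np.random.default_rng(0)
    X_true = rng.normal(size=(200, 30))      # stand-in for a quantification matrix
    mask = rng.random(X_true.shape) < 0.1    # hide 10% of known values
    X_missing = np.where(mask, np.nan, X_true)

    for name, imputer in [("KNN", KNNImputer(n_neighbors=5)),
                          ("Iterative", IterativeImputer(random_state=0))]:
        X_hat = imputer.fit_transform(X_missing)
        rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))  # recovery error
        print(name, round(rmse, 3))
    ```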

  6. Surrogate flood model comparison - Datasets and python code

    • figshare.unimelb.edu.au
    bin
    Updated Jan 19, 2024
    Cite
    Niels Fraehr (2024). Surrogate flood model comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/24312658.v1
    Available download formats: bin
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    Niels Fraehr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data used for the publication "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared on accuracy and computational speed for three distinct case studies: Carlisle (United Kingdom), the Chowilla floodplain (Australia), and the Burnett River (Australia).

    The dataset is structured in 5 files: "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". As a minimum, running the models requires the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV". We suggest using the "Carlisle" case study for initial testing given its small size and small data requirement.

    "Carlisle", "Chowilla", and "BurnettRV" files

    These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between each folder, depending on the hydrodynamic model being emulated and the input boundary conditions (input features). Each case study file has the following folders:

    • Geometry_data: DEM files, .npz files containing the high-fidelity model's grid (XYZ coordinates) and areas (the same data are available for the low-fidelity model used in the LSG model), and .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).
    • XXX_modeldata: Folder for storing trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trained GP-EOF model.
    • HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.
    • HF_EOF_analysis: Storage of data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.
    • Results_data: Storage of results from running the evaluation of the surrogate models.
    • Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.

    And Python files:

    • YYY_event_summary, YYY_Extrap_event_summary: Overview of all events, and which events are connected between the low- and high-fidelity models, for each YYY case study.
    • EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data. Used for the LSG, GP-EOF, and LSTM-EOF surrogate models.
    • Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate models for that case study and saving the results for each cross-validation fold.
    • train_test_split: Script for splitting the flood datasets for each cross-validation fold, so that all surrogate models train on the same data.
    • XXX_training: Script for training each XXX surrogate model.
    • XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training; these scripts produce it.

    "Comparison_results" file

    Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". The figures are also included.

    "Python_data" file

    Folder containing Python scripts with utility functions for setting up, training, running, and evaluating the surrogate models. This folder also contains a python_environment.yml file with all Python package versions and dependencies, as well as two sub-folders:

    • LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also used when working with the other surrogate models.
    • SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method, with small edits for speed and for use in this study.

  7. Data from: A 24-hour dynamic population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 16, 2022
    Cite
    Claudia Bergroth; Olle Järv; Henrikki Tenkanen; Matti Manninen; Tuuli Toivonen (2022). A 24-hour dynamic population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4724388
    Dataset updated
    Feb 16, 2022
    Dataset provided by
    Elisa Corporation
    Department of Built Environment, Aalto University / Centre for Advanced Spatial Analysis, University College London
    Digital Geography Lab, Department of Geosciences and Geography, University of Helsinki
    Unit of Urban Research and Statistics, City of Helsinki / Digital Geography Lab, Department of Geosciences and Geography, University of Helsinki
    Authors
    Claudia Bergroth; Olle Järv; Henrikki Tenkanen; Matti Manninen; Tuuli Toivonen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Helsinki Metropolitan Area, Finland
    Description

    Related article: Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39.

    In this dataset:

    We present temporally dynamic population distribution data from the Helsinki Metropolitan Area, Finland, at the level of 250 m by 250 m statistical grid cells. Three hourly population distribution datasets are provided for regular workdays (Mon – Thu), Saturdays and Sundays. The data are based on aggregated mobile phone data collected by the biggest mobile network operator in Finland. Mobile phone data are assigned to statistical grid cells using an advanced dasymetric interpolation method based on ancillary data about land cover, buildings and a time use survey. The data were validated by comparing population register data from Statistics Finland for night-time hours and a daytime workplace registry. The resulting 24-hour population data can be used to reveal the temporal dynamics of the city and examine population variations relevant to for instance spatial accessibility analyses, crisis management and planning.

    Please cite this dataset as:

    Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39. https://doi.org/10.1038/s41597-021-01113-4

    Organization of data

    The dataset is packaged into a single zip file, Helsinki_dynpop_matrix.zip, which contains the following files:

    HMA_Dynamic_population_24H_workdays.csv represents the dynamic population for an average workday in the study area.

    HMA_Dynamic_population_24H_sat.csv represents the dynamic population for an average Saturday in the study area.

    HMA_Dynamic_population_24H_sun.csv represents the dynamic population for an average Sunday in the study area.

    target_zones_grid250m_EPSG3067.geojson represents the statistical grid in ETRS89/ETRS-TM35FIN projection that can be used to visualize the data on a map using e.g. QGIS.

    Column names

    YKR_ID : a unique identifier for each statistical grid cell (n=13,231). The identifier is compatible with the statistical YKR grid cell data by Statistics Finland and Finnish Environment Institute.

    H0, H1 ... H23 : Each field represents the proportional distribution of the total population in the study area between grid cells during a one-hour period. In total, 24 fields are formatted as "Hx", where x stands for the hour of the day (values ranging from 0 to 23). For example, H0 stands for the first hour of the day: 00:00 - 00:59. The sum of all cell values for each field equals 100 (i.e. 100% of the total population for each one-hour period).

    In order to visualize the data on a map, the result tables can be joined with the target_zones_grid250m_EPSG3067.geojson data. The data can be joined by using the field YKR_ID as a common key between the datasets.
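
    A minimal Python sketch of that join (assuming geopandas is installed and the CSVs are comma-separated; joining in QGIS works equally well, as noted above):

    ```python
    import geopandas as gpd
    import pandas as pd

    grid = gpd.read_file("target_zones_grid250m_EPSG3067.geojson")
    workdays = pd.read_csv("HMA_Dynamic_population_24H_workdays.csv")

    # Join the hourly population shares onto the grid cells via the YKR_ID key.
    joined = grid.merge(workdays, on="YKR_ID")

    # Map the share of population present during 08:00-08:59.
    joined.plot(column="H8", legend=True)
    ```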

    License Creative Commons Attribution 4.0 International.

    Related datasets

    Järv, Olle; Tenkanen, Henrikki & Toivonen, Tuuli. (2017). Multi-temporal function-based dasymetric interpolation tool for mobile phone data. Zenodo. https://doi.org/10.5281/zenodo.252612

    Tenkanen, Henrikki, & Toivonen, Tuuli. (2019). Helsinki Region Travel Time Matrix [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3247564

  8. PC Component Prices Comparison

    • kaggle.com
    zip
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). PC Component Prices Comparison [Dataset]. https://www.kaggle.com/datasets/thedevastator/pc-component-prices-comparison
    Available download formats: zip (328042 bytes)
    Dataset updated
    Feb 11, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PC Component Prices Comparison

    Detailed Prices, Scores, and Reviews from Different Brands and Categories

    By [source]

    About this dataset

    This dataset is the perfect resource to analyze and compare prices, scores, and reviews of a wide assortment of PC components across various brands and categories. Utilize this data to make well-informed decisions regarding your purchase. Our data set includes the latest timestamps collected from industry leaders, so you can always be sure that what you’re seeing is up-to-date. Specifically this database includes every insightful detail such as company name, brand name, category information along with product numbers as well as price points both on individual products as well as averages by brand and category for easier comparison shopping. Furthermore users are able to see in depth scores for each product combined with image urls for easy research and above all reviews from other users who have used the same product before you decide on anything making it easy to decide if the price tag or quality of the product is worth it or not. Let our comprehensive comparison save you time and hassle when buying your next PC component!


    How to use the dataset

    This dataset provides a comprehensive comparison of prices, scores, reviews, product numbers and more from a variety of different PC components from different brands and categories. In order to use this dataset effectively and gain the best insights, it is important to be aware of the following points:

    1. Categories: The data can be filtered by category, so if you are looking for a particular type of component, such as a graphics card or processor, you can isolate it easily. To do so, sort the data by the category column (in ascending order) and all items in the same category will appear together in a consecutive sequence.

    2. Brands: To get an overview of the brands within a particular category, check the brand_name column; you will see the different brands listed under each category along with their corresponding prices, scores, and other fields. You can also sort them in ascending or descending order based on price, score, or any other column described in the Columns section.

    3. Filtering your results: Sort your results by factors such as price, reviews, or score. For example, if a low price matters most, sort the price column in ascending order; if quality matters most, sort the score column in descending order; and if reviews matter most, sort the reviews column in descending order.

    4. Searching for a desired component: Where applicable, text search is also available; it filters any matching pattern within the product_number field, so include appropriate terms before running searches. Excluding dates from your search terms may give better results (see the sketch after this list).
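
    The sketch referenced above: a minimal pandas version of the same sorting, filtering, and searching steps. The file name comes from the Columns section; the price column and the exact category label are assumptions, so check the actual header first.

    ```python
    import pandas as pd

    df = pd.read_csv("ComponentesPC_Scraper_DataSet.csv")

    # 1-2. Group rows by category and brand to browse components side by side.
    by_category_and_brand = df.sort_values(["category", "brand_name"])

    # 3. Cheapest items first within one category (price column and category
    # label are hypothetical; inspect df.columns and df["category"].unique()).
    gpus = df[df["category"] == "graphics card"]
    print(gpus.sort_values("price").head())

    # 4. Text search within the product_number field.
    print(df[df["product_number"].str.contains("RTX", case=False, na=False)].head())
    ```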

    Research Ideas

    • Use this dataset to compare PC components across various brands, categories and prices to create personalized PC builds with the most cost-efficient components.
    • Compare the scores and reviews of different products to identify the best value for money options across categories, brands, etc.
    • Analyze the dataset to find trends in terms of which companies provide better performance at lower prices or have a higher proportion of positive reviews over time

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: ComponentesPC_Scraper_DataSet.csv

    | Column name  | Description                                                  |
    |:-------------|:-------------------------------------------------------------|
    | timestamp    | Date and time of the data entry. (DateTime)                   |
    | company_name | Name of the company that manufactures the product. (String)   |
    | brand_name   | Brand name of the product. (String)                           |
    | category     | Category of the p...                                          |

  9. Benchmark Multi-Omics Datasets for Methods Comparison

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 14, 2021
    Cite
    Gabriel Odom; Lily Wang (2021). Benchmark Multi-Omics Datasets for Methods Comparison [Dataset]. http://doi.org/10.5281/zenodo.5683002
    Available download formats: bin, zip
    Dataset updated
    Nov 14, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel Odom; Lily Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pathway Multi-Omics Simulated Data

    These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".

    There are 100 sets (stored as 100 sub-folders, the first 50 in "pt1" and the second 50 in "pt2") of random modifications to centred and scaled copy number, gene expression, and proteomics data, saved as compressed data files for the R programming language. These data sets are stored in subfolders labelled "sim001", "sim002", ..., "sim100". Each folder contains the following contents:

    1. "indicatorMatricesXXX_ls.RDS": a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment (where XXX is the simulation run label: 001, 002, ...).
    2. "CNV_partitionA_deltaB.RDS": the synthetically modified copy number variation data (where A represents the proportion of genes in each gene set to receive the synthetic treatment [partition 1 is 20%, 2 is 40%, 3 is 60%, and 4 is 80%] and B is the signal strength in units of standard deviations).
    3. "RNAseq_partitionA_deltaB.RDS": the synthetically modified gene expression data (same parameter legend as CNV).
    4. "Prot_partitionA_deltaB.RDS": the synthetically modified protein expression data (same parameter legend as CNV).

    Supplemental Files

    The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study in Gene Matrix Transpose format. Scripts to create and analyze these data sets available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement

  10. NBA Players Performance

    • kaggle.com
    zip
    Updated Dec 9, 2022
    Cite
    The Devastator (2022). NBA Players Performance [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-secrets-of-nba-player-performance/code
    Available download formats: zip (39775 bytes)
    Dataset updated
    Dec 9, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NBA Players Performance

    Players Performance & Statistics

    By [source]

    About this dataset

    This dataset contains comprehensive performance data of National Basketball Association (NBA) players during the 2019-20 season. It includes the performance metrics crucial to assessing a player's quality of play. Here, you can compare players across teams, positions and categories and gain deeper insight into their overall performance. The dataset includes useful statistics such as GP (Games Played), Player name, Position, Assists Turnovers Ratio, Blocks per Game, Fouls per Minutes Played, Rebounds per Game and more. Dive into this detailed overview of NBA player performance and take your understanding of athletes within the organization to another level!


    How to use the dataset

    This dataset provides an in-depth look into the performance of NBA Players throughout the 2019-20 season, allowing an informed analysis of various important statistics. There are a number of ways to use this dataset to both observe and compare players, teams and positions.

    • By looking at the data you can get an idea of how players are performing across all metrics. The "Points Per Game" metric is particularly useful, as it allows quick comparison of offensive ability between different players and teams. Additionally, exploratory analysis can be conducted on metrics like rebounds or assists per game, which allows interesting observations about the game itself, such as ball movement being a significant factor in team success.

    • This dataset also enables comparison between players from different positions on particular metrics, whether position-oriented or generic across all positions, such as points per game (ppg). This includes adjusting for positional skill sets; for example, a guard's field goal attempts might include more three-point shots, which benefit them more than larger forwards or centres who rely more heavily on close-range shots due to their size advantage over their opponents.

    • This dataset also allows simple visualisation of player performance relative to other players; for example, one can plot points scored against assist ratio when comparing multiple point guards, providing insight into individual performances on metrics that cannot be analysed quickly with traditional statistical analysis restricted to similarly situated groups (e.g. the same position). Furthermore, this dataset could aid research in emerging areas such as targeted marketing analytics, where potential customers are identified from publicly available data on factors like ppg that may strongly affect team success (see the sketch after this list).
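
    The sketch referenced above: a minimal pandas example, assuming the GP/Player/Position columns listed in the Columns section and a hypothetical blocks-per-game column.

    ```python
    import pandas as pd

    at = pd.read_csv("assists-turnovers.csv")
    blocks = pd.read_csv("blocks.csv")

    # Combine the per-player tables on the shared Player/Position keys.
    merged = at.merge(blocks, on=["Player", "Position"], suffixes=("_at", "_blk"))

    # Example comparison: guards only, sorted by an assumed blocks-per-game column.
    guards = merged[merged["Position"].str.contains("G", na=False)]
    print(guards.sort_values("Blocks Per Game", ascending=False).head(10))  # hypothetical column
    ```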

    Research Ideas

    • Develop an AI-powered recommendation system that can suggest optimal players to fill out a team based on their performances in the past season.
    • Examine trends in player performance across teams and positions, allowing coaches and scouts to make informed decisions when evaluating talent.
    • Create a web or mobile app that can compare the performances of multiple players, allowing users to explore different performance metrics head-to-head

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: assists-turnovers.csv

    | Column name | Description                        |
    |:------------|:-----------------------------------|
    | GP          | Number of games played. (Integer)  |
    | Player      | Player name. (String)              |
    | Position    | Player position. (String)          |

    File: blocks.csv

    | Column name | Description                        |
    |:------------|:-----------------------------------|
    | GP          | Number of games played. (Integer)  |
    | Player      | Player name. (String)              |
    | Position    | Player position. (String)          |

    File: fouls-minutes.csv | Column name | Description | |:--------------|:----------------------...

  11. HTMLmetadata HTML formatted text files describing samples and spectra, including photos

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Oct 22, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). HTMLmetadata HTML formatted text files describing samples and spectra, including photos [Dataset]. https://catalog.data.gov/dataset/htmlmetadata-html-formatted-text-files-describing-samples-and-spectra-including-photos
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    HTMLmetadata: Text files in HTML format containing metadata about samples and spectra. Also included in the zip file are folders containing information linked to from the HTML files, including:

    • README: contains an HTML version of the USGS Data Series publication, linked to this data release, that describes this spectral library (Kokaly and others, 2017). The folder also contains an HTML version of the release notes.
    • photo_images: contains full resolution images of photos of samples and field sites.
    • photo_thumbs: contains low-resolution thumbnail versions of photos of samples and field sites.

    GENERAL LIBRARY DESCRIPTION

    This data release provides the U.S. Geological Survey (USGS) Spectral Library Version 7 and all related documents. The library contains spectra measured with laboratory, field, and airborne spectrometers. The instruments used cover wavelengths from the ultraviolet to the far infrared (0.2 to 200 microns). Laboratory samples of specific minerals, plants, chemical compounds, and man-made materials were measured. In many cases, samples were purified, so that unique spectral features of a material can be related to its chemical structure. These spectro-chemical links are important for interpreting remotely sensed data collected in the field or from an aircraft or spacecraft. This library also contains physically-constructed as well as mathematically-computed mixtures. Measurements of rocks, soils, and natural mixtures of minerals have also been made with laboratory and field spectrometers. Spectra of plant components and vegetation plots, comprising many plant types and species with varying backgrounds, are also in this library. Measurements by airborne spectrometers are included for forested vegetation plots, in which the trees are too tall for measurement by a field spectrometer. The related U.S. Geological Survey Data Series publication, "USGS Spectral Library Version 7", describes the instruments used, metadata descriptions of spectra and samples, and possible artifacts in the spectral measurements (Kokaly and others, 2017).

    Four different spectrometer types were used to measure spectra in the library: (1) Beckman™ 5270 covering the spectral range 0.2 to 3 µm, (2) standard, high resolution (hi-res), and high-resolution Next Generation (hi-resNG) models of ASD field portable spectrometers covering the range from 0.35 to 2.5 µm, (3) Nicolet™ Fourier Transform Infra-Red (FTIR) interferometer spectrometers covering the range from about 1.12 to 216 µm, and (4) the NASA Airborne Visible/Infra-Red Imaging Spectrometer AVIRIS, covering the range 0.37 to 2.5 µm.

    Two fundamental spectrometer characteristics significant for interpreting and utilizing spectral measurements are sampling position (the wavelength position of each spectrometer channel) and bandpass (a parameter describing the wavelength interval over which each channel in a spectrometer is sensitive). Bandpass is typically reported as the Full Width at Half Maximum (FWHM) response at each channel (in wavelength units, for example nm or micron). The linked publication (Kokaly and others, 2017) includes a comparison plot of the various spectrometers used to measure the data in this release. Data for the sampling positions and the bandpass values (for each channel in the spectrometers) are included in this data release. These data are in the SPECPR files, as separate data records, and in the American Standard Code for Information Interchange (ASCII) text files, as separate files for wavelength and bandpass.
    Spectra are provided in files of ASCII text format (files with a .txt file extension). In the ASCII files, deleted channels (bad bands) are indicated by a value of -1.23e34. Metadata descriptions of samples, field areas, spectral measurements, and results from supporting material analyses – such as XRD – are provided in HyperText Markup Language (HTML) formatted ASCII text files (files with a .html file extension). In addition, Graphics Interchange Format (GIF) images of plots of spectra are provided. For each spectrum a plot with wavelength in microns on the x-axis is provided. For spectra measured on the Nicolet spectrometer, an additional GIF image with wavenumber on the x-axis is provided. Data are also provided in SPECtrum Processing Routines (SPECPR) format (Clark, 1993), which packages spectra and associated metadata descriptions into a single file (see the linked publication, Kokaly and others, 2017, for additional details on the SPECPR format and freely-available software that can be used to read files in SPECPR format).

    The data measured on the source spectrometers are denoted by the "splib07a" tag in filenames. In addition to providing the original measurements, the spectra have been convolved and resampled to different spectrometer and multispectral sensor characteristics. The following list specifies the identifying tag for the measured and convolved libraries and gives brief descriptions of the sensors.

    splib07a – this is the name of the SPECPR file containing the spectra measured on the Beckman, ASD, Nicolet and AVIRIS spectrometers. The data are provided with their original sampling positions (wavelengths) and bandpass values. The prefix "splib07a_" is at the beginning of the ASCII and GIF files pertaining to the measured spectra.

    splib07b – this is the name of the SPECPR file containing a modified version of the original measurements. The results from using spectral convolution to convert measurements to other spectrometer characteristics can be improved by oversampling (increasing sample density). Thus, splib07b is an oversampled version of the library, computed using simple cubic-spline interpolation to produce spectra with a fine sampling interval (and therefore a higher number of channels) for Beckman and AVIRIS measurements. The spectra in this version of the library are the data used to create the convolved and resampled versions of the library. The prefix "splib07b_" is at the beginning of the ASCII and GIF files pertaining to the oversampled spectra.

    s07_ASD – this is the name of the SPECPR file containing the spectral library measurements convolved to standard resolution ASD full range spectrometer characteristics. The standard reported wavelengths of the ASD spectrometers used by the USGS were used (2151 channels with wavelength positions starting at 350 nm and increasing in 1 nm increments). The bandpass values of each channel were determined by comparing measurements of reference materials made on ASD spectrometers in comparison to measurements made of the same materials on higher resolution spectrometers (the procedure is described in Kokaly, 2011, and discussed in Kokaly and Skidmore, 2015, and Kokaly and others, 2017). The prefix "s07ASD_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.
    s07_AV95 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1995 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV95_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV96 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1996 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV96_" is at the beginning of the ASCII and GIF files.

    s07_AV97 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1997 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV97_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV98 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1998 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV98_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV99 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1999 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV99_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV00 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 2000 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV00_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV01 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 2001 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV01_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV05 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 2005 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix "s07_AV05_" is at the beginning of the ASCII and GIF files pertaining to this spectrometer.

    s07_AV06 – this is the name of the SPECPR file containing the spectral library measurements convolved to

  12. Labelled evaluation datasets of AIS Trajectories from Danish Waters for Abnormal Behavior Detection

    • data.dtu.dk
    bin
    Updated Jul 12, 2023
    + more versions
    Cite
    Kristoffer Vinther Olesen; Line Katrine Harder Clemmensen; Anders Nymark Christensen (2023). Labelled evaluation datasets of AIS Trajectories from Danish Waters for Abnormal Behavior Detection [Dataset]. http://doi.org/10.11583/DTU.21511815.v1
    Available download formats: bin
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Kristoffer Vinther Olesen; Line Katrine Harder Clemmensen; Anders Nymark Christensen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This item is part of the collection "AIS Trajectories from Danish Waters for Abnormal Behavior Detection"

    DOI: https://doi.org/10.11583/DTU.c.6287841

    Using deep learning for detection of maritime abnormal behaviour in spatio-temporal trajectories is a relatively new and promising application. Open access to the Automatic Identification System (AIS) has made large amounts of maritime trajectories publicly available. However, these trajectories are unannotated when it comes to the detection of abnormal behaviour. The lack of annotated datasets for abnormality detection on maritime trajectories makes it difficult to evaluate and compare suggested models quantitatively. With this dataset, we attempt to provide a way for researchers to evaluate and compare performance.
    We have manually labelled trajectories which showcase abnormal behaviour following a collision accident. The annotated dataset consists of 521 data points with 25 abnormal trajectories. The abnormal trajectories cover, among others: colliding vessels, vessels engaged in Search-and-Rescue activities, law enforcement, and commercial maritime traffic forced to deviate from its normal course.

    These datasets consist of labelled trajectories for the purpose of evaluating unsupervised models for the detection of abnormal maritime behavior. For unlabelled training datasets, please refer to the collection (link in Related publications).

    The dataset is an example of a SAR event and cannot be considered representative of the broader population of SAR events.

    The dataset consists of a total of 521 trajectories, of which 25 are labelled as abnormal. The data were captured on a single day in a specific region. The remaining normal traffic is representative of traffic during the winter season. The normal traffic in the ROI has fairly high seasonality related to fishing and leisure sailing traffic.

    The data is saved using the pickle format for Python. Each dataset is split into 2 files with naming convention:

    datasetInfo_XXX
    data_XXX

    Files named "data_XXX" contains the extracted trajectories serialized sequentially one at a time and must be read as such. Please refer to provided utility functions for examples. Files named "datasetInfo" contains Metadata related to the dataset and indecies at which trajectories begin in "data_XXX" files.

    The data are sequences of maritime trajectories defined by their; timestamp, latitude/longitude position, speed, course, and unique ship identifer MMSI. In addition, the dataset contains metadata related to creation parameters. The dataset has been limited to a specific time period, ship types, moving AIS navigational statuses, and filtered within an region of interest (ROI). Trajectories were split if exceeding an upper limit and short trajectories were discarded. All values are given as metadata in the dataset and used in the naming syntax.

    Naming syntax: data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl

    See datasheet for more detailed information and we refer to provided utility functions for examples on how to read and plot the data.
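
    In the absence of those utility functions, a hedged sketch of what sequential reading can look like in Python; the file names below are placeholders following the naming syntax above, and the structure of the datasetInfo object is not documented here, so inspect it before relying on it:

    ```python
    import pickle

    # Hypothetical file names; real files follow the naming syntax given above.
    with open("datasetInfo_AIS_Custom_example.pkl", "rb") as f:
        info = pickle.load(f)          # metadata and trajectory start indices
    print(type(info))

    trajectories = []
    with open("data_AIS_Custom_example.pkl", "rb") as f:
        while True:
            try:
                trajectories.append(pickle.load(f))  # one trajectory per load call
            except EOFError:
                break
    print(len(trajectories))
    ```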

  13. Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
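
    A hedged illustration of the kind of fidelity checks described in the Methods (a two-sample t-test for a continuous parameter and a two-sample proportion test for a binary one), using synthetic stand-in numbers rather than the actual VitalDB or GPT-4o outputs:

    ```python
    import numpy as np
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest

    rng = np.random.default_rng(0)
    real_age = rng.normal(58, 12, size=500)      # stand-in continuous parameter
    synth_age = rng.normal(57.5, 12.5, size=500)
    t, p = stats.ttest_ind(real_age, synth_age, equal_var=False)
    print("t-test p-value:", round(p, 3))

    # Proportion test for a binary parameter (counts here are made up).
    real_pos, synth_pos = 210, 198
    z, p = proportions_ztest([real_pos, synth_pos], [500, 500])
    print("proportion test p-value:", round(p, 3))
    ```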

  14. Data from: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

    • dataverse.harvard.edu
    • opendatalab.com
    • +1more
    Updated Feb 7, 2023
    + more versions
    Cite
    Philipp Tschandl (2023). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions [Dataset]. http://doi.org/10.7910/DVN/DBW86T
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Philipp Tschandl
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.7910/DVN/DBW86T

    Description

    Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the remaining cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file. Due to upload size limitations, images are stored in two files: HAM10000_images_part1.zip (5000 JPEG files) and HAM10000_images_part2.zip (5015 JPEG files).

    Additional data for evaluation purposes: The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3), with the same sources contributing the majority of the validation and test set as well. The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the ground truth, in the same format as the HAM10000 data (public since 2023), is available as ISIC2018_Task3_Test_GroundTruth.csv. The ISIC Archive also provides the challenge images and metadata (training, validation, test) at their "ISIC Challenge Datasets" page.

    Comparison to physicians: Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019.

    Human-computer collaboration: The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020. The following corresponding metadata is available herein:

    ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (malignancy probability, multi-class probability, CBIR) or human-crowd multi-class probabilities. This data was collected for and analyzed in Tschandl P. et al., Nature Medicine 2020, therefore please refer to this publication when using the data. Some details on the abbreviated column headings:
    • image_id: the ISIC image_id of an image at the time of the study. There should be no duplications in the combination image_id & interaction_modality. As not every image was shown with every interaction modality, not every combination is present.
    • prob_m_dx_akiec, ...: m is "machine probabilities". Values are values after softmax, and "_mal" is all malignant classes summed.
    • prob_h_dx_akiec, ...: h is "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note there is no "prob_h_mal" as this was none of the tested interaction modalities.
    • user_dx_without_interaction_akiec, ...: number of participants choosing this diagnosis without interaction.
    • user_dx_with_interaction_akiec, ...: number of participants choosing this diagnosis with interaction.

    HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected, or replaced via the free-hand selection tool in FIJI.
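
    For orientation, a minimal pandas sketch of working with the metadata described above; the column names lesion_id, image_id, and dx follow the description, while the exact file path may differ in your download.

    import pandas as pd

    meta = pd.read_csv("HAM10000_metadata.csv")   # path is illustrative

    # Distribution over the seven diagnostic categories (dx).
    print(meta["dx"].value_counts())

    # Lesions can have multiple images; lesion_id lets you keep all images
    # of one lesion in the same split when building train/validation sets.
    images_per_lesion = meta.groupby("lesion_id")["image_id"].nunique()
    print((images_per_lesion > 1).sum(), "lesions have more than one image")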

  15. Helsinki Tomography Challenge 2022 (HTC2022) open tomographic dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 25, 2023
    Cite
    Alexander Meaney; Alexander Meaney; Fernando Silva de Moura; Fernando Silva de Moura; Markus Juvonen; Markus Juvonen; Samuli Siltanen; Samuli Siltanen (2023). Helsinki Tomography Challenge 2022 (HTC2022) open tomographic dataset [Dataset]. http://doi.org/10.5281/zenodo.8041800
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Alexander Meaney; Alexander Meaney; Fernando Silva de Moura; Fernando Silva de Moura; Markus Juvonen; Markus Juvonen; Samuli Siltanen; Samuli Siltanen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Helsinki
    Description

    This dataset was primarily designed for the Helsinki Tomography Challenge 2022 (HTC2022), but it can be used for generic algorithm research and development in 2D CT reconstruction.

    The dataset contains 2D tomographic measurements, i.e., sinograms and the affiliated metadata containing measurement geometry and other specifications. The sinograms have already been pre-processed with background and flat-field corrections, and compensated for a slightly misaligned center of rotation in the cone-beam computed tomography scanner. The log-transforms from intensity measurements to attenuation data have also been already computed. The data has been stored as MATLAB structs and saved in .mat file format.

    The purpose of HTC2022 was to develop algorithms for limited angle tomography. The challenge data consists of tomographic measurements of two sets of plastic phantoms with a diameter of 7 cm and with holes of differing shapes cut into them. The first set is the teaching data, containing five training phantoms. The second set consists of 21 test phantoms used in the challenge to test algorithm performance. The test phantom data was released after the competition period ended.

    The training phantoms were designed to facilitate algorithm development and benchmarking for the challenge itself. Four of the training phantoms contain holes. These are labeled ta, tb, tc, and td. A fifth training phantom is a solid disc with no holes. We encourage subsampling these datasets to create limited data sinograms and comparing the reconstruction results to the ground truth obtainable from the full-data sinograms. Note that the phantoms are not all identically centered.
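
    As a rough illustration of the suggested subsampling, here is a Python sketch; the official toolboxes are MATLAB-based, and the struct field names used below ("CtDataFull", "sinogram", "parameters", "angles") are assumptions, so check the metadata specification shipped with the data for the actual layout.

    import numpy as np
    from scipy.io import loadmat

    # Field names here are assumptions, not the documented struct layout.
    data = loadmat("htc2022_ta_full.mat", simplify_cells=True)
    sinogram = np.asarray(data["CtDataFull"]["sinogram"])            # one row per projection
    angles = np.asarray(data["CtDataFull"]["parameters"]["angles"])  # degrees

    # Keep a 90-degree arc starting at an arbitrary offset (difficulty level 1 style).
    start, arc = 30.0, 90.0
    keep = (angles >= start) & (angles < start + arc)
    limited_sinogram, limited_angles = sinogram[keep, :], angles[keep]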

    The teaching data includes the following files for each phantom:

    • The sinogram and all associated metadata (.MAT).
    • A pre-computed FBP reconstruction of the phantom (.MAT and .PNG).
    • A segmentation of the FBP reconstruction created with the procedure described below (.MAT and .PNG).

    Also included in the teaching dataset is a MATLAB example script for how to work with the CT data.

    The challenge test data is arranged into seven different difficulty levels, labeled 1-7, with each level containing three different phantoms, labeled A-C. As the difficulty level increases, the number of holes increases and their shapes become increasingly complex. Furthermore, the view angle is reduced as the difficulty level increases, starting with a 90-degree field of view at level 1 and reducing by 10 degrees at each increasing level of difficulty. The view angles in the challenge data do not all begin from 0 degrees.

    The test data includes the following files for each phantom:

    • The full sinogram and all associated metadata (.MAT).
    • The limited angle sinogram and all associated metadata, used to test the algorithms submitted to the challenge (.MAT).
    • A pre-computed FBP reconstruction of the phantom using the full data (.MAT and .PNG).
    • A pre-computed FBP reconstruction of the phantom using the limited angle data. These are of poor quality, and serve mainly as a demonstration of how FBP fails with limited angle data (.MAT and .PNG).
    • A segmentation of the FBP reconstruction using the full data, created with the procedure described below. This was used as the ground truth reference in the challenge (.MAT and .PNG).
    • A segmentation of the FBP reconstruction using the limited angle data, created with the procedure described below. These are of poor quality, and serve mainly as a demonstration of how FBP fails with limited angle data (.MAT and .PNG).
    • A photograph of the phantom, rotated and resized to match the ground truth segmentation (.PNG).

    Also included in the test dataset is a collage in .PNG format, showing all the ground truth segmentation images and the photographs of the phantoms together.

    As the orientation of CT reconstructions can depend on the tools used, we have included the example reconstructions for each of the phantoms to demonstrate how the reconstructions obtained from the sinograms and the specified geometry should be oriented. The reconstructions have been computed using the filtered back-projection algorithm (FBP) provided by the ASTRA Toolbox.

    We have also included segmentation examples of the reconstructions to demonstrate the desired format for the final competition entries. The segmentation images were obtained by the following steps:
    1) Set all negative pixel values in the reconstruction to zero.
    2) Determine a threshold level using Otsu's method.
    3) Globally threshold the image using the threshold level.
    4) Perform a morphological closing on the image using a disc with a radius of 3 pixels.

    The competitors were not obliged to follow the above procedure, and were encouraged to explore various segmentation techniques for the limited angle reconstructions.
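
    A minimal Python sketch of the four-step recipe above, using scikit-image; the original masks were produced with other tooling, so this is an illustrative translation rather than the organizers' code.

    import numpy as np
    from skimage.filters import threshold_otsu
    from skimage.morphology import binary_closing, disk

    def segment_reconstruction(recon):
        img = np.clip(recon, 0, None)          # 1) set negative pixel values to zero
        level = threshold_otsu(img)            # 2) threshold level via Otsu's method
        mask = img > level                     # 3) global thresholding
        return binary_closing(mask, disk(3))   # 4) closing with a radius-3 disc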

    For getting started with the data, we recommend the following MATLAB toolboxes:

    HelTomo - Helsinki Tomography Toolbox
    https://github.com/Diagonalizable/HelTomo/

    The ASTRA Toolbox
    https://www.astra-toolbox.com/

    Spot – A Linear-Operator Toolbox
    https://www.cs.ubc.ca/labs/scl/spot/

    Using the above toolboxes for the Challenge was by no means compulsory: the metadata for each dataset contains a full specification of the measurement geometry, and the competitors were free to use any and all computational tools they wanted when computing the reconstructions and segmentations.

    All measurements were conducted at the Industrial Mathematics Computed Tomography Laboratory at the University of Helsinki.

  16. Predictive Maintenance of Machines

    • kaggle.com
    Updated Feb 28, 2024
    Cite
    RohithNair (2024). Predictive Maintenance of Machines [Dataset]. https://www.kaggle.com/datasets/nair26/predictive-maintenance-of-machines
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Kaggle
    Authors
    RohithNair
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    This dataset provides information about vibration levels, torque, process temperature, and faults.

    The dataset is a spreadsheet containing information about engine performance, with the following variables:

    • UDI: likely a unique identifier for each engine.
    • Product ID: a specific code or identifier for the engine model.
    • Type: the type of engine, possibly categorized by fuel type (e.g., M - motor, L - liquid).
    • Air temperature (K): the air temperature around the engine, in Kelvin.
    • Process temperature [K]: the internal temperature of the engine during operation, in Kelvin.
    • Speed (rpm): the rotational speed of the engine in revolutions per minute.
    • Torque (Nm): the twisting force exerted by the engine, in Newton meters.
    • Vibration Levels: a measure of the engine's vibration intensity.
    • Operational Hours: the total number of hours the engine has been operational.
    • Failure Type: the type of failure the engine experienced, if any.
    • Rotational: possibly a specific type of failure related to the engine's rotation.

    This dataset could be used for various analytical purposes related to engine performance and maintenance, for example:

    • Identifying patterns of engine failure: correlations between specific variables (e.g., air temperature, operational hours) and engine failures could help predict potential failures and schedule preventative maintenance.
    • Optimizing engine performance: identifying the operating conditions (e.g., temperature, speed) that lead to optimal performance could help improve fuel efficiency and engine lifespan.
    • Comparing engine types: the data could be used to compare the performance and efficiency of different engine types under various operating conditions.
    • Building predictive models: the data could be used to train machine learning models to predict engine failures, optimize maintenance schedules, and improve overall engine performance (a minimal sketch follows after this list).

    The specific value of this dataset depends on the context and the intended use case; for example, if you are only interested in a specific type of engine or a particular type of failure, you may need to filter or subset the data accordingly.
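
    A minimal scikit-learn sketch of the last use case; the file name and column names are taken from the description above and may not match the actual CSV headers, so treat them as placeholders.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("predictive_maintenance.csv")   # placeholder file name
    features = ["Air temperature (K)", "Process temperature [K]", "Speed (rpm)",
                "Torque (Nm)", "Vibration Levels", "Operational Hours"]
    X, y = df[features], df["Failure Type"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))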

  17. MCS3

    • datacatalogue.ukdataservice.ac.uk
    Updated May 17, 2024
    + more versions
    Cite
    University of London, Institute of Education, Centre for Longitudinal Studies (2024). MCS3 [Dataset]. http://doi.org/10.5255/UKDA-SN-8240-1
    Explore at:
    Dataset updated
    May 17, 2024
    Dataset provided by
    UK Data Servicehttps://ukdataservice.ac.uk/
    Authors
    University of London, Institute of Education, Centre for Longitudinal Studies
    Time period covered
    Jan 1, 2006
    Description

    Background:
    The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:

    • to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
    • to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
    • to collect information on previously neglected topics, such as fathers' involvement in children's care and development
    • to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
    • to emphasise intergenerational links including those back to the parents' own childhood
    • to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available

    Additional objectives subsequently included for MCS were:

    • to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
    • to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England

    Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.

    The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.

    The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

    Safeguarded versions of MCS studies:
    The Safeguarded versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

    Polygenic Indices
    Polygenic indices are available under Special Licence SN 9437. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.

    Sub-sample studies:
    Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

    Release of Sweeps 1 to 4 to Long Format (Summer 2020)
    To support longitudinal research and make it easier to compare data from different time points, data from all sweeps is now provided in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which have been converted from the old wide format to a new long format to match the data from sweeps 5 and 6 (age 11 and 14 sweeps). The old wide-format datasets contained one row per family, with multiple variables for different respondents. The new long-format datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.
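
    To illustrate the reshape, a small pandas example of moving from one row per family to one row per respondent; the variable names here are hypothetical, not actual MCS variable names.

    import pandas as pd

    # Hypothetical wide extract: one row per family, one column per respondent.
    wide = pd.DataFrame({"mcsid": ["F001", "F002"],
                         "age_parent1": [29, 34],
                         "age_parent2": [31, None]})

    # Long format: one row per respondent per family.
    long = wide.melt(id_vars="mcsid", var_name="respondent", value_name="age")
    long["respondent"] = long["respondent"].str.replace("age_", "", regex=False)
    print(long.dropna())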

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

    Secure Access datasets:
    Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard Safeguarded Licence or Special Licence (see 'Access data' tab above).

    Secure Access versions of the MCS include:

    • detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
    • detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
    • linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
    • linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
    • linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
    • linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
    • linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714;
    • Banded Distances to English Grammar Schools for MCS5 held under SN 8394
    • linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
    • linked Hospital of Birth data held under SN 5724.

    The linked education administrative datasets held under SNs 8481,7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application.

    Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).

    The Millennium Cohort Study: Sweep 3 Banded Distances to Current, First, Second, and Third Choice Schools study provides banded distances to the current, first, second, and third choice school of MCS cohort members at sweep 3 (2006). The cohort members would therefore be aged between four and six years old, and have entered the primary school education system.

  18. Estimating Confidence Intervals for 2020 Census Statistics Using Approximate...

    • registry.opendata.aws
    Updated Aug 5, 2024
    + more versions
    Cite
    United States Census Bureau (2024). Estimating Confidence Intervals for 2020 Census Statistics Using Approximate Monte Carlo Simulation (2010 Census Proof of Concept) [Dataset]. https://registry.opendata.aws/census-2010-amc-mdf-replicates/
    Explore at:
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    United States Census Bureauhttp://census.gov/
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Approximate Monte Carlo (AMC) method seed Privacy Protected Microdata File (PPMF0) and PPMF replicates (PPMF1, PPMF2, ..., PPMF25) are a set of microdata files intended for use in estimating the magnitude of error(s) introduced by the 2020 Decennial Census Disclosure Avoidance System (DAS) into the Redistricting and DHC products. The PPMF0 was created by executing the 2020 DAS TopDown Algorithm (TDA) using the confidential 2010 Census Edited File (CEF) as the initial input; the replicates were then created by executing the 2020 DAS TDA repeatedly with the PPMF0 as its initial input. Inspired by analogy to the use of bootstrap methods in non-private contexts, U.S. Census Bureau (USCB) researchers explored whether simple calculations based on comparing each PPMFi to the PPMF0 could be used to reliably estimate the scale of errors introduced by the 2020 DAS, and generally found this approach worked well.

    The PPMF0 and PPMFi files contained here are provided so that external researchers can estimate properties of DAS-introduced error without privileged access to internal USCB-curated data sets; further information on the estimation methodology can be found in Ashmead et al. 2024.
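
    Purely as a sketch of the general "compare each replicate to the seed" idea described above, one could estimate the error scale of a statistic as below; this is an assumption about the flavor of the calculation, not the Census Bureau's published estimator (see Ashmead et al. 2024 for that).

    import numpy as np

    def replicate_error_scale(stat_fn, ppmf0, replicates):
        # stat_fn computes one statistic (e.g. a block-level count) from a microdata file;
        # the error scale is taken as the root mean squared replicate-minus-seed difference.
        base = stat_fn(ppmf0)
        diffs = np.array([stat_fn(r) - base for r in replicates])
        return np.sqrt(np.mean(diffs ** 2))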

    The 2010 DHC AMC seed PPMF0 and PPMF replicates have been cleared for public dissemination by the USCB Disclosure Review Board (CBDRB-FY24-DSEP-0002). The 2010 PPMF0 included in these files was produced using the same parameters and settings as were used to produce the 2010 Demonstration Data Product Suite (2023-04-03) PPMF, but represents an independent execution of the TopDown Algorithm. The PPMF0 and PPMF replicates contain all Person and Units attributes necessary to produce the Redistricting and DHC publications for both the United States and Puerto Rico, and include geographic detail down to the Census Block level. They do not include attributes specific to either the Detailed DHC-A or Detailed DHC-B products; in particular, data on Major Race (e.g., White Alone) is included, but data on Detailed Race (e.g., Cambodian) is not included in the PPMF0 and replicates.

    The 2020 AMC replicate files for estimating confidence intervals for the official 2020 Census statistics are available.

  19. DSS - People Served by Town and Type of Assistance (TOA) by Month - CY...

    • catalog.data.gov
    • data.ct.gov
    • +1more
    Updated Nov 15, 2025
    Cite
    data.ct.gov (2025). DSS - People Served by Town and Type of Assistance (TOA) by Month - CY 2023-2025 [Dataset]. https://catalog.data.gov/dataset/dss-people-served-by-town-and-type-of-assistance-toa-by-month-cy-2023
    Explore at:
    Dataset updated
    Nov 15, 2025
    Dataset provided by
    data.ct.gov
    Description

    In order to facilitate public review and access, enrollment data published on the Open Data Portal is provided as promptly as possible after the end of each month or year, as applicable to the data set. Due to eligibility policies and operational processes, enrollment can vary slightly after publication. Please be aware of the point-in-time nature of the published data when comparing it to other data published or shared by the Department of Social Services, as this data may vary slightly. As a general practice, for monthly data sets published on the Open Data Portal, DSS will continue to refresh the monthly enrollment data for three months, after which time it will remain static. For example, when March data is published, the data for January and February will be refreshed; when April data is published, February and March data will be refreshed, but January will not change. This allows the Department to account for the most common enrollment variations in published data while also ensuring that data remains as stable as possible over time. In the event of a significant change in enrollment data, the Department may republish reports and will notate such republication dates and reasons accordingly.

    In March 2020, Connecticut opted to add a new Medicaid coverage group: the COVID-19 Testing Coverage for the Uninsured. Enrollment data on this limited-benefit Medicaid coverage group is being incorporated into Medicaid data effective January 1, 2021. Enrollment data for this coverage group prior to January 1, 2021, was listed under State Funded Medical. Effective January 1, 2021, this coverage group has been separated: (1) the COVID-19 Testing Coverage for the Uninsured is now G06-I and is listed as a limited benefit plan that rolls up into a "Program Name" of Medicaid and a "Medical Benefit Plan" of HUSKY Limited Benefit; (2) the emergency medical coverage has been separated into G06-II as a limited benefit plan that rolls up into a "Program Name" of Emergency Medical and a "Medical Benefit Plan" of Other Medical. A historical accounting of enrollment of the specific coverage group starting in calendar year 2020 will also be published separately.

    The data represents the number of active recipients who received benefits from a type of assistance (TOA) in that calendar year and month. A recipient may have received benefits from multiple TOAs in the same month; if so, that recipient will be included in multiple categories in this dataset (counted more than once). For privacy considerations, a count of zero is used for counts less than five.

    The methodology for determining the address of the recipients changed:
    1. The address of a recipient in the ImpaCT system is now correctly determined specific to that month instead of using the address of the most recent month. This resulted in some shuffling of the recipients among townships starting in October 2016.
    2. If, in a given month, a recipient has benefit records in both the HIX system and the ImpaCT system, the address of the recipient is now calculated as follows to resolve conflicts: use the residential address in ImpaCT if it exists, else use the mailing address in ImpaCT if it exists, else use the address in HIX. This resulted in a reduction in counts for most townships starting in March 2017, because a single address is now used instead of two when the systems do not agree.
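
    The address-conflict rule in step 2 is a simple fallback chain; as a sketch (field names are illustrative, not from the dataset):

    def resolve_address(impact_residential, impact_mailing, hix_address):
        # ImpaCT residential if present, else ImpaCT mailing, else the HIX address.
        return impact_residential or impact_mailing or hix_address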

  20. Gamelytics: Mobile Analytics Challenge

    • kaggle.com
    zip
    Updated Feb 16, 2025
    Cite
    letocen (2025). Gamelytics: Mobile Analytics Challenge [Dataset]. https://www.kaggle.com/datasets/debs2x/gamelytics-mobile-analytics-challenge
    Explore at:
    zip(66154620 bytes)Available download formats
    Dataset updated
    Feb 16, 2025
    Authors
    letocen
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Gamelytics: Mobile Analytics Challenge 🎮📊

    Subtitle

    Unlock key insights into player behavior, optimize game metrics, and make data-driven decisions!

    Description

    Welcome to the Gamelytics: Mobile Analytics Challenge, a real-world-inspired dataset designed for data enthusiasts eager to dive deep into mobile game analytics. This dataset challenges you to analyze player behavior, evaluate A/B test results, and develop metrics for assessing game event performance.

    Project Context & Tasks

    Task 1: Retention Analysis

    🔍 Objective: Calculate the daily retention rate of players, starting from their registration date.
    📄 Data Sources:
    - reg_data.csv: Contains user registration timestamps (reg_ts) and unique user IDs (uid).
    - auth_data.csv: Contains user login timestamps (auth_ts) and unique user IDs (uid).
    💡 Challenge: Develop a Python function to calculate retention, allowing you to test its performance on both the complete dataset and smaller samples.
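
    One possible shape for that function, as a minimal pandas sketch: the column names match the dataset description, while read_csv options (e.g. separators) may need adjusting, and dividing by the full cohort is a simplification that ignores each user's observation window.

    import pandas as pd

    def daily_retention(reg_path="reg_data.csv", auth_path="auth_data.csv", max_day=30):
        reg = pd.read_csv(reg_path)    # columns: reg_ts (Unix time), uid
        auth = pd.read_csv(auth_path)  # columns: auth_ts (Unix time), uid

        merged = auth.merge(reg, on="uid", how="inner")
        merged["day"] = ((merged["auth_ts"] - merged["reg_ts"]) // 86_400).astype(int)

        active = merged[(merged["day"] >= 0) & (merged["day"] <= max_day)]
        retained = active.groupby("day")["uid"].nunique()
        cohort_size = reg["uid"].nunique()
        return (retained / cohort_size).reindex(range(max_day + 1), fill_value=0.0)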

    Task 2: A/B Testing for Promotional Offers

    🔍 Objective: Identify the best-performing promotional offer set by comparing key revenue metrics.
    💰 Context:
    - The test group has a 5% higher ARPU than the control group.
    - In the control group, 1928 users out of 202,103 are paying customers.
    - In the test group, 1805 users out of 202,667 are paying customers.
    📊 Data Sources:
    - ab_test.csv: Includes user_id, revenue, and testgroup columns.
    💡 Challenge: Decide which offer set performs best, and determine the appropriate metrics for a robust evaluation.
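
    A minimal sketch of one reasonable evaluation: conversion and ARPU per group plus a two-proportion z-test on the share of paying users. Column names come from ab_test.csv as described above; which metric should drive the decision is left open, as the task intends.

    import numpy as np
    import pandas as pd
    from scipy import stats

    ab = pd.read_csv("ab_test.csv")   # columns: user_id, revenue, testgroup

    summary = ab.groupby("testgroup").agg(
        users=("user_id", "nunique"),
        payers=("revenue", lambda r: int((r > 0).sum())),
        arpu=("revenue", "mean"))
    summary["conversion"] = summary["payers"] / summary["users"]
    print(summary)

    # Two-proportion z-test on conversion, assuming exactly two groups.
    k = summary["payers"].to_numpy(dtype=float)
    n = summary["users"].to_numpy(dtype=float)
    p_pool = k.sum() / n.sum()
    z = (k[0] / n[0] - k[1] / n[1]) / np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
    print("z =", z, "two-sided p =", 2 * stats.norm.sf(abs(z)))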

    Task 3: Event Performance Evaluation in "Plants & Gardens"

    🔍 Objective: Develop metrics to assess the success of a time-limited in-game event where players can earn unique rewards.
    🍃 Context: Players complete levels to win exclusive items, bonuses, or coins. In a variation, players may be penalized (sent back levels) after failed attempts.
    💡 Challenge: Define how metrics should change under the penalty variation and identify KPIs for evaluating event success.

    Dataset Information

    The provided data is split into three files, each detailing a specific aspect of the application. Here's a breakdown:

    1. User Registration Data (reg_data.csv)

    • Records: 1,000,000
    • Columns:
      • reg_ts: Registration time (Unix time, int64)
      • uid: Unique user ID (int64)
    • Memory Usage: 15.3 MB
    • Description: This dataset contains user registration timestamps and IDs. It is clean and contains no missing data.

    2. User Activity Data (auth_data.csv)

    • Records: 9,601,013
    • Columns:
      • auth_ts: Login time (Unix time, int64)
      • uid: Unique user ID (int64)
    • Memory Usage: 146.5 MB
    • Description: This dataset captures user login timestamps and IDs. It is clean and contains no missing data.

    3. A/B Testing Data (ab_test.csv)

    • Records: 404,770
    • Columns:
      • user_id: Unique user ID (int64)
      • revenue: Revenue (int64)
      • testgroup: Test group (object)
    • Memory Usage: ~9.3 MB
    • Description: This dataset provides insights into A/B test results, including revenue and group allocation for each user. It is clean and ready for analysis.

    Inspiration & Benefits

    • Real-World Relevance: Inspired by actual challenges in mobile gaming analytics, this dataset lets you solve meaningful problems.
    • Diverse Data Types: Work with registration logs, activity timestamps, and experimental results to gain a holistic understanding of mobile game data.
    • Skill Building: Perfect for those honing skills in retention analysis, A/B testing, and event-based performance evaluation.
    • Community Driven: Built to inspire collaboration and innovation in the data analytics community. 🚀

    Whether you’re a beginner or an expert, this dataset offers an engaging challenge to sharpen your analytical skills and drive actionable insights. Happy analyzing! 🎉📈
