https://crawlfeeds.com/privacy_policy
Discover the Walmart Products Free Dataset, featuring 2,000 records in CSV format. This dataset includes detailed information about various Walmart products, such as names, prices, categories, and descriptions.
It’s perfect for data analysis, e-commerce research, and machine learning projects. Download now and kickstart your insights with accurate, real-world data.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Context:
This dataset provides comprehensive historical data for the EPX/USDT trading pair on Binance, dating from November 21, 2021, to May 21, 2024. It is particularly curated for facilitating advanced predictive analytics and machine learning projects, especially in the field of financial time series forecasting.
Sources:
The data was meticulously sourced from investing.com, a reliable platform for financial information and data analytics. It captures critical daily trading metrics, including the opening, closing, highest, and lowest prices, along with daily trading volume and percentage changes. This rich dataset is integral for constructing robust models that can predict future trading behaviors and trends.
Inspiration:
With a background in artificial intelligence and financial modeling, I have embarked on a project to predict the future prices of EPX/USDT using advanced neural network architectures. This project aims to leverage the power of several cutting-edge algorithms to create a robust forecasting backbone, combining:
Gated Recurrent Units (GRU): Employed to capture the complexities of sequential data while efficiently handling long-term dependencies.
Long Short-Term Memory (LSTM): Utilized to overcome the vanishing gradient problem, ensuring the model remembers essential patterns over extended periods.
Recurrent Neural Networks (RNN): Applied to process sequences of trading data, retaining the temporal dynamics and dependencies inherent in time series data.
Transformers: Integrated to benefit from their ability to handle both local and global dependencies in data, ensuring more accurate and contextually aware predictions.
The synergy of these algorithms aims to forge a resilient and accurate predictive model, capable of anticipating price movements and trends for the month of June 2024. This project showcases the potential of deploying hybrid neural network architectures for tackling real-world financial forecasting challenges.
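For readers who want a concrete starting point, below is a minimal sketch of a stacked LSTM/GRU forecaster in Python (Keras). The file name `EPX_USDT_daily.csv`, the `Close` column, the 30-day window, and all hyperparameters are illustrative assumptions, not the configuration actually used in this project.

```python
# Minimal sketch: a stacked LSTM + GRU forecaster for daily closing prices.
# File name, column name, window size, and hyperparameters are placeholders.
import numpy as np
import pandas as pd
import tensorflow as tf

WINDOW = 30  # days of history used to predict the next close

df = pd.read_csv("EPX_USDT_daily.csv")           # hypothetical file name
prices = df["Close"].to_numpy(dtype="float32")   # hypothetical column name

# Scale to [0, 1] and build sliding windows.
lo, hi = prices.min(), prices.max()
scaled = (prices - lo) / (hi - lo)
X = np.stack([scaled[i:i + WINDOW] for i in range(len(scaled) - WINDOW)])[..., None]
y = scaled[WINDOW:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(64, return_sequences=True),  # long-range memory
    tf.keras.layers.GRU(32),                           # lighter recurrent layer
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.1)
```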
Usage:
Users can utilize this dataset to:
Conduct time series analysis and predictive modeling.
Train and evaluate various machine learning and deep learning models.
Develop custom financial forecasting tools and algorithms.
Enhance their understanding of cryptocurrency trading patterns and dynamics.
With this dataset, the financial forecasting community can explore novel modeling techniques and validate their approaches against real-world data, contributing to the development of more precise and reliable predictive models.
Conclusion:
This dataset not only serves as a vital resource for academic and professional research but also stands as a testament to the power of innovative neural network architectures in the realm of financial forecasting. Whether you are a novice data scientist eager to explore time series data or a seasoned researcher looking to refine your models, this dataset offers a valuable foundation for your endeavors.
https://bigan.iacs.es/
BIGAN is the Big Data project of the Department of Health of the Government of Aragon, created to improve healthcare using data that are routinely collected within the public health system of Aragon. Development of the project has been entrusted to the Aragon Institute of Health Sciences (IACS).
The purpose of the project is to integrate all data collected within the health system on a technological platform, where it can be analysed by healthcare professionals, managers, educators, and researchers. The ultimate goal is to improve the healthcare system and the health of residents in Aragon through data observation. To achieve this, collection, analysis, and sharing of information between all involved stakeholders is vital.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Students in statistics, data science, analytics, and related fields study the theory and methodology of data-related topics. Some, but not all, are exposed to experiential learning courses that cover essential parts of the life cycle of practical problem-solving. Experiential learning enables students to convert real-world issues into solvable technical questions and effectively communicate their findings to clients. We describe several experiential learning course designs in statistics, data science, and analytics curricula. We present findings from interviews with faculty from the U.S., Europe, and the Middle East and surveys of former students. We observe that courses featuring live projects and coaching by experienced faculty have a high career impact, as reported by former participants. However, such courses are labor-intensive for both instructors and students. We give estimates of the required effort to deliver courses with live projects and the perceived benefits and tradeoffs of such courses. Overall, we conclude that courses offering live-project experiences, despite being more time-consuming than traditional courses, offer significant benefits for students regarding career impact and skill development, making them worthwhile investments. Supplementary materials for this article are available online.
The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method. Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking the level of difficulty of a dataset depends on the algorithm. These levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets is also supported by the Standard Reference Data Program.
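As an illustration of how such reference datasets are typically used, the sketch below fits an ordinary least squares regression to the Longley data (a copy of which ships with statsmodels) and prints the computed coefficients so they can be compared against the certified values published on the NIST StRD pages; the certified numbers are deliberately not hard-coded here.

```python
# Minimal sketch: checking a linear-regression routine against the Longley
# reference dataset. Compare the printed results with the certified values
# from the NIST StRD web pages (not reproduced here).
import numpy as np
import statsmodels.api as sm

data = sm.datasets.longley.load_pandas()
X = sm.add_constant(data.exog)   # design matrix with intercept
y = data.endog

fit = sm.OLS(y, X).fit()
print(fit.params)   # compare against the certified coefficients
print(fit.bse)      # certified standard errors are also published

# A typical acceptance check: agreement to a chosen number of significant digits.
def agrees(computed, certified, digits=9):
    return np.allclose(computed, certified, rtol=10 ** (-digits))
```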
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Deterministic and stochastic approaches are two methods for modeling the crude oil and bottled water markets. Forecasting market prices directly affects energy producers and water users. Two software tools, Tableau and Python, are used to model and visualize both markets with the aim of estimating possible future prices. The role of these tools is to provide alternative forecasts from different methods (deterministic versus stochastic): price prediction in Tableau is deterministic, based on global optimization and time series, whereas Monte Carlo simulation, a stochastic method, is implemented in Python. The purpose of the project is, first, to predict the prices of crude oil and bottled water with a stochastic method (Monte Carlo simulation) and a deterministic one (Tableau) and, second, to compare the resulting prices in a case study of West Texas Intermediate (WTI) crude oil prices and U.S. bottled water.
1. Introduction
Predicting stocks and stock price indices is challenging because of the uncertainties involved. The problem can be approached from different angles: the analysis investors perform before investing in a stock, or the evaluation of stocks by studying statistics generated by market activity, such as past prices and volumes. Data analysis attempts to identify patterns and trends that may help estimate future prices. Initially, classical regression (deterministic) methods were used to predict stock trends; later, stochastic methods were applied to forecasting as well. In "Deterministic versus stochastic volatility: implications for option pricing models" (1997), Paul Brockman and Mustafa Chowdhury investigated whether stock return volatility is deterministic or stochastic, reporting that "Results reported herein add support to the growing literature on preference-based stochastic volatility models and generally reject the notion of deterministic volatility" (p. 499). Motivated by this debate, we model and forecast historical data with the two tools (Tableau and Python).
For its forecasting feature, Tableau automatically chooses, from up to eight candidate models, the one that generates the highest-quality forecast. According to the Tableau manual, Tableau assesses forecast quality by optimizing the smoothing of each model, and the optimization is global. The core of the feature is a taxonomy of exponential smoothing models from which the best of the eight is selected when enough data are available. The forecast feature assumes a deterministic real-world data-generating process, so Tableau's forecast illustrates the best possible future price using a deterministic approach (a time series of prices).
Monte Carlo simulation (MCS), implemented in Python, is used to predict the fluctuating market index. Monte Carlo forecasting solves problems by generating suitable random numbers and observing the fraction of them that obeys some property or properties; the method is used to obtain numerical solutions to problems too complicated to solve analytically. It randomly generates thousands of series representing potential outcomes for possible returns. Here the simulated price is driven by random draws from the range of spot prices observed between 2002 and 2016, which represents the stochastic method.
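As a hedged illustration of the stochastic side described above, the sketch below runs a simple Monte Carlo simulation in Python by bootstrapping historical daily returns. The input file, column name, horizon, and path count are placeholders, and resampling historical returns is one common choice rather than the exact model used in this project.

```python
# Minimal sketch: Monte Carlo price simulation by bootstrapping historical
# daily returns. File name, column name, horizon, and path count are
# illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

prices = pd.read_csv("wti_spot_2002_2016.csv")["Price"].to_numpy()  # hypothetical
returns = np.diff(prices) / prices[:-1]          # historical daily returns

n_paths, horizon = 10_000, 252                   # one trading year ahead
sampled = rng.choice(returns, size=(n_paths, horizon), replace=True)
paths = prices[-1] * np.cumprod(1.0 + sampled, axis=1)

final = paths[:, -1]
print("median simulated price:", np.median(final))
print("5th-95th percentile band:", np.percentile(final, [5, 95]))
```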
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Methods: The objective of this project was to determine whether a federated analysis approach using DataSHIELD can maintain the level of results of a classical centralized analysis in a real-world setting. This research was carried out on an anonymous, synthetic, longitudinal real-world oncology cohort randomly split into three local databases, mimicking three healthcare organizations, stored in a federated data platform integrating DataSHIELD. No individual data were transferred; statistics were calculated in parallel within each healthcare organization, and only summary statistics (aggregates) were returned to the federated data analyst. Descriptive statistics, survival analysis, regression models, and correlation were first performed with the centralized approach and then reproduced with the federated approach. The results were then compared between the two approaches.
Results: The cohort was split into three samples (N1 = 157 patients, N2 = 94, and N3 = 64), and 11 derived variables and four types of analyses were generated. All analyses were successfully reproduced using DataSHIELD, except for one descriptive variable due to a data disclosure limitation in the federated environment, showing the good capability of DataSHIELD. For descriptive statistics, exactly equivalent results were found for the federated and centralized approaches, except for some differences in position measures. Estimates of univariate regression models were similar, with a loss of accuracy observed for multivariate models due to source-database variability.
Conclusion: Our project showed a practical implementation and use case of a real-world federated approach using DataSHIELD. The capability and accuracy of common data manipulation and analysis were satisfactory, and the flexibility of the tool enabled the production of a variety of analyses while preserving the privacy of individual data. The DataSHIELD forum was also a practical source of information and support. To find the right balance between privacy and accuracy of the analysis, privacy requirements should be established before the start of the analysis, along with a data quality review of the participating healthcare organizations.
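The following is a conceptual sketch, in plain Python rather than the DataSHIELD API, of the aggregate-only pattern described above: each "site" computes local summary statistics, and only those aggregates leave the site and are pooled by the analyst.

```python
# Conceptual sketch of aggregate-only pooling, NOT DataSHIELD code: each site
# returns only (count, sum, sum of squares), and the analyst combines them.
import math

def local_aggregates(values):
    """Runs inside a site; only aggregates leave the site."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def pooled_mean_sd(aggregates):
    """Runs at the federated analyst on the returned aggregates."""
    n = sum(a[0] for a in aggregates)
    s = sum(a[1] for a in aggregates)
    ss = sum(a[2] for a in aggregates)
    mean = s / n
    var = (ss - n * mean * mean) / (n - 1)
    return mean, math.sqrt(var)

# Three hypothetical sites standing in for the split cohort.
site_data = [[1.2, 3.4, 2.2], [0.9, 2.8], [1.5, 2.0, 2.6, 3.1]]
print(pooled_mean_sd([local_aggregates(d) for d in site_data]))
```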
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.
Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects
Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.
This dataset includes 4,000 rows and the following columns:
- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%
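A minimal pandas sketch for a first cut at the data is shown below. The column names follow the description above, but the exact header spellings (for instance whether geography is stored as "Region" or "Country / Region") may differ, and the file name is a placeholder.

```python
# Minimal sketch: pass rate by region and year. Column names follow the
# description above and may need adjusting to the exact CSV headers.
import pandas as pd

df = pd.read_csv("mckinsey_solve_simulated.csv")   # hypothetical file name

pass_rate = (
    df.assign(passed=df["Status (Pass/Fail)"].eq("Pass"))
      .groupby(["Region", "Year"])["passed"]       # "Region" assumed header
      .mean()
      .mul(100)
      .round(1)
      .rename("pass rate (%)")
)
print(pass_rate.head(10))
```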
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is a minimal example of Data Subject Access Request Packages (SARPs), as they can be retrieved under data protection laws, specifically the GDPR. It includes data from two data subjects, each with accounts for five major services, namely Amazon, Apple, Facebook, Google, and LinkedIn.
This dataset is meant to be an initial dataset that allows for manual exploration of structures and contents found in SARPs. Hence, the number of controllers and user profiles should be minimal but sufficient to allow cross-subject and cross-controller analysis. This dataset can be used to explore structures, formats and data types found in real-world SARPs. Thereby, the planning of future SARP-based research projects and studies shall be facilitated.
We invite other researchers to use this dataset to explore the structure of SARPs. The envisioned primary usage includes the development of user-centric privacy interfaces and other technical contributions in the area of data access rights. Moreover, these packages can also be used for exemplary data analyses, although no substantive research questions can be answered using this data. In particular, this data does not reflect how data subjects behave in the real world. However, it is representative enough to give a first impression of the types of data analysis possible when using real-world data.
In order to allow cross-subject analysis, while keeping the re-identification risk minimal, we used research-only accounts for the data generation. A detailed explanation of the data generation method can be found in the paper corresponding to the dataset, accepted for the Annual Privacy Forum 2024.
In short, two user profiles were designed and corresponding accounts were created for each of the five services. Then, those accounts were used for two to four months. During the usage period, we minimized the amount of identifying data and also avoided interactions with data subjects not part of this research. Afterwards, we performed a data access request via each controller's web interface. Finally, the data was cleansed as described in detail in the accompanying paper and in brief within the following section.
Before publication, both possibly identifying information and security-relevant attributes need to be obfuscated or deleted. Moreover, multi-party data (especially messages with external entities) must be deleted. Where data is obfuscated, we made sure to substitute multiple occurrences of the same information with the same replacement.
We provide a list of deleted and obfuscated items, the obfuscation scheme and, if applicable, the replacement.
The list of obfuscated items looks like the following example:
| path | filetype | filename | attribute | scheme | replacement |
|---|---|---|---|---|---|
| linkedin\Linkedin_Basic | csv | messages.csv | TO | semantic description | Firstname Lastname |
| gooogle\Meine Aktivitäten\Datenexport | html | MeineAktivitäten.html | IP Address | loopback | 127.142.201.194 |
| facebook\personal_information | json | profile_information.json | emails | semantic description | firstname.lastname@gmail.com |
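A minimal sketch of the consistent-substitution idea described above follows: every occurrence of the same original value is mapped to the same pseudonym, so cross-file links survive obfuscation. The replacement scheme and category labels are illustrative only, not the exact procedure used for this dataset.

```python
# Minimal sketch of consistent obfuscation: identical originals always map to
# the identical replacement, keyed by a semantic category.
replacements = {}   # original value -> stable replacement
counters = {}       # per-category counter for generated pseudonyms

def obfuscate(value, category):
    """Return a stable pseudonym for `value`."""
    if value not in replacements:
        counters[category] = counters.get(category, 0) + 1
        replacements[value] = f"{category}_{counters[category]:03d}"
    return replacements[value]

# Example: the same e-mail address found in two different export files.
print(obfuscate("firstname.lastname@gmail.com", "email"))  # email_001
print(obfuscate("firstname.lastname@gmail.com", "email"))  # email_001 again
print(obfuscate("203.0.113.7", "ip"))                      # ip_001
```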
To give you an overview of the dataset, we publicly provide some meta-data about the usage time and SARP characteristics of exports from subject A/ subject B.
| provider | usage time (in months) | export options | file types | # subfolders | # files | export size |
|---|---|---|---|---|---|---|
| Amazon | 2/4 | all categories | CSV (32/49), EML (2/5), JPEG (1/2), JSON (3/3), PDF (9/10), TXT (4/4) | 41/49 | 51/73 | 1.2 MB / 1.4 MB |
| Apple | 2/4 | all data, max. 1 GB / max. 4 GB | CSV (8/3) | 20/1 | 8/3 | 71.8 KB / 294.8 KB |
| Facebook | 2/4 | all data, JSON/HTML, on my computer | JSON (39/0), HTML (0/63), TXT (29/28), JPG (0/4), PNG (1/15), GIF (7/7) | 45/76 | 76/117 | 12.3 MB / 13.5 MB |
| Google | 2/4 | all data, frequency once, ZIP, max. 4 GB | HTML (8/11), CSV (10/13), JSON (27/28), TXT (14/14), PDF (1/1), MBOX (1/1), VCF (1/0), ICS (1/0), README (1/1), JPG (0/2) | 44/51 | 64/71 | 1.54 MB / 1.2 MB |
| LinkedIn | 2/4 | all data | CSV (18/21) | 0/0 (part 1), 0/0 (part 2) | 13/18 (part 1), 19/21 (part 2) | 3.9 KB / 6.0 KB (part 1), 6.2 KB / 9.2 KB (part 2) |
This data collection was performed by Daniela Pöhn (Universität der Bundeswehr München, Germany), Frank Pallas and Nicola Leschke (Paris Lodron Universität Salzburg, Austria). For questions, please contact nicola.leschke@plus.ac.at.
The dataset was collected according to the method presented in:
Leschke, Pöhn, and Pallas (2024). "How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages". Accepted for Annual Privacy Forum 2024.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, after using clustering prior to classification, performance did not improve much. One reason could be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information.
From the feature creation perspective: clustering analysis creates labels based on patterns in the data, which introduces uncertainty. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, it might increase overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better.
We did not lock in the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. Essentially, the outcome we observed was that our results were not much better than random when applying clustering in the data preprocessing step.
Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
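A minimal sketch of the pipeline discussed above is shown below: append k-means cluster labels as an extra feature and compare cross-validated classification accuracy with and without them. Synthetic data and scikit-learn defaults stand in for the project's actual dataset and settings (including the fixed random_state, which the project deliberately did not use).

```python
# Minimal sketch: clustering as feature creation before classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)

# Baseline: classify on the raw features.
clf = RandomForestClassifier(random_state=0)
base = cross_val_score(clf, X, y, cv=5).mean()

# Cluster label appended as one extra column.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])
aug = cross_val_score(clf, X_aug, y, cv=5).mean()

print(f"baseline accuracy:  {base:.3f}")
print(f"with cluster label: {aug:.3f}")   # often little or no improvement
```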
Asking questions, designing experiments, collecting data and interpreting those data, and communicating the results are some of the most important concepts an undergraduate biology lab can teach students. When the SARS-CoV2 pandemic forced our institution to transition to online learning in the middle of the spring 2020 semester, I was faced with the dilemma of converting a large laboratory course normally structured around active and guided inquiry labs, into an online course, while still covering the planned subject material. To that end, I developed this multi-week lab exercise featuring birdwatching to give students a chance to collect real-world data and use that large, complex dataset to answer an ecological question of their choosing. For this exercise, students were instructed to watch birds at their homes or where they were quarantining, and record a number of different parameters and observations twice a week for three weeks. All records were compiled across the entire class of 347 students. Students were then asked to pick an ecological question that could be addressed with these data, and use the full dataset and a PivotTable to generate a chart summarizing the data relevant to their question. The final project consisted of developing an infographic summarizing the key question and findings of their study in language geared toward the birdwatching public. This exercise gave students the opportunity to explore the scientific process including hypothesis development, data collection, analysis and summarization while still feeling connected to a larger class project.
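For readers who prefer scripting to spreadsheets, the PivotTable step could look roughly like the pandas sketch below. The column names (species, week, count) and the file name are hypothetical stand-ins for the fields students actually recorded.

```python
# Minimal sketch of the PivotTable summarization step in pandas.
# Column names and file name are hypothetical placeholders.
import pandas as pd

obs = pd.read_csv("class_bird_observations.csv")   # hypothetical file

summary = pd.pivot_table(
    obs,
    index="species",    # rows: bird species
    columns="week",     # columns: observation week
    values="count",     # cell value: number of birds seen
    aggfunc="sum",
    fill_value=0,
)
# Most frequently reported species across all weeks.
print(summary.sum(axis=1).sort_values(ascending=False).head())
```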
Primary image: Mockingbirds are one of the most common songbirds identified by students in this project. This image was taken by the author and is her own work.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The Synthetic Student Performance Dataset is designed to support research, analytics, and educational projects focused on academic performance, family background, and behavioral factors affecting students. It mirrors real-world educational data and offers diverse features to explore student success patterns.
[Image: Synthetic student performance data visuals and distributions]
This dataset is ideal for:
The dataset captures a comprehensive view of student life, including family background, academic history, health, and lifestyle, and supports multi-disciplinary research across education, sociology, and data science.
CC0 (Public Domain)
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
🌌 Persona Omni Verse
Overview
Welcome to the Persona Omni Verse dataset! 🎉 This dataset is a rich collection of diverse and detailed personal information, meticulously crafted to provide comprehensive insights into fictional profiles. Ideal for use in various data analysis, machine learning, and simulation projects, it offers a wide range of attributes designed to reflect real-world complexity and variety. 🌟
Dataset Details
Name: Persona Omni Verse… See the full description on the dataset page: https://huggingface.co/datasets/MawLab/persona_omni_verse.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: Research suggests that early detection of hearing loss, coupled with prompt and appropriate treatment, can significantly alleviate its negative impacts. Routinely collected real-world data, such as those from electronic health records, provide an opportunity to enhance our understanding of the management of hearing loss. This project aims to create the HEaring Impairment Data Infrastructure (HEIDI) data lake by assembling datasets from general practice (GP), audiology clinic registries, and cohort studies to investigate hearing-impaired patients’ care pathways. This study seeks to answer key research questions such as “How do patients with hearing loss navigate the care pathway from general practice clinics to audiology clinics?”.
Methods and analysis: The HEIDI data lake will be hosted in a secure research environment at Macquarie University, Sydney, Australia, that complies with Australian legal and ethical requirements to protect patient privacy. New integrated datasets will then be built through data linkage of hearing and GP datasets. Finally, the HEIDI data warehouse will be developed and used as a stand-alone dataset for future research. Descriptive and predictive analytics will be undertaken on the data warehouse to answer our research questions. Descriptive analysis will include both conventional and advanced statistical techniques and visualisation that will help us understand the journey of patients with hearing loss. Machine learning strategies such as deep neural networks, support vector machines, and random forests will also be employed for predictive analytics, to identify participants who could benefit from proactive management by their GP and to determine the effect of interventions along the patient's journey (e.g., referrals to a specialist) on outcomes (e.g., adherence to the intervention).
Dissemination: The findings will be disseminated widely through academic journals, conferences, and other presentations.
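As a generic illustration of one of the predictive strategies mentioned under Methods and analysis, the sketch below fits a random forest to synthetic stand-in features for a binary outcome such as adherence to an intervention. It is not HEIDI analysis code; all data and settings are placeholders.

```python
# Illustrative sketch only (not HEIDI code): a random forest predicting a
# binary outcome from synthetic stand-in features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```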
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
This dataset provides a unique glimpse into the revolutionary approach of the Oakland Athletics during the early 2000s—a period famously chronicled in the Moneyball story. When budget constraints forced the team to rethink traditional scouting, Billy Beane and Paul DePodesta turned to advanced statistics, focusing on undervalued metrics like on-base percentage (OBP) and slugging percentage (SLG) to identify key contributors on the field.
Timeframe & Context:
Capturing the early 2000s, the dataset reflects the strategic shift in baseball, where data-driven decision-making helped underdog teams compete against financially superior clubs.
Key Features:
The dataset includes critical performance metrics such as:
Inspired by the groundbreaking Moneyball strategy, this dataset was originally assembled to highlight how teams can leverage data analytics to uncover hidden value. The approach not only reshaped baseball recruitment but also set the stage for modern sports analytics. The data itself was gathered from reliable sources such as baseball-reference.com and Sports-reference.com, and it was featured in The Analytics Edge course on EdX, which further demonstrates its educational and analytical value.
Sports Analytics:
Explore how key statistics correlate with game outcomes, playoff appearances, and overall team success.
Predictive Modeling:
Build models to predict wins, playoff qualification, and even future team performance.
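A minimal sketch of such a model is given below: regressing season wins on on-base and slugging percentage. The column names (OBP, SLG, W) follow the convention of the usual Moneyball teaching dataset and may need adjusting; the file name is a placeholder.

```python
# Minimal sketch: regress season wins on OBP and SLG.
# Column names and file name are assumptions and may need adjusting.
import pandas as pd
from sklearn.linear_model import LinearRegression

teams = pd.read_csv("moneyball_teams.csv")   # hypothetical file name

X = teams[["OBP", "SLG"]]
y = teams["W"]                               # wins in the season

model = LinearRegression().fit(X, y)
print("coefficients:", dict(zip(X.columns, model.coef_)))
print("intercept:", model.intercept_)
print("R^2:", model.score(X, y))
```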
Educational Resource:
Ideal for coursework or projects in data science and sports analytics, offering a real-world example of how data can drive strategic decisions.
This dataset is a powerful resource for anyone interested in the intersection of sports and data science, providing both historical insights and a platform for further analysis.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
For the sake of research transparency and reducing research and reproducibility costs, we have stored all data and computer code of the project "What make readers love a fiction book: a stat analysis on Wild Wise Weird using real-world data from Amazon readers' reviews" on Zenodo.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
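To give a flavour of how such a database might be queried, the sketch below pulls one site's time series from an ODM-style SQLite file into pandas. The table and column names follow common ODM conventions (Sites, DataValues, SiteCode, LocalDateTime, DataValue), but the exact schema of this project's database, the file name, and the site code are assumptions.

```python
# Minimal sketch: pulling one site's time series from an ODM-style SQLite
# database into pandas. Table/column names follow common ODM conventions
# and may differ from this project's exact schema.
import sqlite3
import pandas as pd

conn = sqlite3.connect("groundwater_odm.sqlite")   # hypothetical file name

query = """
SELECT dv.LocalDateTime, dv.DataValue
FROM DataValues AS dv
JOIN Sites AS s ON s.SiteID = dv.SiteID
WHERE s.SiteCode = ?
ORDER BY dv.LocalDateTime
"""
series = pd.read_sql_query(query, conn, params=("UTGW_001",),   # hypothetical code
                           parse_dates=["LocalDateTime"])
conn.close()

print(series.describe())
```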
https://spdx.org/licenses/CC0-1.0.html
The increasing scale and diversity of seismic data, and the growing role of big data in seismology, have raised interest in methods to make data exploration more accessible. This paper presents the use of knowledge graphs (KGs) for representing seismic data and metadata to improve data exploration and analysis, focusing on usability, flexibility, and extensibility. Using constraints derived from domain knowledge in seismology, we define semantic models of seismic station and event information used to construct the KGs. Our approach utilizes the capability of KGs to integrate data across many sources and diverse schema formats. We use schema-diverse, real-world seismic data to construct KGs with millions of nodes, and illustrate potential applications with three big-data examples. Our findings demonstrate the potential of KGs to enhance the efficiency and efficacy of seismological workflows in research and beyond, indicating a promising interdisciplinary future for this technology.
Methods: The data here consists of, and was collected from:
- Station metadata, in StationXML format, acquired from IRIS DMC using the fdsnws-station webservice (https://service.iris.edu/fdsnws/station/1/).
- Earthquake event data, in NDK format, acquired from the Global Centroid-Moment Tensor (GCMT) catalog webservice (https://www.globalcmt.org) [1,2].
- Earthquake event data, in CSV format, acquired from the USGS earthquake catalog webservice (https://doi.org/10.5066/F7MS3QZH) [3].
The format of the data is described in the README. In addition, a complete description of the StationXML, NDK, and USGS file formats can be found at https://www.fdsn.org/xml/station/, https://www.ldeo.columbia.edu/~gcmt/projects/CMT/catalog/allorder.ndk_explained, and https://earthquake.usgs.gov/data/comcat/#event-terms, respectively. Also provided are conversions from NDK and StationXML file formats into JSON format. References: [1] Dziewonski, A. M., Chou, T. A., & Woodhouse, J. H. (1981). Determination of earthquake source parameters from waveform data for studies of global and regional seismicity. Journal of Geophysical Research: Solid Earth, 86(B4), 2825-2852. [2] Ekström, G., Nettles, M., & Dziewoński, A. M. (2012). The global CMT project 2004–2010: Centroid-moment tensors for 13,017 earthquakes. Physics of the Earth and Planetary Interiors, 200, 1-9. [3] U.S. Geological Survey, Earthquake Hazards Program, 2017, Advanced National Seismic System (ANSS) Comprehensive Catalog of Earthquake Events and Products: Various, https://doi.org/10.5066/F7MS3QZH.
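For orientation, the following is a minimal property-graph sketch in Python using networkx: a station node, an event node, and a "recorded" relationship. The records and attribute names are illustrative and do not reproduce the semantic models defined in the paper.

```python
# Minimal sketch of a property-graph view of stations and events, using
# networkx. Node/edge attributes are illustrative, not the paper's schema.
import networkx as nx

kg = nx.MultiDiGraph()

# Hypothetical records, e.g. parsed from StationXML and a USGS CSV.
station = {"code": "IU.ANMO", "lat": 34.946, "lon": -106.457}
event = {"id": "example_event", "magnitude": 6.1, "lat": 35.2, "lon": -104.9}

kg.add_node(station["code"], kind="station", lat=station["lat"], lon=station["lon"])
kg.add_node(event["id"], kind="event", magnitude=event["magnitude"],
            lat=event["lat"], lon=event["lon"])

# Relationship: this station recorded this event.
kg.add_edge(station["code"], event["id"], relation="recorded")

# Simple query: all events recorded by a given station.
recorded = [v for _, v, d in kg.out_edges(station["code"], data=True)
            if d.get("relation") == "recorded"]
print(recorded)
```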
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Motivation/Problem Statement: DEREChOS is a natural advancement of the existing, highly-successful Automated Event Service (AES) project. AES is an advanced system that facilitates efficient exploration and analysis of Earth science data. While AES is well-suited for the original purpose of searching for phenomena in regularly gridded data (e.g., reanalyses), targeted extensions would enable a much broader class of Earth science investigations to exploit the performance and flexibility of this service. We present a relevancy scenario, Event-based Hydrometeorological Science Data Analysis, which highlights the need for these features that would maximize the potential of DEREChOS for scientific research.
Proposed solution: We propose to develop DEREChOS, an extension of AES, that: (1) generalizes the underlying representation to support irregularly spaced observations such as point and swath data, (2) incorporates appropriate re-gridding and interpolation utilities to enable analysis across data from different sources, (3) introduces nonlinear dimensionality reduction (NDR) to facilitate identification of scientific relationships among high-dimensional datasets, and (4) integrates Moving Object Database technology to improve treatment of continuity for the events with coarse representation in time. With these features, DEREChOS will become a powerful environment that is appropriate for a very wide variety of Earth science analysis scenarios.
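For item (3), the generic technique can be illustrated as follows with scikit-learn's Isomap on a synthetic manifold; this is a sketch of nonlinear dimensionality reduction in general, not the DEREChOS implementation on SciDB.

```python
# Illustrative sketch of nonlinear dimensionality reduction (NDR) on a
# synthetic dataset, using scikit-learn's Isomap. Generic example only.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=2000, random_state=0)

embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (2000, 2): a 2-D embedding preserving local geometry
```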
Research strategy: DEREChOS will be created by integrating various separately developed technologies. In most cases this will require some re-implementation to exploit SciDB, the underlying database that has strong support for multidimensional scientific data. Where possible, synthetic data/inputs will be generated to facilitate independent testing of new components. A scientific use case will be used to derive specific interface requirements and to demonstrate integration success.
Significance: Freshwater resources are predicted to be a major focus of contention and conflict in the 21st century. Thus, hydrometeorology and hydrology communities are particularly attracted by the superior research productivity through AES, which has been demonstrated for two real-world use cases. This interest is reflected by the participation in DEREChOS of our esteemed collaborators, who include the Project Scientist of NASA SMAP, the Principal Scientist of NOAA MRMS, and lead algorithm developers of NASA GPM.
Relevance to the Program Element: This proposal responds to the core AIST program topic: 2.1.3 Data-Centric-Technologies. DEREChOS specifically addresses the request for big data analytics, including tools and techniques for data fusion and data mining, applied to the substantial data and metadata that result from Earth science observation and the use of other data-centric technologies.
TRL: Although AES will have achieved an exit TRL of 5 by the start date of this proposed project, DEREChOS will have an entry TRL of 3 due to the new innovations that have not previously been implemented within the underlying SciDB database. We expect that DEREChOS will have an exit TRL of 5 corresponding to an end-to-end test of the full system in a relevant environment.