CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
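As a rough illustration of the kind of base-R analysis the Iris module walks through (summary statistics, a correlation, a histogram, and a scatter plot), a minimal sketch is shown below; it uses the built-in iris dataset but is not taken from the case study's own materials.
# Summary statistics for each measurement in the built-in iris dataset
data(iris)
summary(iris)
# Correlation between petal length and petal width
cor(iris$Petal.Length, iris$Petal.Width)
# Histogram of sepal length
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
# Scatter plot of petal length vs. petal width, colored by species
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")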
The original dataset shared on GitHub can be found here. These are hands-on practice datasets that were linked through the Coursera Guided Project course "Handling Missing Values in R," part of the Coursera Project Network. The dataset links were shared by the original author and instructor of the course, Arimoro Olayinka Imisioluwa.
Things you could do with this dataset: As a beginner in R, these datasets helped me get the hang of making data clean and tidy and handling missing values (numeric only) using R. Good for anyone looking for a beginner-to-intermediate-level understanding of these subjects.
Here are my notebooks (kernels) using these datasets and a few more datasets preloaded in R, as suggested by the instructor: Tidy Data Practice; Missing Data Handling - Numeric.
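As a rough sketch of the kind of numeric missing-value handling practiced in the course (this is not code from the course; the toy data frame and column names are made up):
# Toy data frame with missing numeric values (hypothetical)
library(dplyr)
library(tidyr)
df <- data.frame(id = 1:6, score = c(10, NA, 7, NA, 9, 8))
# Count missing values per column
colSums(is.na(df))
# Option 1: drop rows with any missing values
df_complete <- df %>% drop_na()
# Option 2: impute numeric missing values with the column mean
df_imputed <- df %>%
  mutate(score = ifelse(is.na(score), mean(score, na.rm = TRUE), score))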
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source, object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and producing graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions specifying how the program should behave while handling the data; these can also be stored in R's simple object system.
For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, intended to help inform and guide the work of R users and statisticians. It describes the different types of statistical data analysis and methods, and the best scenarios for using each in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures, including a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand their results. The book also covers the different data formats and sources, and how to test the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained throughout. It is the first book to provide a comprehensive description and step-by-step, hands-on practical guide to carrying out the different types of statistical analysis in R for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to interpretation and storage of the results for future use, and graphical visualization and representation. In short: the congruence of statistics and computer programming for research.
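As a brief illustration of the parametric/non-parametric distinction the book covers (this example is not taken from the book; the simulated data and the t-test/Wilcoxon pairing are just one common case in base R):
# Simulated data for two groups (hypothetical)
set.seed(1)
group_a <- rnorm(30, mean = 50, sd = 10)
group_b <- rnorm(30, mean = 55, sd = 10)
# Check the normality assumption for one group
shapiro.test(group_a)
# Parametric comparison of the two means
t.test(group_a, group_b)
# Non-parametric alternative when normality is doubtful
wilcox.test(group_a, group_b)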
This dataset was created by Nargis Karimova
Released under CC0: Public Domain
This dataset was created by Jose Carbonell Capo
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". These materials are also available on GitHub (https://github.com/wesslen/text-analysis-org-science), and the accompanying Shiny app is in its own GitHub repo (https://github.com/wesslen/topicApp).
The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...
Article and dataset fairness: To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored ‘1’ or ‘0’ based on whether it met that criterion, with a total possible score of ten.
Open grading policies: Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...
# Data for: Integrating open education practices with data analysis of open science in an undergraduate course
Author: Marja H Bakermans
Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA
ORCID: https://orcid.org/0000-0002-4879-7771
Institutional IRB approval: IRB-24–0314
The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all of the transactions that happened over a period of time. The retailer will use the results to grow the business and to provide customers with suggested itemsets; this makes it possible to increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most often used when you want to discover associations between different objects in a set, for example to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together and allows the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08 / 0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80 / 0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
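The same arithmetic can be reproduced in R using the counts given above:
n_customers <- 100
n_mouse <- 10   # bought a computer mouse
n_mat   <- 9    # bought a mouse mat
n_both  <- 8    # bought both
support    <- n_both / n_customers                 # 0.08
confidence <- support / (n_mouse / n_customers)    # 0.08 / 0.10 = 0.80
lift       <- confidence / (n_mat / n_customers)   # 0.80 / 0.09 ≈ 8.9
c(support = support, confidence = confidence, lift = lift)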
Number of Attributes: 7
First, we need to load the required libraries; each is described briefly below.
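The screenshot listing the libraries is not reproduced here; a typical set for this market basket workflow would be something like the following (the exact list in the original analysis may differ):
library(readxl)     # read the .xlsx file
library(dplyr)      # general data manipulation
library(arules)     # association rule mining (Apriori)
library(arulesViz)  # visualising the resulting rules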
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
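For example (a sketch; it assumes the file sits in the working directory):
retail <- read_excel("Assignment-1_Data.xlsx")
head(retail)   # preview the first rows
str(retail)    # check column names and types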
Next, we will clean our data frame by removing missing values.
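One simple way to do this, assuming the data frame is called retail as above:
retail <- retail[complete.cases(retail), ]   # keep only complete rows
sum(is.na(retail))                           # should now be 0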
To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
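A sketch of the conversion and mining step follows; the column names BillNo and Itemname are assumptions about this dataset's layout, and the support/confidence thresholds are illustrative only.
# Group items by invoice and coerce to the arules transactions class
items_by_invoice <- lapply(split(as.character(retail$Itemname), retail$BillNo), unique)
trans <- as(items_by_invoice, "transactions")
# Mine association rules with the Apriori algorithm
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))   # strongest rules by lift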
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Excel spreadsheet containing only fire practice data from the Livelihood Fire Database (LIFE), as used for analysis of the database in R.
This reference contains the R code for the analysis and summary of detections of Bachman's sparrow, bobwhite quail, and brown-headed nuthatch through 2020. Specifically, it generates probability of detection and occupancy for the species based on call counts and calls elicited with playback. The code loads raw point count (CSV) and fire history (CSV) data and cleans/transforms them into a tidy format for occupancy analysis. It then creates the necessary data structure for occupancy analysis, performs the analysis for the three focal species, and provides functionality for generating tables and figures summarizing the key findings. The raw data, point count locations, and other spatial data (shapefiles) are contained in the dataset.
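The dataset's own scripts define the actual workflow; purely as an illustration, a single-season occupancy model of this kind is often fitted with the unmarked package, along the lines of the sketch below (the simulated detection matrix and fire covariate are hypothetical, not the survey data).
library(unmarked)
set.seed(1)
# Hypothetical data: 50 sites x 3 visits of 0/1 detections plus a site covariate
detections <- matrix(rbinom(150, 1, 0.3), nrow = 50, ncol = 3)
site_covs  <- data.frame(years_since_fire = runif(50, 0, 10))
umf <- unmarkedFrameOccu(y = detections, siteCovs = site_covs)
# Detection probability and occupancy, each with the fire-history covariate
fit <- occu(~ years_since_fire ~ years_since_fire, data = umf)
summary(fit)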
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
'Climate Companions' was a two-year, practice-based Participatory Action Research (PAR) project in Poplar, East London. The research explored the potential of design-driven civic pedagogies in nurturing agency toward more resilient urban futures. It asked the question: which design tools, learning methods, and spaces enable transformative civic learning? Civic pedagogies seek to realise civic action through projects of emancipatory learning occurring at the edges of and outside our academic institutions. This co-inquiry aimed to extend theories and practices in this field by utilising co-design to collaboratively shape two civic pedagogies (2022-2023). These were nested within the R-Urban Poplar eco-civic hub, an urban common and part of a network of ecological hubs supporting circularity and civic resilience in cities. The data uploaded here relate to post-evaluation semi-structured interviews conducted with project participants between March 2023 and August 2024. These transcriptions formed the basis of the data analysis, alongside auto-ethnographic reflections, for this PhD thesis.
Data for Reardon AJF, et al., From vision toward best practices: Evaluating in vitro transcriptomic points of departure for application in risk assessment using a uniform workflow. Front. Toxicol. 5:1194895. doi: 10.3389/ftox.2023.1194895. PMC10242042. This dataset is associated with the following publication: Reardon, A., R. Farmahin, A. Williams, M. Meier, G. Addicks, C. Yauk, G. Matteo, E. Atlas, J. Harrill, L. Everett, I. Shah, R. Judson, S. Ramaiahgari, S. Ferguson, and T. Barton-Maclaren. From vision toward best practices: Evaluating in vitro transcriptomic points of departure for application in risk assessment using a uniform workflow. Frontiers in Toxicology. Frontiers, Lausanne, SWITZERLAND, 5: 1194895, (2023).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Learn 6 essential cohort analysis reports to track SaaS growth, revenue trends, and customer churn. Data-driven insights with code examples for startup founders.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
This dataset was generated using R.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
  student_id  = 1:n,
  study_hours = round(study_hours, 2),
  grade       = round(grade, 2)
)
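A quick way to check the simulated relationship (illustrative, not part of the generation script above): the fitted slope should be close to 5 and the intercept close to 40, apart from the effect of capping grades at 0 and 100.
# Fit a simple linear regression to recover the simulated parameters
fit <- lm(grade ~ study_hours, data = dataset)
summary(fit)
# Visual check of the relationship
plot(dataset$study_hours, dataset$grade,
     xlab = "Study hours", ylab = "Grade", pch = 20, col = "grey60")
abline(fit, lwd = 2)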
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is the analysis of the data outputs of 240 randomly selected research papers from 12 top-ranked journals published in early 2023. We investigate author compliance with recommended (but not compulsory) data policies, whether there is evidence to suggest that authors apply FAIR data guidance in their data publishing, and whether the existence of specific recommendations for publishing NMR data by some journals encourages compliance. Files in the data package have been provided in both human- and machine-readable forms. The main dataset is available in the Excel file Data worksheet.XLSX, the contents of which can also be found in Main_dataset.CSV, Data_types.CSV, and Article_selection.CSV, with explanations of the variable coding used in the studies in Variable_names.CSV, Codes.CSV, and FAIR_variable_coding.CSV. The R code used for the article selection can be found in Article_selection.R. Data about article types from the journals that contain original research data is in Article_types.CSV. Data collected for analysis in our sister paper [4] can be found in Extended_Adherence.CSV, Extended_Crystallography.CSV, Extended_DAS.CSV, Extended_File_Types.CSV, and Extended_Submission_Process.CSV. A full list of files in the data package and a short description of each is given in README.TXT.
Project Status: Proof-of-Concept (POC) - Capstone Project
This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.
The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.
This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.
Author: Anitha R (https://www.linkedin.com/in/anithaswamy)
Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description) - no description available.
Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:
* Inaccurate financial reporting
* Significant reconciliation efforts
* Potential audit failures or compliance issues
* Masking of operational errors or fraud
Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.
Key elements of the solution include: FAGLFLEXA as the basis for reliability; engineered features (FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge; model anomaly counts (Model_Anomaly_Count) and HRF counts (HRF_Count) combined into a Priority_Tier for focusing investigation efforts; and a Review_Focus text description summarizing why an item was flagged.
The project followed a structured approach:
* BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
* Engineered features (saved as sap_engineered_features.csv).
(For detailed methodology, please refer to the Comprehensive_Project_Report.pdf in the /docs folder, if included.)
Libraries:
joblib==1.4.2
This dataset contains retail sales records from a superstore, including detailed information on orders, products, categories, sales, discounts, profits, customers, and regions.
It is widely used for business intelligence, data visualization, and machine learning projects. With features such as order date, ship mode, customer segment, and geographic region, the dataset is excellent for:
Sales forecasting
Profitability analysis
Market basket analysis
Customer segmentation
Data visualization practice (Tableau, Power BI, Excel, Python, R)
Inspiration:
Great dataset for learning how to build dashboards.
Commonly used in case studies for predictive analytics and decision-making.
Source: Originally inspired by a sample dataset frequently used in Tableau training and BI case studies.
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
The WIC Infant and Toddler Feeding Practices Study–2 (WIC ITFPS-2) (also known as the “Feeding My Baby Study”) is a national, longitudinal study that captures data on caregivers and their children who participated in the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) around the time of the child’s birth. The study addresses a series of research questions regarding feeding practices, the effect of WIC services on those practices, and the health and nutrition outcomes of children on WIC. Additionally, the study assesses changes in behaviors and trends that may have occurred over the past 20 years by comparing findings to the WIC Infant Feeding Practices Study–1 (WIC IFPS-1), the last major study of the diets of infants on WIC. This longitudinal cohort study has generated a series of reports. These datasets include data from caregivers and their children during the prenatal period and during the children’s first five years of life (child ages 1 to 60 months). A full description of the study design and data collection methods can be found in Chapter 1 of the Second Year Report (https://www.fns.usda.gov/wic/wic-infant-and-toddler-feeding-practices-st...). A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).
Processing methods and equipment used: Data in this dataset were primarily collected via telephone interview with caregivers. Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers. The study team cleaned the raw data to ensure the data were as correct, complete, and consistent as possible.
Study date(s) and duration: Data collection occurred between 2013 and 2019.
Study spatial scale (size of replicates and spatial scale of study area): Respondents were primarily the caregivers of children who received WIC services around the time of the child’s birth. Data were collected from 80 WIC sites across 27 State agencies.
Level of true replication: Unknown.
Sampling precision (within-replicate sampling or pseudoreplication): This dataset includes sampling weights that can be applied to produce national estimates. A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).
Level of subsampling (number and repeat or within-replicate sampling): A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).
Study design (before–after, control–impacts, time series, before–after-control–impacts): Longitudinal cohort study.
Description of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains caregiver-level responses to telephone interviews. Also available in the dataset are children’s length/height and weight data, which were objectively collected while at the WIC clinic or during visits with healthcare providers. In addition, the file contains derived variables used for analytic purposes. The file also includes weights created to produce national estimates. The dataset does not include any personally-identifiable information for the study children and/or for individuals who completed the telephone interviews.
Description of any gaps in the data or other limiting factors: Please refer to the series of annual WIC ITFPS-2 reports (https://www.fns.usda.gov/wic/infant-and-toddler-feeding-practices-study-2-fourth-year-report) for detailed explanations of the study’s limitations.
Outcome measurement methods and equipment used: The majority of outcomes were measured via telephone interviews with children’s caregivers. Dietary intake was assessed using the USDA Automated Multiple Pass Method (https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-h...). Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers.
Resources in this dataset (title, followed by file name):
- ITFP2 Year 5 Enroll to 60 Months Public Use Data CSV (itfps2_enrollto60m_publicuse.csv)
- ITFP2 Year 5 Enroll to 60 Months Public Use Data Codebook (ITFPS2_EnrollTo60m_PUF_Codebook.pdf)
- ITFP2 Year 5 Enroll to 60 Months Public Use Data SAS SPSS STATA R Data (ITFP@_Year5_Enroll60_SAS_SPSS_STATA_R.zip)
- ITFP2 Year 5 Ana to 60 Months Public Use Data CSV (ampm_1to60_ana_publicuse.csv)
- ITFP2 Year 5 Tot to 60 Months Public Use Data Codebook (AMPM_1to60_Tot Codebook.pdf)
- ITFP2 Year 5 Ana to 60 Months Public Use Data Codebook (AMPM_1to60_Ana Codebook.pdf)
- ITFP2 Year 5 Ana to 60 Months Public Use Data SAS SPSS STATA R Data (ITFP@_Year5_Ana_60_SAS_SPSS_STATA_R.zip)
- ITFP2 Year 5 Tot to 60 Months Public Use Data CSV (ampm_1to60_tot_publicuse.csv)
- ITFP2 Year 5 Tot to 60 Months Public Use SAS SPSS STATA R Data (ITFP@_Year5_Tot_60_SAS_SPSS_STATA_R.zip)
- ITFP2 Year 5 Food Group to 60 Months Public Use Data CSV (ampm_foodgroup_1to60m_publicuse.csv)
- ITFP2 Year 5 Food Group to 60 Months Public Use Data Codebook (AMPM_FoodGroup_1to60m_Codebook.pdf)
- ITFP2 Year 5 Food Group to 60 Months Public Use SAS SPSS STATA R Data (ITFP@_Year5_Foodgroup_60_SAS_SPSS_STATA_R.zip)
- WIC Infant and Toddler Feeding Practices Study-2 Data File Training Manual (WIC_ITFPS-2_DataFileTrainingManual.pdf)
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
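As an illustration of that layout, a small R helper can map a KernelVersions id to its expected folder (this assumes folder names are the plain, unpadded numbers described above; check against the actual files):
kernel_code_path <- function(version_id) {
  top <- version_id %/% 1e6           # millions bucket, e.g. 123
  sub <- (version_id %/% 1e3) %% 1e3  # thousands bucket, e.g. 456
  file.path(top, sub, version_id)
}
kernel_code_path(123456789)   # "123/456/123456789"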
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!