CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
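As a rough illustration of the kind of base-R analysis the Iris module walks through (summary statistics, a correlation, a histogram, and a scatter plot), a minimal sketch is shown below; it uses the built-in iris dataset but is not taken from the case study's own materials.
# Summary statistics for each measurement in the built-in iris dataset
data(iris)
summary(iris)
# Correlation between petal length and petal width
cor(iris$Petal.Length, iris$Petal.Width)
# Histogram of sepal length
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
# Scatter plot of petal length vs. petal width, colored by species
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")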
The original dataset shared on GitHub can be found here. These are hands-on practice datasets that were linked through the Coursera Guided Project course "Handling Missing Values in R," part of the Coursera Project Network. The dataset links were shared by the original author and instructor of the course, Arimoro Olayinka Imisioluwa.
Things you could do with this dataset: As a beginner in R, these datasets helped me get the hang of making data clean and tidy and handling missing values (numeric only) using R. Good for anyone looking for a beginner-to-intermediate-level understanding of these subjects.
Here are my notebooks (kernels) using these datasets and a few more datasets preloaded in R, as suggested by the instructor: Tidy Data Practice; Missing Data Handling - Numeric.
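As a rough sketch of the kind of numeric missing-value handling practiced in the course (this is not code from the course; the toy data frame and column names are made up):
# Toy data frame with missing numeric values (hypothetical)
library(dplyr)
library(tidyr)
df <- data.frame(id = 1:6, score = c(10, NA, 7, NA, 9, 8))
# Count missing values per column
colSums(is.na(df))
# Option 1: drop rows with any missing values
df_complete <- df %>% drop_na()
# Option 2: impute numeric missing values with the column mean
df_imputed <- df %>%
  mutate(score = ifelse(is.na(score), mean(score, na.rm = TRUE), score))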
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source, object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and producing graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions specifying how the program should behave while handling the data; these can also be stored in R's simple object system.
For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, intended to help inform and guide the work of R users and statisticians. It describes the different types of statistical data analysis and methods, and the best scenarios for using each in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures, including a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand their results. The book also covers the different data formats and sources, and how to test the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained throughout. It is the first book to provide a comprehensive description and step-by-step, hands-on practical guide to carrying out the different types of statistical analysis in R for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to interpretation and storage of the results for future use, and graphical visualization and representation. In short: the congruence of statistics and computer programming for research.
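As a brief illustration of the parametric/non-parametric distinction the book covers (this example is not taken from the book; the simulated data and the t-test/Wilcoxon pairing are just one common case in base R):
# Simulated data for two groups (hypothetical)
set.seed(1)
group_a <- rnorm(30, mean = 50, sd = 10)
group_b <- rnorm(30, mean = 55, sd = 10)
# Check the normality assumption for one group
shapiro.test(group_a)
# Parametric comparison of the two means
t.test(group_a, group_b)
# Non-parametric alternative when normality is doubtful
wilcox.test(group_a, group_b)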
This dataset was created by Nargis Karimova
Released under CC0: Public Domain
This dataset was created by Jose Carbonell Capo
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". These materials are also available on GitHub (https://github.com/wesslen/text-analysis-org-science), and the accompanying Shiny app is in its own GitHub repo (https://github.com/wesslen/topicApp).
The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...
Article and dataset fairness: To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored ‘1’ or ‘0’ based on whether it met that criterion, with a total possible score of ten.
Open grading policies: Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...
# Data for: Integrating open education practices with data analysis of open science in an undergraduate course
Author: Marja H Bakermans
Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA
ORCID: https://orcid.org/0000-0002-4879-7771
Institutional IRB approval: IRB-24–0314
The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all of the transactions that happened over a period of time. The retailer will use the results to grow the business and to provide customers with suggested itemsets; this makes it possible to increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most often used when you want to discover associations between different objects in a set, for example to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together and allows the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08 / 0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80 / 0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
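The same arithmetic can be reproduced in R using the counts given above:
n_customers <- 100
n_mouse <- 10   # bought a computer mouse
n_mat   <- 9    # bought a mouse mat
n_both  <- 8    # bought both
support    <- n_both / n_customers                 # 0.08
confidence <- support / (n_mouse / n_customers)    # 0.08 / 0.10 = 0.80
lift       <- confidence / (n_mat / n_customers)   # 0.80 / 0.09 ≈ 8.9
c(support = support, confidence = confidence, lift = lift)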
Number of Attributes: 7
First, we need to load the required libraries; each is described briefly below.
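The screenshot listing the libraries is not reproduced here; a typical set for this market basket workflow would be something like the following (the exact list in the original analysis may differ):
library(readxl)     # read the .xlsx file
library(dplyr)      # general data manipulation
library(arules)     # association rule mining (Apriori)
library(arulesViz)  # visualising the resulting rules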
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
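For example (a sketch; it assumes the file sits in the working directory):
retail <- read_excel("Assignment-1_Data.xlsx")
head(retail)   # preview the first rows
str(retail)    # check column names and types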
Next, we will clean our data frame by removing missing values.
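One simple way to do this, assuming the data frame is called retail as above:
retail <- retail[complete.cases(retail), ]   # keep only complete rows
sum(is.na(retail))                           # should now be 0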
To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
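A sketch of the conversion and mining step follows; the column names BillNo and Itemname are assumptions about this dataset's layout, and the support/confidence thresholds are illustrative only.
# Group items by invoice and coerce to the arules transactions class
items_by_invoice <- lapply(split(as.character(retail$Itemname), retail$BillNo), unique)
trans <- as(items_by_invoice, "transactions")
# Mine association rules with the Apriori algorithm
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))   # strongest rules by lift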
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Excel spreadsheet containing only fire practice data from the Livelihood Fire Database (LIFE), as used for analysis of the database in R.
This reference contains the R code for the analysis and summary of detections of Bachman's sparrow, bobwhite quail, and brown-headed nuthatch through 2020. Specifically, it generates probability of detection and occupancy for the species based on call counts and calls elicited with playback. The code loads raw point count (CSV) and fire history (CSV) data and cleans/transforms them into a tidy format for occupancy analysis. It then creates the necessary data structure for occupancy analysis, performs the analysis for the three focal species, and provides functionality for generating tables and figures summarizing the key findings. The raw data, point count locations, and other spatial data (shapefiles) are contained in the dataset.
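The dataset's own scripts define the actual workflow; purely as an illustration, a single-season occupancy model of this kind is often fitted with the unmarked package, along the lines of the sketch below (the simulated detection matrix and fire covariate are hypothetical, not the survey data).
library(unmarked)
set.seed(1)
# Hypothetical data: 50 sites x 3 visits of 0/1 detections plus a site covariate
detections <- matrix(rbinom(150, 1, 0.3), nrow = 50, ncol = 3)
site_covs  <- data.frame(years_since_fire = runif(50, 0, 10))
umf <- unmarkedFrameOccu(y = detections, siteCovs = site_covs)
# Detection probability and occupancy, each with the fire-history covariate
fit <- occu(~ years_since_fire ~ years_since_fire, data = umf)
summary(fit)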
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
'Climate Companions' was a two-year, practice-based Participatory Action Research (PAR) project in Poplar, East London. The research explored the potential of design-driven civic pedagogies in nurturing agency toward more resilient urban futures. It asked the question: which design tools, learning methods, and spaces enable transformative civic learning? Civic pedagogies seek to realise civic action through projects of emancipatory learning occurring at the edges of and outside our academic institutions. This co-inquiry aimed to extend theories and practices in this field by utilising co-design to collaboratively shape two civic pedagogies (2022-2023). These were nested within the R-Urban Poplar eco-civic hub, an urban common and part of a network of ecological hubs supporting circularity and civic resilience in cities. The data uploaded here relate to post-evaluation semi-structured interviews conducted with project participants between March 2023 and August 2024. These transcriptions formed the basis of the data analysis, alongside auto-ethnographic reflections, for this PhD thesis.
Data for Reardon AJF, et al., From vision toward best practices: Evaluating in vitro transcriptomic points of departure for application in risk assessment using a uniform workflow. Front. Toxicol. 5:1194895. doi: 10.3389/ftox.2023.1194895. PMC10242042. This dataset is associated with the following publication: Reardon, A., R. Farmahin, A. Williams, M. Meier, G. Addicks, C. Yauk, G. Matteo, E. Atlas, J. Harrill, L. Everett, I. Shah, R. Judson, S. Ramaiahgari, S. Ferguson, and T. Barton-Maclaren. From vision toward best practices: Evaluating in vitro transcriptomic points of departure for application in risk assessment using a uniform workflow. Frontiers in Toxicology. Frontiers, Lausanne, SWITZERLAND, 5: 1194895, (2023).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Learn 6 essential cohort analysis reports to track SaaS growth, revenue trends, and customer churn. Data-driven insights with code examples for startup founders.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
This dataset was generated using R.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
  student_id  = 1:n,
  study_hours = round(study_hours, 2),
  grade       = round(grade, 2)
)
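A quick way to check the simulated relationship (illustrative, not part of the generation script above): the fitted slope should be close to 5 and the intercept close to 40, apart from the effect of capping grades at 0 and 100.
# Fit a simple linear regression to recover the simulated parameters
fit <- lm(grade ~ study_hours, data = dataset)
summary(fit)
# Visual check of the relationship
plot(dataset$study_hours, dataset$grade,
     xlab = "Study hours", ylab = "Grade", pch = 20, col = "grey60")
abline(fit, lwd = 2)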
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is the analysis of the data outputs of 240 randomly selected research papers from 12 top-ranked journals published in early 2023. We investigate author compliance with recommended (but not compulsory) data policies, whether there is evidence to suggest that authors apply FAIR data guidance in their data publishing, and whether the existence of specific recommendations for publishing NMR data by some journals encourages compliance. Files in the data package have been provided in both human- and machine-readable forms. The main dataset is available in the Excel file Data worksheet.XLSX, the contents of which can also be found in Main_dataset.CSV, Data_types.CSV, and Article_selection.CSV, with explanations of the variable coding used in the studies in Variable_names.CSV, Codes.CSV, and FAIR_variable_coding.CSV. The R code used for the article selection can be found in Article_selection.R. Data about article types from the journals that contain original research data is in Article_types.CSV. Data collected for analysis in our sister paper [4] can be found in Extended_Adherence.CSV, Extended_Crystallography.CSV, Extended_DAS.CSV, Extended_File_Types.CSV, and Extended_Submission_Process.CSV. A full list of files in the data package and a short description of each is given in README.TXT.
Project Status: Proof-of-Concept (POC) - Capstone Project
This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.
The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.
This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.
Author: Anitha R (https://www.linkedin.com/in/anithaswamy)
Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description) - no description available.
Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:
* Inaccurate financial reporting
* Significant reconciliation efforts
* Potential audit failures or compliance issues
* Masking of operational errors or fraud
Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.
Key elements of the solution include: FAGLFLEXA as the basis for reliability; engineered features (FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge; model anomaly counts (Model_Anomaly_Count) and HRF counts (HRF_Count) combined into a Priority_Tier for focusing investigation efforts; and a Review_Focus text description summarizing why an item was flagged.
The project followed a structured approach:
* BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
* Engineered features (saved as sap_engineered_features.csv).
(For detailed methodology, please refer to the Comprehensive_Project_Report.pdf in the /docs folder, if included.)
Libraries:
joblib==1.4.2
This dataset contains retail sales records from a superstore, including detailed information on orders, products, categories, sales, discounts, profits, customers, and regions.
It is widely used for business intelligence, data visualization, and machine learning projects. With features such as order date, ship mode, customer segment, and geographic region, the dataset is excellent for:
Sales forecasting
Profitability analysis
Market basket analysis
Customer segmentation
Data visualization practice (Tableau, Power BI, Excel, Python, R)
Inspiration:
Great dataset for learning how to build dashboards.
Commonly used in case studies for predictive analytics and decision-making.
Source: Originally inspired by a sample dataset frequently used in Tableau training and BI case studies.
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
The WIC Infant and Toddler Feeding Practices Study–2 (WIC ITFPS-2) (also known as the “Feeding My Baby Study”) is a national, longitudinal study that captures data on caregivers and their children who participated in the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) around the time of the child’s birth. The study addresses a series of research questions regarding feeding practices, the effect of WIC services on those practices, and the health and nutrition outcomes of children on WIC. Additionally, the study assesses changes in behaviors and trends that may have occurred over the past 20 years by comparing findings to the WIC Infant Feeding Practices Study–1 (WIC IFPS-1), the last major study of the diets of infants on WIC. This longitudinal cohort study has generated a series of reports. These datasets include data from caregivers and their children during the prenatal period and during the children’s first five years of life (child ages 1 to 60 months). A full description of the study design and data collection methods can be found in Chapter 1 of the Second Year Report (https://www.fns.usda.gov/wic/wic-infant-and-toddler-feeding-practices-st...). A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).
Processing methods and equipment used: Data in this dataset were primarily collected via telephone interview with caregivers. Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers. The study team cleaned the raw data to ensure the data were as correct, complete, and consistent as possible.
Study date(s) and duration: Data collection occurred between 2013 and 2019.
Study spatial scale (size of replicates and spatial scale of study area): Respondents were primarily the caregivers of children who received WIC services around the time of the child’s birth. Data were collected from 80 WIC sites across 27 State agencies.
Level of true replication: Unknown.
Sampling precision (within-replicate sampling or pseudoreplication): This dataset includes sampling weights that can be applied to produce national estimates. A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).
Level of subsampling (number and repeat or within-replicate sampling): A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).
Study design (before–after, control–impacts, time series, before–after-control–impacts): Longitudinal cohort study.
Description of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains caregiver-level responses to telephone interviews. Also available in the dataset are children’s length/height and weight data, which were objectively collected while at the WIC clinic or during visits with healthcare providers. In addition, the file contains derived variables used for analytic purposes. The file also includes weights created to produce national estimates. The dataset does not include any personally-identifiable information for the study children and/or for individuals who completed the telephone interviews.
Description of any gaps in the data or other limiting factors: Please refer to the series of annual WIC ITFPS-2 reports (https://www.fns.usda.gov/wic/infant-and-toddler-feeding-practices-study-2-fourth-year-report) for detailed explanations of the study’s limitations.
Outcome measurement methods and equipment used: The majority of outcomes were measured via telephone interviews with children’s caregivers. Dietary intake was assessed using the USDA Automated Multiple Pass Method (https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-h...). Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers.
Resources in this dataset (title, followed by file name):
- ITFP2 Year 5 Enroll to 60 Months Public Use Data CSV (itfps2_enrollto60m_publicuse.csv)
- ITFP2 Year 5 Enroll to 60 Months Public Use Data Codebook (ITFPS2_EnrollTo60m_PUF_Codebook.pdf)
- ITFP2 Year 5 Enroll to 60 Months Public Use Data SAS SPSS STATA R Data (ITFP@_Year5_Enroll60_SAS_SPSS_STATA_R.zip)
- ITFP2 Year 5 Ana to 60 Months Public Use Data CSV (ampm_1to60_ana_publicuse.csv)
- ITFP2 Year 5 Tot to 60 Months Public Use Data Codebook (AMPM_1to60_Tot Codebook.pdf)
- ITFP2 Year 5 Ana to 60 Months Public Use Data Codebook (AMPM_1to60_Ana Codebook.pdf)
- ITFP2 Year 5 Ana to 60 Months Public Use Data SAS SPSS STATA R Data (ITFP@_Year5_Ana_60_SAS_SPSS_STATA_R.zip)
- ITFP2 Year 5 Tot to 60 Months Public Use Data CSV (ampm_1to60_tot_publicuse.csv)
- ITFP2 Year 5 Tot to 60 Months Public Use SAS SPSS STATA R Data (ITFP@_Year5_Tot_60_SAS_SPSS_STATA_R.zip)
- ITFP2 Year 5 Food Group to 60 Months Public Use Data CSV (ampm_foodgroup_1to60m_publicuse.csv)
- ITFP2 Year 5 Food Group to 60 Months Public Use Data Codebook (AMPM_FoodGroup_1to60m_Codebook.pdf)
- ITFP2 Year 5 Food Group to 60 Months Public Use SAS SPSS STATA R Data (ITFP@_Year5_Foodgroup_60_SAS_SPSS_STATA_R.zip)
- WIC Infant and Toddler Feeding Practices Study-2 Data File Training Manual (WIC_ITFPS-2_DataFileTrainingManual.pdf)
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
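As an illustration of that layout, a small R helper can map a KernelVersions id to its expected folder (this assumes folder names are the plain, unpadded numbers described above; check against the actual files):
kernel_code_path <- function(version_id) {
  top <- version_id %/% 1e6           # millions bucket, e.g. 123
  sub <- (version_id %/% 1e3) %% 1e3  # thousands bucket, e.g. 456
  file.path(top, sub, version_id)
}
kernel_code_path(123456789)   # "123/456/123456789"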
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!