100+ datasets found
  1. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
  2. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 10, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (148,301,844,275 bytes). Available download formats
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
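    As a rough illustration of the layout described above, the sketch below reconstructs the path for a given KernelVersions id. It assumes the sub-folder names are not zero-padded and that the file keeps a notebook extension; neither detail is confirmed by the description, so treat this as a sketch rather than the canonical mapping.

      # Hypothetical helper: locate the file for a KernelVersions id under the
      # two-level layout described above (assumed naming, not verified).
      from pathlib import Path

      def kernel_version_path(root: str, version_id: int, ext: str = "ipynb") -> Path:
          top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
          sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
          return Path(root) / str(top) / str(sub) / f"{version_id}.{ext}"

      print(kernel_version_path("meta-kaggle-code", 123_456_789))
      # meta-kaggle-code/123/456/123456789.ipynb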

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  3. Coursera Courses Uncleaned Dataset to Practice

    • kaggle.com
    Updated May 2, 2024
    Cite
    Janak Pariyar (2024). Coursera Courses Uncleaned Dataset to Practice [Dataset]. https://www.kaggle.com/datasets/janakpariyar/coursera-courses-uncleaned-dataset-to-practice/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Janak Pariyar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset was web scraped from the Coursera website and is static. It consists of 7 columns of various unstructured data, which might help you along your learning curve in Data Science and Data Analytics. Feel free to play around. Happy digging :)

  4. ‘Coursera Course Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Coursera Course Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-coursera-course-dataset-839a/86aaffe7/?iid=003-735&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Coursera Course Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/siddharthm1698/coursera-course-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a dataset I generated during a hackathon for a project. I scraped the data from the official Coursera website. Our project aims to help any new learner find the right course to take by answering just a few questions; it is an intelligent course recommendation system, so we had to scrape data from a few educational websites. This is the data scraped from the Coursera website. For the project, visit: https://github.com/Siddharth1698/Coursu . Please do show your support by following us. I have just started to learn data science and hope this dataset will be helpful to someone for their own purposes. The scraping code is here: https://github.com/Siddharth1698/Coursera-Course-Dataset . Article about the dataset generation: https://medium.com/analytics-vidhya/web-scraping-and-coursera-8db6af45d83f

    Content

    This dataset contains 6 main columns and 890 course records. The detailed description:

    1. course_title: The course title.
    2. course_organization: The organization conducting the course.
    3. course_Certificate_type: The certification types available for the course.
    4. course_rating: The rating associated with each course.
    5. course_difficulty: The difficulty level of the course.
    6. course_students_enrolled: The number of students enrolled in the course.

    Inspiration

    This is one of my first scraped datasets. Follow my GitHub for more: https://github.com/Siddharth1698

    --- Original source retains full ownership of the source dataset ---

  5. Kaggle DS Survey 2019

    • kaggle.com
    Updated Dec 1, 2019
    Cite
    Alan Asri (2019). Kaggle DS Survey 2019 [Dataset]. https://www.kaggle.com/datasets/alanasri/kaggle-ds-survey-2019
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alan Asri
    Description

    Context

    This notebook contains a thorough analysis and explanation of the survey conducted by Kaggle. The survey covered respondents with varied work backgrounds, ages, places of residence, and employers. The survey questions concern the fields they work in, relating to data science and machine learning.

    Content

    The following exploratory data analysis uses data from the survey Kaggle conducted in 2019, which asked respondents questions about machine learning and data science. The core points of this analysis are as follows:

    1. Age distribution versus formal education
    2. Company size versus money spent on machine learning
    3. Comparison of machine learning spending levels by company
    4. Data scientist experience and compensation
    5. Correlation between machine learning experience and salary
    6. Correlation between being a data scientist and compensation
    7. Favourite media sources on data science topics
    8. Favourite media by age distribution, and the media most liked by data scientists
    9. Course platforms for data scientists
    10. Job roles for each title, and the primary job of a data scientist
    11. Regularly used programming languages by job title, especially for data scientists
    12. Comparison of specific programming skills and compensation
    13. Which programming language aspiring data scientists learn first
    14. Integrated development environments used on a regular basis
    15. Top 5 IDEs and which countries use them; Microsoft is not dominant in the USA
    16. Which notebooks are most commonly used on a regular basis; Google dominates
    17. Which countries and companies use which hardware for machine learning
    18. Job roles by specific company type
    19. Computer vision methods mostly used by companies
    20. Distribution of companies by country
    21. Cloud products: Amazon dominates, Google follows
    22. Big data products: Amazon is the majority in enterprises, Google is the majority overall

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  6. ‘School Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘School Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-school-dataset-3c70/2a80983f/?iid=004-128&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘School Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/smeilisa07/number of school teacher student class on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is my first data analysis. I obtained this dataset from the Open Data Jakarta website (http://data.jakarta.go.id/), so most of the dataset is in Indonesian, but I have tried to describe it in the VARIABLE DESCRIPTION.txt file.

    Content

    The title of this dataset is jumlah-sekolah-guru-murid-dan-ruang-kelas-menurut-jenis-sekolah-2011-2016, and it is a CSV file, so you can easily access it. The title means the number of schools, teachers, students, and classrooms by type of school, 2011-2016, so just reading the title gives a good idea of the contents. The dataset has 50 observations and 8 variables, covering 2011 to 2016.

    In general, this dataset is about the quality of education in Jakarta: each year the figures for some school levels decrease and others increase, though not significantly.

    Acknowledgements

    This dataset comes from the Indonesian education authorities and was published as a CSV file by Open Data Jakarta.

    Inspiration

    Although this data is provided publicly by Open Data Jakarta, I want to keep improving my data science skills, especially in R programming, because I think R is easy to learn and really helps keep me curious about data science. I am still struggling with the problems below and need solutions.

    Question :

    1. How can I clean this dataset? I have tried cleaning it, but I am still not sure. You can check the my_hypothesis.txt file, where I try cleaning and visualizing this dataset.

    2. How can I specify a model for machine learning? What steps should I take?

    3. How should I cluster this dataset if I want the labels to be not numbers but tingkat_sekolah for every tahun and jenis_sekolah? You can check the my_hypothesis.txt file.

    --- Original source retains full ownership of the source dataset ---

  7. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
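    For orientation, the sketch below mirrors the pipeline described above (z-score standardization, a stratified 80/20 split, one of the four lightweight scikit-learn classifiers, and the listed metrics). The file name data.csv and the column names id and diagnosis are assumptions about the Kaggle copy of WDBC, not taken from the deposited code.

      # Minimal sketch of the described WDBC workflow (assumed file/column names).
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

      df = pd.read_csv("data.csv")                  # Kaggle WDBC export (assumed name)
      y = (df["diagnosis"] == "M").astype(int)      # 1 = malignant, 0 = benign
      X = df.drop(columns=["id", "diagnosis"])      # drop identifier and label

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, stratify=y, random_state=42)  # stratified 80/20 split

      scaler = StandardScaler()                     # z-score standardization
      X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

      model = KNeighborsClassifier().fit(X_train, y_train)  # one of the four listed models
      pred = model.predict(X_test)
      print(accuracy_score(y_test, pred), precision_score(y_test, pred),
            recall_score(y_test, pred), f1_score(y_test, pred))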

  8. Spacekit Data Archive

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv +1
    Updated Aug 10, 2023
    Cite
    Ru Kein (2023). Spacekit Data Archive [Dataset]. http://doi.org/10.5281/zenodo.8231215
    Explore at:
    application/gzip, zip, csv. Available download formats
    Dataset updated
    Aug 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ru Kein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Collection of datasets, models and training results for spacekit machine learning algorithms. To learn more, please visit https://spacekit.readthedocs.io/en/latest/

    Versioning note: modifications to existing uploads are indicated by major version iterations (e.g. 1.0, 2.0, 3.0); new file additions are denoted by minor version increments (e.g. 1.1, 1.2, 1.3) since these are inherently backwards compatible.

  9. data-science-job-salaries

    • huggingface.co
    Updated Aug 15, 2022
    Cite
    fastai X Hugging Face Group 2022 (2022). data-science-job-salaries [Dataset]. https://huggingface.co/datasets/hugginglearners/data-science-job-salaries
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2022
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    fastai X Hugging Face Group 2022
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Data Science Job Salaries

      Dataset Summary

      Content
    

    Column Description

    • work_year: The year the salary was paid.

    • experience_level: The experience level in the job during the year, with the following possible values: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director).

    • employment_type: The type of employment for the role: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance).

    • job_title… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/data-science-job-salaries.

  10. Data Science London + Scikit-learn

    • kaggle.com
    zip
    Updated Dec 8, 2017
    Cite
    newman (2017). Data Science London + Scikit-learn [Dataset]. https://www.kaggle.com/newman123/data-science-london-scikitlearn
    Explore at:
    zip (1,547,456 bytes). Available download formats
    Dataset updated
    Dec 8, 2017
    Authors
    newman
    Area covered
    London
    Description

    Dataset

    This dataset was created by newman


  11. Multi-feature Golf Play Dataset

    • opendatabay.com
    Updated Jul 4, 2025
    Cite
    Datasimple (2025). Multi-feature Golf Play Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/23026657-8212-4f36-84a0-f6064a0b889b
    Explore at:
    Available download formats
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Education & Learning Analytics
    Description

    This is the Extended Golf Play Dataset, a rich and detailed collection designed to expand upon the classic golf dataset [1]. It incorporates a wide array of features suitable for various data science applications and is especially valuable for teaching purposes [1]. The dataset is organised in a long format, where each row represents a single observation and often includes textual data, such as player reviews or comments [2]. It contains a special set of mini datasets, each tailored to a specific teaching point, for example, demonstrating data cleaning or combining datasets [1]. These are ideal for beginners to practise with real examples and are complemented by notebooks with step-by-step guides [1].

    Columns

    The dataset features a variety of columns, including core, extra, and text-based attributes (a small loading sketch follows this list):

    • ID: A unique identifying number for each player [1].
    • Date: The specific day the data was recorded or the golf session took place [1, 2].
    • Weekday: The day of the week, represented numerically (e.g., 0 for Sunday, 1 for Monday) [1, 3].
    • Holiday: Indicates whether the day was a special holiday (Yes/No), specifically noted for holidays in Japan (1 for yes, 0 for no) [1, 3].
    • Month: The month in which golf was played [3].
    • Season: The time of year, such as spring, summer, autumn, or winter [1, 3].
    • Outlook: The weather conditions during the session (e.g., sunny, cloudy, rainy, snowy) [1, 3].
    • Temperature: The ambient temperature during the golf session, in Celsius [1, 3].
    • Humidity: The percentage of moisture in the air [1, 3].
    • Windy: A boolean indicator (True/False or 1 for yes, 0 for no) of whether it was windy [1, 3].
    • Crowded-ness: A measure of how busy the golf course was, ranging from 0 to 1 [1, 4].
    • PlayTime-Hour: The duration for which people played golf, in hours [1].
    • Play: Indicates whether golf was played or not (Yes/No) [1].
    • Review: Textual feedback from players about their day at golf [1].
    • EmailCampaign: Text content of emails sent daily by the golf place [1].
    • MaintenanceTasks: Descriptions of work carried out to maintain the golf course [1].
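    A minimal loading sketch, assuming a local CSV copy named golf_play_extended.csv (the actual file names are not listed here):

      # Minimal sketch: how often golf was played under each weather outlook.
      import pandas as pd

      df = pd.read_csv("golf_play_extended.csv", parse_dates=["Date"])
      play_rate = (df.assign(played=df["Play"].eq("Yes"))
                     .groupby("Outlook")["played"]
                     .mean())
      print(play_rate.sort_values(ascending=False))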

    Distribution

    This dataset is organised in a long format, meaning each row represents a single observation [2]. Data files are typically in CSV format, with sample files updated separately to the platform [5]. Specific numbers for rows or records are not currently available within the provided sources. The dataset also includes a special collection of mini datasets within its structure [1].

    Usage

    This dataset is highly versatile and ideal for learning and applying various data science skills:

    • Data Visualisation: Learn to create graphs and identify patterns within the data [1].
    • Predictive Modelling: Discover which data points are useful for predicting if golf will be played [1].
    • Data Cleaning: Practise spotting and managing data that appears incorrect or inconsistent [1].
    • Time Series Analysis: Understand how various factors change over time, such as daily or monthly trends [1, 2].
    • Data Grouping: Learn to combine similar days or observations together [1].
    • Text Analysis: Extract insights from textual features like player reviews, potentially for sentiment analysis or thematic extraction [1, 2].
    • Recommendation Systems: Develop models to suggest optimal times to play golf based on historical data [1].
    • Data Management: Gain experience in managing and analysing data structured in a long format, which is common for repeated measures [2].

    Coverage

    The dataset's regional coverage is global [6]. While the Date column records the day the data was captured or the session occurred, no specific time range for the collected data is stated beyond the listing date of 11/06/2025 [1, 6]. Demographic scope includes unique player IDs [1], but no specific demographic details or data availability notes for particular groups or years are provided.

    License

    CC-BY

    Who Can Use It

    This dataset is designed for a broad audience:

    • New Learners: It is easy to understand and comes with guides to aid the learning process [1].
    • Teachers: An excellent resource for conducting classes on data visualisation and interpretation [1].
    • Researchers: Suitable for testing novel data analysis methodologies [1].
    • Students: Can acquire a wide range of skills, from making graphs to understanding textual data and building recommendation systems [1].

    Dataset Name Suggestions

    • Golf Play Extended Analytics
    • Advanced Golf Session Data
    • Long Format Golf Insights
    • Multi-feature Golf Play Dataset
    • Textual Golf Data for Learning

    Attributes

    Original Data Source: ⛳️ Golf Play Dataset Extended

  12. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    csv, text/markdown, json, bin. Available download formats
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.
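    To make the structure above concrete, here is a minimal sketch that loads the train and store tables with pandas, joins them on Store, and fits a baseline model. The local file names train.csv and store.csv are assumptions (the deposited files are served via the DBRepo API), and the feature handling is deliberately simplified.

      # Minimal sketch: join sales history with store metadata and fit a baseline model.
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor

      train = pd.read_csv("train.csv", parse_dates=["Date"])  # assumed local copies
      store = pd.read_csv("store.csv")
      df = train.merge(store, on="Store", how="left")         # one row per store-day

      df = df[df["Open"] == 1]                                 # closed days have no sales
      features = ["Store", "Promo", "SchoolHoliday", "CompetitionDistance"]
      X = df[features].fillna(0)
      y = df["Sales"]

      model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
      print(model.predict(X.head()))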

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  13. Data from: Ask-the-Expert: Minimizing Human Review for Big Data Analytics...

    • cloud.csiss.gmu.edu
    html
    Updated Jan 29, 2020
    Cite
    United States (2020). Ask-the-Expert: Minimizing Human Review for Big Data Analytics through Active Learning [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/ask-the-expert-minimizing-human-review-for-big-data-analytics-through-active-learning
    Explore at:
    html. Available download formats
    Dataset updated
    Jan 29, 2020
    Dataset provided by
    United States
    Description

    In order to learn the operational significance of anomalies using active learning, we will first get a ranked list of statistically significant anomalies by running a data-driven anomaly detection method such as NASA’s Multiple Kernel Anomaly Detection (MKAD) or Inductive Monitoring System (IMS) on our dataset. A very small percentage of these anomalies (~5) will then be given to an SME to assess their operational significance. We will build a classifier using only these few labeled examples. We plan to incorporate SME rationale by engineering new features as conjunctions and disjunctions of original features into the iterative learning process. Data points about which the classifier is most uncertain will then be presented interactively to the SME, and the classifier will be updated after each input from the SME. The process will continue until a desired accuracy is reached or the expert has analyzed ‘enough’ examples.

  14. Utrecht Housing / Dutch housing market

    • opendatabay.com
    Updated Feb 28, 2025
    Cite
    Vdt. Data (2025). Utrecht Housing / Dutch housing market [Dataset]. https://www.opendatabay.com/data/financial/3b2c2355-46d1-448b-ac33-22523e89212a
    Explore at:
    Available download formats
    Dataset updated
    Feb 28, 2025
    Dataset authored and provided by
    Vdt. Data
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Utrecht, Netherlands, Urban Planning & Infrastructure
    Description

    The Utrecht Housing Dataset is a synthetic dataset designed for students and practitioners to learn about data science and machine learning. Derived from the Dutch housing market, it is high-quality and noise-free, making it suitable for multiple algorithms such as decision trees, linear regression, logistic regression, and neural networks. This dataset was specifically created for educational purposes and emphasises responsible AI by being accessible to learners with diverse academic backgrounds.

    Dataset Features:

    • id: Unique identifier for each house, ranging from 0 to 100,000 (not used in algorithms).
    • zipcode: Zip code of the house's location, indicating its area. Possible values: 3520, 3525, 3800.
    • lot-len: Length of the house plot in meters, ranging from 5.0 to 100.0.
    • lot-width: Width of the house plot in meters, ranging from 5.0 to 100.0.
    • lot-area: Total area of the house plot in square meters, derived from lot-len * lot-width.
    • house-area: The living area of the house in square meters (e.g., 30.0 for small houses, 200.0 for mansions).
    • garden-size: The size of the garden in square meters, with larger gardens being desirable.
    • balcony: Number of balconies (common values: 0, 1, 3).
    • x-coor: X-coordinate of the house's location (range: 2000 to 3000).
    • y-coor: Y-coordinate of the house's location (range: 5000 to 6000).
    • buildyear: The year the house was built (from as early as 1100 to modern times).
    • bathrooms: Number of bathrooms (common values: 1, 2, or 3).

    Output/Target Features:

    • tax value: Estimated value of the house for taxation, ranging from 50,000 to 1,000,000 euros.
    • retail value: The market value of the house, also ranging from 50,000 to 1,000,000 euros.
    • energy-eff: Binary indicator (0 or 1) of whether the house is energy-efficient.
    • monument: Binary indicator (0 or 1) of whether the house has architectural or historical monumental value.
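    A minimal sketch of the kind of regression task these features support follows; the file name utrecht_housing.csv and the exact column spellings are assumptions.

      # Minimal sketch: predict the tax value from a few numeric features.
      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("utrecht_housing.csv")
      features = ["lot-area", "house-area", "garden-size", "buildyear", "bathrooms"]
      X, y = df[features], df["tax value"]

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
      model = LinearRegression().fit(X_train, y_train)
      print("R^2 on held-out houses:", model.score(X_test, y_test))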

    Usage:

    The dataset is ideal for:

    • Machine Learning Applications: Training and testing predictive models for tax valuation, market value, and energy efficiency.
    • Feature Analysis: Exploring the relationships between housing attributes and target values.
    • Educational Purposes: Teaching students about regression, classification, and feature engineering.
    • Visualisation: Creating plots and graphs due to the well-structured and interpretable data.

    Coverage:

    The dataset provides a comprehensive representation of housing features relevant to the Dutch market, ensuring high usability for educational and experimental projects.

    License:

    CC0 (Public Domain)

    Who Can Use It:

    This dataset is designed for students, researchers, data scientists, and machine learning practitioners seeking to explore real-world applications of AI in housing markets.

    How to Use It:

    • Develop predictive models for tax and retail value estimation.
    • Evaluate housing energy efficiency or monumental status using classification techniques.
    • Explore feature importance to understand what drives housing value.
    • Benchmark machine learning algorithms on a synthetic, high-quality dataset.
  15. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy; it’s also about having the right skill set, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with high expectations, requiring applicants to possess a comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires comprehensive technical preparation. Job seekers are often required to demonstrate their technical skills to show they can effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're comfortable with data manipulation tools like Pandas, as well as data visualization libraries like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in the use of SQL language to extract and process data from databases.

    5 Feature Engineering

    Understand and know the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.
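    As a small worked example of these metrics (illustrative values only, not tied to any dataset above), scikit-learn computes them directly from predictions:

      # Illustrative only: the metrics named above, computed with scikit-learn.
      from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

      y_true = [1, 0, 1, 1, 0, 1, 0, 0]
      y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

      print("accuracy ", accuracy_score(y_true, y_pred))
      print("precision", precision_score(y_true, y_pred))
      print("recall   ", recall_score(y_true, y_pred))
      print("F1       ", f1_score(y_true, y_pred))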

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

  16. AI Question Answering Data

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). AI Question Answering Data [Dataset]. https://www.opendatabay.com/data/ai-ml/d3c37fed-f830-444b-a988-c893d3396fd7
    Explore at:
    Available download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset provides essential information for entries related to question answering tasks using AI models. It is designed to offer valuable insights for researchers and practitioners, enabling them to effectively train and rigorously evaluate their machine learning models. The dataset serves as a valuable resource for building and assessing question-answering systems. It is available free of charge.

    Columns

    • instruction: Contains the specific instructions given to a model to generate a response.
    • responses: Includes the responses generated by the model based on the given instructions.
    • next_response: Provides the subsequent response from the model, following a previous response, which facilitates a conversational interaction.
    • answer: Lists the correct answer for each question presented in the instruction, acting as a reference for assessing the model's accuracy.
    • is_human_response: A boolean column that indicates whether a particular response was created by a human or by a machine learning model, helping to differentiate between the two. Out of nearly 19,300 entries, 254 are human-generated responses, while 18,974 were generated by models.
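    A minimal sketch of splitting these entries by origin, assuming the train.csv file mentioned under Distribution below and a boolean is_human_response column as described:

      # Minimal sketch: separate human-written and model-generated responses.
      import pandas as pd

      df = pd.read_csv("train.csv")
      human = df[df["is_human_response"]]        # the 254 human-generated entries
      machine = df[~df["is_human_response"]]     # the ~18,974 model-generated entries
      print(len(human), len(machine))
      print(human[["instruction", "answer"]].head())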

    Distribution

    The data files are typically in CSV format, with a dedicated train.csv file for training data and a test.csv file for testing purposes. The training file contains a large number of examples. Specific dates and exact row counts are not included in this dataset description, which focuses on the content and purpose of the data.

    Usage

    This dataset is ideal for a variety of applications and use cases:

    • Training and Testing: Utilise train.csv to train question-answering models or algorithms, and test.csv to evaluate their performance on unseen questions.
    • Machine Learning Model Creation: Develop machine learning models specifically for question answering by leveraging the instructional components, including instructions, responses, next responses, and human-generated answers, along with their is_human_response labels.
    • Model Performance Evaluation: Assess model performance by comparing predicted responses with actual human-generated answers from the test.csv file.
    • Data Augmentation: Expand existing data by paraphrasing instructions or generating alternative responses within similar contexts.
    • Conversational Agents: Build conversational agents or chatbots by utilising the instruction-response pairs for training.
    • Language Understanding: Train models to understand language and generate responses based on instructions and previous responses.
    • Educational Materials: Develop interactive quizzes or study guides, with models providing instant feedback to students.
    • Information Retrieval Systems: Create systems that help users find specific answers from large datasets.
    • Customer Support: Train customer support chatbots to provide quick and accurate responses to inquiries.
    • Language Generation Research: Develop novel algorithms for generating coherent responses in question-answering scenarios.
    • Automatic Summarisation Systems: Train systems to generate concise summaries by understanding main content through question answering.
    • Dialogue Systems Evaluation: Use the instruction-response pairs as a benchmark for evaluating dialogue system performance.
    • NLP Algorithm Benchmarking: Establish baselines against which other NLP tools and methods can be measured.

    Coverage

    The dataset's geographic scope is global. There is no specific time range or demographic scope noted within the available details, as specific dates are not included.

    License

    CC0

    Who Can Use It

    This dataset is highly suitable for:

    • Researchers and Practitioners: To gain insights into question answering tasks using AI models.
    • Developers: To train models, create chatbots, and build conversational agents.
    • Students: For developing educational materials and enhancing their learning experience through interactive tools.
    • Individuals and teams working on Natural Language Processing (NLP) projects.
    • Those creating information retrieval systems or customer support solutions.
    • Experts in natural language generation (NLG) and automatic summarisation systems.
    • Anyone involved in the evaluation of dialogue systems and machine learning model training.

    Dataset Name Suggestions

    • AI Question Answering Data
    • Conversational AI Training Data
    • NLP Question-Answering Dataset
    • Model Evaluation QA Data
    • Dialogue Response Dataset

    Attributes

    Original Data Source: Question-Answering Training and Testing Data

  17. ‘Gufhtugu Publications Dataset Challenge’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Gufhtugu Publications Dataset Challenge’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-gufhtugu-publications-dataset-challenge-0764/0bd8674f/?iid=006-565&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Gufhtugu Publications Dataset Challenge’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/gufhtugu-publications-dataset-challenge on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a one-of-its-kind book sales dataset from Pakistan. It contains 20,000 book orders from January 2019 to January 2021. The data was collected from the merchant (Gufhtugu Publications, www.Gufhtugu.com), which is a partner in this research study. There is a dire need for such a dataset to learn about Pakistan’s emerging e-commerce potential, and I hope this will help many startups in many ways.

    Content

    Geography: Pakistan

    Time period: 01/2019 – 01/2021

    Unit of analysis: E-Commerce Orders

    Dataset: The dataset contains detailed information on 200,000 online book orders in Pakistan from January 2019 to January 2021. It contains the order number, order status (completed, cancelled, returned), order date and time, book name, and city address. This is the most detailed dataset about e-commerce orders in Pakistan that you can find in the public domain.

    Variables: The dataset contains order number, order status, book name, order date, order time and city of the customer.

    Size: 1.5 MB

    File Type: CSV

    Acknowledgements

    I would like to thank all the startups that are trying to make their mark in Pakistan despite the unavailability of research data. Thanks to Gufhtugu Publications (www.Gufhtugu.com) for allowing me to run this challenge.

    Inspiration

    I’d like to call on my fellow Kagglers to use machine learning and data science to help me explore these ideas (a small pandas sketch follows the list):

    • What is the best-selling book?
    • Visualize order status frequency
    • Find a correlation between date and time and order status
    • Find a correlation between city and order status
    • Find any hidden patterns that are counter-intuitive for a layman
    • Can we predict the number of orders, or book names, in advance?
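    For the first two ideas, a minimal pandas sketch could look like the following; the file name orders.csv and the column headers are assumptions, since the description lists the variables but not the exact headers.

      # Minimal sketch: best-selling books and order-status frequency (assumed headers).
      import pandas as pd

      orders = pd.read_csv("orders.csv")
      completed = orders[orders["order status"] == "completed"]
      print(completed["book name"].value_counts().head(10))  # best-selling books
      print(orders["order status"].value_counts())           # order status frequency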

    --- Original source retains full ownership of the source dataset ---

  18. Telco_Customer_churn_Data

    • test.researchdata.tuwien.at
    bin, csv, png
    Updated Apr 28, 2025
    Cite
    Erum Naz (2025). Telco_Customer_churn_Data [Dataset]. http://doi.org/10.82556/b0ch-cn44
    Explore at:
    png, csv, bin. Available download formats
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Erum Naz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Context and Methodology

    The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).

    The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
    The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.

    The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.

    Technical Details

    The dataset has a tabular structure and was initially stored in CSV format. It contains:

    • Rows: 7,043 customer records

    • Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).

    Naming Convention:

    • The table in the database is named telco_customer_churn_data.

    Software Requirements:

    • To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python with a MariaDB-compatible driver).

    • For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used (see the sketch after this list).
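
    The following is a minimal, hedged sketch of that workflow, assuming the table has been exported to a local CSV; the file name and the simple one-hot feature encoding are assumptions, not part of the dataset documentation.

        import pandas as pd
        import joblib
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import classification_report

        # Assumed local export of the telco_customer_churn_data table.
        df = pd.read_csv("telco_customer_churn_data.csv")

        # One-hot encode the non-numeric features; "Churn" (Yes/No) is the target.
        X = pd.get_dummies(df.drop(columns=["Churn"]))
        y = (df["Churn"] == "Yes").astype(int)

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )

        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train)
        print(classification_report(y_test, model.predict(X_test)))

        # Persist the fitted model for later reuse.
        joblib.dump(model, "churn_model.joblib")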

    Additional Resources:

    Further Details

    When reusing the dataset, users should be aware:

    • Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    • Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).

    • Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.

  19. Question Classification: Android or iOS?

    • opendatabay.com
    Updated Jun 27, 2025
    + more versions
    Cite
    Datasimple (2025). Question Classification: Android or iOS? [Dataset]. https://www.opendatabay.com/data/ai-ml/26d2a278-3fe1-435d-95a8-0dc936a0b351
    Explore at:
    Available download formats
    Dataset updated
    Jun 27, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Software and Technology
    Description

    Context

    Imagine you have to process bug reports about an application your company is developing, which is available for both Android and iOS. Could you find a way to automatically classify them so you can send them to the right support team?

    Content

    The dataset contains data from two StackExchange forums: Android Enthusiasts and Ask Different (Apple). I pre-processed both datasets from the raw XML files retrieved from the Internet Archive so that they only contain information useful for building machine learning classifiers. For the Apple forum, I narrowed the questions down to the subset tagged "iOS", "iPhone", or "iPad".

    Think of this as a fun way to learn to build ML classifiers! The training, validation and test sets are all available, but in order to build robust models please try to use the test set as little as possible (only as a last validation for your models).
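
    As a hedged starting point, the sketch below shows one way to build such a classifier with scikit-learn; the file names and the "title"/"label" column names are assumptions about the pre-processed splits, not documented fields.

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.metrics import accuracy_score

        # Assumed file and column names for the pre-processed splits.
        train = pd.read_csv("train.csv")   # assumed columns: "title", "label" (android / ios)
        valid = pd.read_csv("valid.csv")

        # TF-IDF features over question titles feeding a linear classifier.
        clf = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), min_df=2),
            LogisticRegression(max_iter=1000),
        )
        clf.fit(train["title"], train["label"])

        # Tune against the validation split; keep the test set for one final check only.
        print("validation accuracy:",
              accuracy_score(valid["label"], clf.predict(valid["title"])))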

    Acknowledgements

    The image was retrieved from Unsplash and was made by @thenewmalcolm. Link to image here.

    The data was made available for free under a CC-BY-SA 4.0 license by StackExchange and hosted by Internet Archive. Find it here.

    License

    CC-BY-SA

    Original Data Source: Question Classification: Android or iOS?

  20. Data for: Integrating open education practices with data analysis of open...

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Jul 27, 2024
    + more versions
    Cite
    Marja Bakermans (2024). Data for: Integrating open education practices with data analysis of open science in an undergraduate course [Dataset]. http://doi.org/10.5061/dryad.37pvmcvst
    Explore at:
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marja Bakermans
    Description

    The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...

    Article and dataset fairness

    To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored ‘1’ or ‘0’ based on whether it met that criterion, with a total possible score of ten.

    Open grading policies

    Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined whether assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...

    Data for: Integrating open education practices with data analysis of open science in an undergraduate course

    Author: Marja H Bakermans
    Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA
    ORCID: https://orcid.org/0000-0002-4879-7771
    Institutional IRB approval: IRB-24-0314

    Data and file overview

    The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available

    1. BestPracticesData.csv
      • Description: Data to assess the adherence of articles and datasets to open science best practices.
      • Column headers and descriptions:
        • Article: articles used in the study, numbered randomly
        • F1: Findable, Data are assigned a unique and persistent doi
        • F2: Findable, Metadata includes an identifier of data
        • F3: Findable, Data are registered in a searchable database
        • A1: ...
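
    To make the scoring procedure described above concrete, here is a minimal pandas sketch that totals the best-practice indicators per article; it assumes BestPracticesData.csv holds one 0/1 column per criterion alongside the Article identifier, which is an inference from the column descriptions rather than a documented layout.

        import pandas as pd

        # Assumed layout: one row per article, one 0/1 indicator column per best practice.
        best = pd.read_csv("BestPracticesData.csv")

        # Sum the indicator columns (everything except the article identifier)
        # to get each article/dataset pair's fairness score out of ten.
        indicator_cols = [c for c in best.columns if c != "Article"]
        best["fairness_score"] = best[indicator_cols].sum(axis=1)

        print(best[["Article", "fairness_score"]].sort_values("fairness_score", ascending=False))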