58 datasets found

d
Python Script for Cleaning Alum Dataset
search.dataone.org
hydroshare.org
Updated Oct 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e
Explore at:
Dataset updated
Oct 18, 2025
Dataset provided by
Hydroshare
Authors
saikumar payyavula; Jeff Sadler
Description
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
Ecommerce Dataset for Data Analysis
kaggle.com
zip
Updated Sep 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Explore at:
zip(2028853 bytes)Available download formats
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.

Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning
t
Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...
researchdata.tuwien.at
html, pdf, zip
Updated Mar 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
Explore at:
html, zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.48436/m2ha4-t1v92
Dataset updated
Mar 19, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How To Cite?

Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

Folder Structure

The folder named “submission” contains the following:

“pythonProject”: This folder contains all the Python files and subfolders needed for analysis.

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

This folder contains the data used for the results reported in the paper.

Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

This folder will be generated when you run the code and will store the output of each step.

The current folder contains results created during code debugging for the submission.

When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

Contains reusable functions used throughout the analysis.

Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

Generates scatter plots like those in Figure 3 of the paper.

Although the code has been run for all 309 trials, it can be used to check the sample data provided.

Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.

Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

Implements overlapping sliding window segmentation and generates plots like those in Figure 4.

Output:

Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.

Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.

A visualization of the segments, similar to Figure 4, will be automatically generated.

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

These files compute features as explained in Tables 1 and 2 of the paper, respectively.

They process the segmented recordings generated by the overlapping_sliding_window_loop.py.

Usage: Just to know how the features are calculated, you can run this code after the segmentation with the sliding window and run these files to calculate the features from the segmented data.

5. training_prediction.py

This file contains the main machine learning analysis of the paper. This file contains all the code for the training of the model, its evaluation, and its use for the inference of the “monitoring part”. It covers the following steps:

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.

A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.

b. Training/Validation/Test Split

Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).

Make sure that you follow the instructions in the comments to the code exactly.

Output: The split data is saved as .csv files in the results folder.

c. Machine and Deep Learning Experiments

This part contains three main code blocks:

iii. One for the XGboost code with correct hyperparameter tuning:
Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically test the confidence threshold of

MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.

XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.

XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

d. Inference (Monitoring Part)

Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.

Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.

This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY, the code is licensed under MIT.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
zip(1262 bytes)Available download formats
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
S1 Data -
plos.figshare.com
zip
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.s001
Dataset updated
Oct 11, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving the early warning of customer churn based on machine learning algorithms, can help enterprises provide targeted marketing strategies and personalized services, and save a lot of operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations are done on 900,000 telecom customer personal characteristics and historical behavior data set based on Python language. Appropriate model parameters were selected to build BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, the two classic ensemble learning models were introduced, and the Adaboost dual-ensemble learning model with RF as the base learner was put forward. The four models and the other four classical machine learning models-decision tree, naive Bayes, K-Nearest Neighbor (KNN), Support Vector Machine (SVM) were utilized respectively to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score and other indicators, and the RF-Adaboost dual-ensemble model has the best performance. Among them, the recall rates of BPNN, RF, Adaboost and RF-Adaboost dual-ensemble model on positive samples are respectively 79%, 90%, 89%,93%, the precision rates are 97%, 99%, 98%, 99%, and the F1 scores are 87%, 95%, 94%, 96%. The RF-Adaboost dual-ensemble model has the best performance, and the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.

BI intro to data cleaning eda and machine learning

kaggle.com

zip

Updated Nov 17, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions

Explore at:

zip(9961 bytes)Available download formats

Dataset updated

Nov 17, 2025

Authors

Walekhwa Tambiti Leo Philip

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Real-World Data Science Challenge

Business Intelligence Program Strategy — Student Success Optimization

Hosted by: Walsoft Computer Institute 📁 Download dataset 👤 Kaggle profile

Background

Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

Admissions decision-making
Academic support strategies
Overall program impact and ROI

Your Mission

Answer this central question:

“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

Key Strategic Areas

You are required to analyze and provide actionable insights for the following three areas:

1. Admissions Optimization

Should entry exams remain the primary admissions filter?

Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

✅ Deliverables:

Feature importance ranking for predicting Python and DB scores
Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
Business rationale and risk analysis

2. Curriculum Support Strategy

Are there at-risk student groups who need extra support?

Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

✅ Deliverables:

At-risk segment identification
Support program design (e.g., prep course, mentoring)
Expected outcomes, costs, and KPIs

3. Resource Allocation & Program ROI

How can we allocate resources for maximum student success?

Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

✅ Deliverables:

Performance drivers
Student segmentation
Resource allocation plan and ROI projection

🛠️ Dataset Overview

Column	Description
`fNAME`, `lNAME`	Student first and last name
`Age`	Student age (21–71 years)
`gender`	Gender (standardized as "Male"/"Female")
`country`	Student’s country of origin
`residence`	Student housing/residence type
`entryEXAM`	Entry test score (28–98)
`prevEducation`	Prior education (High School, Diploma, etc.)
`studyHOURS`	Total study hours logged
`Python`	Final Python exam score
`DB`	Final Database exam score

📊 Dataset

You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

Raw Dataset (Recommended for Full Project)

Download: bi.csv

This dataset includes common data quality challenges:

Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using column mean or suitable imputation strategy

Participants using this dataset are expected to apply data cleaning techniques such as: - String standardization - Null value imputation - Type correction (e.g., scores as float) - Validation and visual verification

✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

Cleaned Dataset (Optional Shortcut)

Download: cleaned_bi.csv

This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...

Finding_And_Visualizing_Missing_Data_Python
kaggle.com
zip
Updated Nov 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr. Nagendra (2025). Finding_And_Visualizing_Missing_Data_Python [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/finding-and-visualizing-missing-data-python
Explore at:
zip(371581 bytes)Available download formats
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries.
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
Confusion matrix.
plos.figshare.com
xls
Updated Oct 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t005
Dataset updated
Oct 11, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving the early warning of customer churn based on machine learning algorithms, can help enterprises provide targeted marketing strategies and personalized services, and save a lot of operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations are done on 900,000 telecom customer personal characteristics and historical behavior data set based on Python language. Appropriate model parameters were selected to build BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, the two classic ensemble learning models were introduced, and the Adaboost dual-ensemble learning model with RF as the base learner was put forward. The four models and the other four classical machine learning models-decision tree, naive Bayes, K-Nearest Neighbor (KNN), Support Vector Machine (SVM) were utilized respectively to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score and other indicators, and the RF-Adaboost dual-ensemble model has the best performance. Among them, the recall rates of BPNN, RF, Adaboost and RF-Adaboost dual-ensemble model on positive samples are respectively 79%, 90%, 89%,93%, the precision rates are 97%, 99%, 98%, 99%, and the F1 scores are 87%, 95%, 94%, 96%. The RF-Adaboost dual-ensemble model has the best performance, and the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
Google Ads sales dataset
kaggle.com
Updated Jul 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NayakGanesh007 (2025). Google Ads sales dataset [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/google-ads-sales-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 22, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
NayakGanesh007
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned) 📝 Dataset Overview This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data — including typos, formatting issues, missing values, and inconsistencies.

It is ideal for practicing:

Data cleaning

Exploratory Data Analysis (EDA)

Marketing analytics

Campaign performance insights

Dashboard creation using tools like Excel, Python, or Power BI

📁 Columns in the Dataset Column Name ----- -Description Ad_ID --------Unique ID of the ad campaign Campaign_Name ------Name of the campaign (with typos and variations) Clicks --Number of clicks received Impressions --Number of ad impressions Cost --Total cost of the ad (in ₹ or $ format with missing values) Leads ---Number of leads generated Conversions ----Number of actual conversions (signups, sales, etc.) Conversion Rate ---Calculated conversion rate (Conversions ÷ Clicks) Sale_Amount ---Revenue generated from the conversions Ad_Date------ Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY) Location ------------City where the ad was served (includes spelling/case variations) Device------------ Device type (Mobile, Desktop, Tablet with mixed casing) Keyword ----------Keyword that triggered the ad (with typos)

⚠️ Data Quality Issues (Intentional) This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:

Inconsistent date formats

Spelling errors (e.g., "analitics", "anaytics")

Duplicate rows

Mixed units and symbols in cost/revenue columns

Missing values

Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")

🎯 Use Cases Data cleaning exercises in Python (Pandas), R, Excel

Data preprocessing for machine learning

Campaign performance analysis

Conversion optimization tracking

Building dashboards in Power BI, Tableau, or Looker

💡 Sample Analysis Ideas Track campaign cost vs. return (ROI)

Analyze click-through rates (CTR) by device or location

Clean and standardize campaign names and keywords

Investigate keyword performance vs. conversions

🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
Sample Park Analysis
figshare.com
zip
Updated Nov 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.30509021.v1
Dataset updated
Nov 2, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Eric Delmelle
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
README – Sample Park Analysis## OverviewThis repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.## Contents- sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)- Hope_Park_original.csv — Source dataset containing park information- README.md — Documentation for the contents and usage## Usage1. Open the notebook in Google Colab or Jupyter.2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).3. Run each cell sequentially to reproduce the analysis.## RequirementsThe notebook uses standard Python data science libraries:```pythonpandasnumpymatplotlibseaborn
h
security-cve
huggingface.co
Updated May 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
wanghaoyi (2025). security-cve [Dataset]. https://huggingface.co/datasets/whywhywhywhy/security-cve
Explore at:
Dataset updated
May 9, 2025
Authors
wanghaoyi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for security-cve Dataset Description The security-cve dataset is a cleaned and optimized version of the original ReposVul dataset, a high-quality collection of 6,134 CVE entries across 1,491 projects in C, C++, Java, and Python. It provides multi-granularity vulnerability information, from repository-level to line-level. The cleaning process enhanced data quality, making it suitable for training and evaluating machine learning models for vulnerability detection. Data Structure… See the full description on the dataset page: https://huggingface.co/datasets/whywhywhywhy/security-cve.
d
Joint commitment in human cooperative hunting through an “Imagined We”
datadryad.org
zip
Updated Aug 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ning Tang; Siyi Gong; Minglu Zhao; Jifan Zhou; Mowei Shen; Tao Gao (2025). Joint commitment in human cooperative hunting through an “Imagined We” [Dataset]. http://doi.org/10.5061/dryad.brv15dvjn
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.brv15dvjn
Dataset updated
Aug 1, 2025
Dataset provided by
Dryad
Authors
Ning Tang; Siyi Gong; Minglu Zhao; Jifan Zhou; Mowei Shen; Tao Gao
Time period covered
Sep 3, 2024
Description
Cooperation involves the challenge of jointly selecting one from multiple goals while maintaining the team’s joint commitment to it. We test joint commitment in a multi-player hunting game, combining psychophysics and computational modeling. Joint commitment is modeled through an "Imagined We" (IW) approach, where each agent uses Bayesian inference to infer the intention of “We”, an imagined supraindividual agent controlling all agents as its body parts. This is compared against a Reward Sharing (RS) model, which frames cooperation through reward sharing via multi-agent reinforcement learning (MARL). Both humans and IW, but not RS, maintained high performance by jointly committing to a single prey, regardless of prey quantity or speed. Human observers rated all hunters in both human and IW teams as making high contributions to the catch, regardless of their proximity to the prey, suggesting that high-quality hunting stemmed from sophisticated cooperation rather than individual strategie...
d
Joint commitment in human cooperative hunting through an â€œImagined Weâ€
search.dataone.org
Updated Aug 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ning Tang; Siyi Gong; Minglu Zhao; Jifan Zhou; Mowei Shen; Tao Gao (2025). Joint commitment in human cooperative hunting through an â€œImagined Weâ€ [Dataset]. http://doi.org/10.5061/dryad.brv15dvjn
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.brv15dvjn
Dataset updated
Aug 2, 2025
Dataset provided by
Dryad Digital Repository
Authors
Ning Tang; Siyi Gong; Minglu Zhao; Jifan Zhou; Mowei Shen; Tao Gao
Description
Cooperation involves the challenge of jointly selecting one from multiple goals while maintaining the teamâ€™s joint commitment to it. We test joint commitment in a multi-player hunting game, combining psychophysics and computational modeling. Joint commitment is modeled through an "Imagined We" (IW) approach, where each agent uses Bayesian inference to infer the intention of â€œWeâ€ , an imagined supraindividual agent controlling all agents as its body parts. This is compared against a Reward Sharing (RS) model, which frames cooperation through reward sharing via multi-agent reinforcement learning (MARL). Both humans and IW, but not RS, maintained high performance by jointly committing to a single prey, regardless of prey quantity or speed. Human observers rated all hunters in both human and IW teams as making high contributions to the catch, regardless of their proximity to the prey, suggesting that high-quality hunting stemmed from sophisticated cooperation rather than individual strategie..., Data Collection:The dataset was collected through offline laboratory experiments involving human subjects, machine simulations, and human-machine collaboration. The machine simulations and collaborative tasks were based on models implemented in Python, including reinforcement learning, neural networks, and Bayesian inference. Participants engaged in tasks designed to assess cognitive processes, with data recorded in controlled laboratory conditions. Data Processing:The collected data was processed using Pythonâ€™s pandas library. This involved data cleaning, transformation, and preparation for further analysis. For statistical analysis, we used the Jeffreysâ€™s Amazing Statistics Program (JASP) software to perform various statistical tests., , # Joint Commitment in Human Cooperative Hunting through an Imagined We

https://doi.org/10.5061/dryad.brv15dvjn

Location of the Data and Code

The data and code are compressed in the file â€˜imaginedWeCodeRelease.zipâ€™, which can be downloaded from the â€˜Filesâ€™ tab.

Description of the data and file structure

Data Collection:

The dataset was collected through offline laboratory experiments involving human subjects, machine simulations, and human-machine collaboration . The machine simulations and collaborative were based on models implemented in Python, including reinforcement learning, neural networks, and Bayesian inference. Participants engaged in tasks designed to assess cognitive processes, with data recorded in controlled laboratory conditions.

Data Processing:

The collected data was processed using Pythons pandas library. This involved data cleaning, transformation, and preparation for further analysis. For statistical ..., All human subjects data included in this dataset were collected with the explicit informed consent of participants, including their consent to publish the de-identified data in the public domain.

To ensure anonymity, participants are identified only by the time of their participation (e.g., "20221221-1550"). The data content and file names do not contain any personally identifiable information or demographic details such as names, gender, age, or other identifying characteristics.
BPNNs’ parameters and experimental results comparison table.
plos.figshare.com
xls
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). BPNNs’ parameters and experimental results comparison table. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t001
Dataset updated
Oct 11, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BPNNs’ parameters and experimental results comparison table.
myntra_fashion_dataset
kaggle.com
zip
Updated Dec 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abolude Shina (2022). myntra_fashion_dataset [Dataset]. https://www.kaggle.com/datasets/aboludeshina/myntra-fashion-dataset
Explore at:
zip(2444019 bytes)Available download formats
Dataset updated
Dec 16, 2022
Authors
Abolude Shina
Description
Given 2 datasets, analyse the dataset. All the steps undertaken from importing data, cleaning, preparing data, and analysing data should be clearly mentioned. Appropriate visualisations and report of the results of analysis should be provided.

• On the given datasets, identify the questions that you would like to answer through data analysis. • Given two datasets, create a new dataset for analysis. • Perform data cleaning and pre-processing tasks on the new dataset. • Use Machine learning techniques to analyse data. • Perform visualisation using Python.
f
The parameters and experimental results comparison of RF-Adaboost models.
plos.figshare.com
xls
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). The parameters and experimental results comparison of RF-Adaboost models. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t004
Dataset updated
Oct 11, 2023
Dataset provided by
PLOS ONE
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The parameters and experimental results comparison of RF-Adaboost models.
Adaboost models’ parameters and experimental results comparison.
plos.figshare.com
xls
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). Adaboost models’ parameters and experimental results comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t003
Dataset updated
Oct 11, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Adaboost models’ parameters and experimental results comparison.
NYC_building_energy_data
kaggle.com
zip
Updated Mar 4, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maksym Dubovyi (2020). NYC_building_energy_data [Dataset]. https://www.kaggle.com/maxbrain/nyc-building-energy-data
Explore at:
zip(9476304 bytes)Available download formats
Dataset updated
Mar 4, 2020
Authors
Maksym Dubovyi
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Area covered
New York
Description
In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here and the one sentence summary is:

Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.

This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.

Supervised problem: we are given both the features and the target Regression problem: the target is a continous variable, in this case ranging from 0-100 During training, we want the model to learn the relationship between the features and the score so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!

Machine Learning Workflow Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:

Data cleaning and formatting Exploratory data analysis Feature engineering and selection Establish a baseline and compare several machine learning models on a performance metric Perform hyperparameter tuning on the best model to optimize it for the problem Evaluate the best model on the testing set Interpret the model results to the extent possible Draw conclusions and write a well-documented report Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure and so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel like once you have finished one step the first time, you cannot go back and make improvements!

This notebook will cover the first three (and a half) steps of the pipeline with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low-level, but have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and Tensorflow by Aurelion Geron.

With this outline in place to guide us, let's get started!
f
Data Sheet 9_Prediction of outpatient rehabilitation patient preferences and...
frontiersin.figshare.com
xlsx
Updated Jan 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 9_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.xlsx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s010
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.3389/frai.2024.1473837.s010
Dataset updated
Jan 15, 2025
Dataset provided by
Frontiers
Authors
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundThe Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.MethodsData were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics, including accuracy and receiver operating characteristic (ROC) curves. The imbalanced learning library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.ResultsXGBoost achieved the highest overall accuracy of 80.21% with high precision and recall in Category 1. random forest showed a similar overall accuracy. Logistic Regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.ConclusionThis study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.

Facebook

Twitter

Click to copy link

Link copied

Cite

saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e

Python Script for Cleaning Alum Dataset

Explore at:

Dataset updated

Oct 18, 2025

Dataset provided by

Hydroshare

Authors

saikumar payyavula; Jeff Sadler

Description

This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.

Clear search

Close search

Google apps

Main menu

Python Script for Cleaning Alum Dataset

Ecommerce Dataset for Data Analysis

Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

How To Cite?

Folder Structure

Setting Up the Environment

Subfolders

1. Data_4_IJGIS

2. results_[DateTime] (e.g., results_20240906_15_00_13)

Python Files

1. helper_functions.py

2. create_sanity_plots.py

3. overlapping_sliding_window_loop.py

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

5. training_prediction.py

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

b. Training/Validation/Test Split

c. Machine and Deep Learning Experiments

d. Inference (Monitoring Part)

6. sequence_analysis.py

Licenses

Reddit r/AskScience Flair Dataset

Medical Clean Dataset

S1 Data -

BI intro to data cleaning eda and machine learning

Real-World Data Science Challenge

Business Intelligence Program Strategy — Student Success Optimization

Background

Your Mission

Key Strategic Areas

1. Admissions Optimization

2. Curriculum Support Strategy

3. Resource Allocation & Program ROI

🛠️ Dataset Overview

📊 Dataset

Raw Dataset (Recommended for Full Project)

Cleaned Dataset (Optional Shortcut)

Finding_And_Visualizing_Missing_Data_Python

Confusion matrix.

Google Ads sales dataset

Sample Park Analysis

security-cve

Joint commitment in human cooperative hunting through an “Imagined We”

Joint commitment in human cooperative hunting through an â€œImagined Weâ€

Location of the Data and Code

Description of the data and file structure

BPNNs’ parameters and experimental results comparison table.

myntra_fashion_dataset

The parameters and experimental results comparison of RF-Adaboost models.

Adaboost models’ parameters and experimental results comparison.

NYC_building_energy_data

Data Sheet 9_Prediction of outpatient rehabilitation patient preferences and...

Python Script for Cleaning Alum Dataset