35 datasets found

Cleaning Practice with Errors & Missing Values
kaggle.com
Updated Jun 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zuhair khan (2025). Cleaning Practice with Errors & Missing Values [Dataset]. https://www.kaggle.com/datasets/zuhairkhan13/cleaning-practice-with-errors-and-missing-values
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Zuhair khan
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.

It includes 500 rows of simulated employee data with intentional errors such as:

Missing values in Age and Salary

Typos in email addresses (@gamil.com)

Inconsistent city name casing (e.g., lahore, Karachi)

Extra spaces in department names (e.g., " HR ")

✅ Skills You Can Practice:

Detecting and handling missing data

String cleaning and formatting

Removing duplicates

Validating email formats

Standardizing categorical data

You can use this dataset to build your own data cleaning notebook, or use it in interviews, assessments, and tutorials.
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
zip(1262 bytes)Available download formats
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
Science Education Research Topic Modeling Dataset
zenodo.org
data.niaid.nih.gov
bin, html +2
Updated Oct 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
Explore at:
bin, txt, html, text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4094974
Dataset updated
Oct 9, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.

We removed the reference sections of the articles, as well headings (in all caps) such as “ABSTRACT”.

We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)

We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.

We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.

We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).

We detected and create bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.

In addition to this file, we have also included the following files:

SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data

Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.

Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Student Performance and Learning Behavior Dataset for Educational Analytics
zenodo.org
bin, csv
Updated Aug 13, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kamal NAJEM; Kamal NAJEM (2025). Student Performance and Learning Behavior Dataset for Educational Analytics [Dataset]. http://doi.org/10.5281/zenodo.16459132
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.16459132
Dataset updated
Aug 13, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Kamal NAJEM; Kamal NAJEM
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 26, 2025
Description
The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

The dataset covers the following categories of variables:

Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions

Resource access and learning environment: Resources, Internet, EduTech

Motivation and psychological factors: Motivation, StressLevel

Demographic information: Gender, Age (ranging from 18 to 30 years)

Learning preference classification: LearningStyle

Academic performance indicators: ExamScore, FinalGrade

In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.

The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

Data Preprocessing –

Encoding categorical variables using LabelEncoder.

Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

Detecting and removing duplicates.

Clustering Analysis –

Applying K-Means clustering to segment learners into distinct profiles.

Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.

Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
d
Compilation of surface water diversion sites and daily withdrawals in the...
catalog.data.gov
data.usgs.gov
Updated Nov 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Compilation of surface water diversion sites and daily withdrawals in the Upper Colorado River and Little Colorado River Basins, 1980-2022 [Dataset]. https://catalog.data.gov/dataset/compilation-of-surface-water-diversion-sites-and-daily-withdrawals-in-the-upper-color-1980
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Colorado River, Little Colorado River
Description
This data release contains an inventory of 1,358 major surface water diversion structures with associated daily time series withdrawal records (1980-2022) for structures within the Upper Colorado River and Little Colorado River Basins. Diversion structures were included in this dataset if they were determined to have the capacity to divert water at rates greater than 10 cubic feet per second. Since those river basins encompasses portions of five states, water use data are dispersed among numerous federal and state agency databases and there is no centralized dataset that documents surface water use within the entire UCOL at a fine spatial and temporal resolution. Diversion structures and locations were identified from a mix of state reports, maps, and satellite imagery. A Python script was developed to automate retrieval of daily time series withdrawal records from multiple state and federal databases. The script was also used to process, filter, and harmonize the diversion records to remove outlier values and estimate missing data. The original withdrawal data, the processed datasets, and the Python script are included in this data release.
t
Tour Recommendation Model
test.researchdata.tuwien.at
bin, png +1
Updated May 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
Explore at:
text/markdown, png, binAvailable download formats
Unique identifier
https://doi.org/10.70124/akpf6-8p175
Dataset updated
May 14, 2025
Dataset provided by
TU Wien
Authors
Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 28, 2025
Description
Dataset Description for Tour Recommendation Model

Context and Methodology:

Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

Technical Details:

Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

place_or_event_id: Unique identifier for each tourist place or event.

rating: Rating given by the user, ranging from 1 to 5.

The data is split into three subsets:

Training Set: 80% of the dataset used to train the model.

Validation Set: A small portion used for hyperparameter tuning.

Test Set: 20% used to evaluate model performance.

Folder and File Naming Conventions:
The dataset files are stored in the following structure:

user_ratings_dataset.csv: The original dataset file containing user ratings.

tour_recommendation_model.pkl: The saved model after training.

actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

Software Requirements:
To open and work with this dataset, the following software and libraries are required:

Python 3.x

Pandas for data manipulation

Scikit-learn for training and evaluating machine learning models

Matplotlib for chart generation

Joblib for saving and loading the trained model

The dataset can be opened and processed using any Python environment that supports these libraries.

Additional Resources:

The model training code, README file, and performance chart are available in the project repository.

For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

Further Details:

Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

Train other types of models (e.g., regression, classification).

Experiment with different features or add more metadata to enrich the dataset.

Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
Cleaned Netflix Dataset for EDA
kaggle.com
zip
Updated Jul 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikhil raman K (2025). Cleaned Netflix Dataset for EDA [Dataset]. https://www.kaggle.com/datasets/nikhilramank/cleaned-netflix-dataset-for-eda
Explore at:
zip(750797 bytes)Available download formats
Dataset updated
Jul 7, 2025
Authors
Nikhil raman K
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is a cleaned version of a Netflix movies dataset prepared for exploratory data analysis (EDA). Missing values have been handled, invalid rows removed, and numerical + categorical columns cleaned for analysis using Python and Pandas.
s
Data from: Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric...
scholardata.sun.ac.za
data.mendeley.com
Updated Mar 8, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen (2025). Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023) [Dataset]. http://doi.org/10.25413/sun.28554200.v1
Explore at:
Unique identifier
https://doi.org/10.25413/sun.28554200.v1
Dataset updated
Mar 8, 2025
Dataset provided by
SUNScholarData
Authors
Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Nairobi
Description
This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)The dataset is organised into two main categories:Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metricsDaily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumptionThis dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.Institutions:EED AdvisoryClean Air TaskforceStellenbosch UniversitySteps to reproduce:Raw Data CollectionGPS tracking devices installed on motorcycles, collecting location data at 10-second intervalsRider-reported information on revenue, maintenance costs, and fuel/electricity usageProcessing StepsGPS data cleaning: Filtered invalid coordinates, removed duplicates, interpolated missing pointsTrip identification: Defined by >1 minute stationary periods or ignition cyclesTrip metrics calculation: Distance, duration, idle time, average/max speedsDaily data aggregation: Summed by user_id and date with self-reported economic dataValidation: Cross-checked with rider logs and known routesAnonymisation: Removed start and end coordinates for first and last trips of each day to protect rider privacy and home locationsTechnical InformationGeographic coverage: Nairobi, KenyaTime period: November-December 2023Time zone: UTC+3 (East Africa Time)Currency: Kenyan Shillings (KES)Data format: CSV filesSoftware used: Python 3.8 (pandas, numpy, geopy)Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.CategoriesMotorcycle, Transportation in Africa, Electric Vehicles

The Device Activity Report with Complete Knowledge (DARCK) for NILM

zenodo.org

bin, xz

Updated Sep 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850

Explore at:

bin, xzAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.17159850

Dataset updated

Sep 19, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Anonymous Anonymous; Anonymous Anonymous

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1. Abstract

This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

2. Dataset Overview

Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
Aggregate Meter: eBZ DD3
Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
Sampling Rate: 1 Hz
Measured Quantity: Active Power
Unit of Measurement: Watt
Duration: 6 months
Format: Single CSV file (`DARCK.csv`)
Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

3. Download and Usage

The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

As it contains longer off periods with zeros, the CSV file is nicely compressible.

To extract it use: xz -d DARCK.csv.xz.
The compression leads to a 97% smaller file size (From 4GB to 90.9MB).

To use the dataset in python, you can, e.g., load the csv file into a pandas dataframe.

python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])

4. Measurement Setup

The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

5. File Format (`DARCK.csv`)

The dataset is provided as a single comma-separated value (CSV) file.

The first row is a header containing the column names.
All power values are rounded to the first decimal place.
There are no missing values in the final dataset.
Each row represents 1 second, from start of measuring in March until the end in September.

Column Descriptions

Column Name	Data Type	Unit	Description
`time`	datetime	-	Timestamp for the reading in `YYYY-MM-DD HH:MM:SS`
`main`	float	Watt	Total aggregate power consumption for the apartment, measured at the main electrical panel.
`[appliance_name]`	float	Watt	Power consumption of an individual appliance (e.g., `lightbathroom`, `fridge`, `sherlockpc`). See Section 8 for a full list.
Aggregate Columns
`aggr_chargers`	float	Watt	The sum of `sherlockcharger`, `sherlocklaptop`, `watsoncharger`, `watsonlaptop`, `watsonipadcharger`, `kitchencharger`.
`aggr_stoveplates`	float	Watt	The sum of `stoveplatel1` and `stoveplatel2`.
`aggr_lights`	float	Watt	The sum of `lightbathroom`, `lighthallway`, `lightsherlock`, `lightkitchen`, `lightlivingroom`, `lightwatson`, `lightstoreroom`, `fcob`, `sherlockalarmclocklight`, `sherlockfloorlamphue`, `sherlockledstrip`, `livingfloorlamphue`, `sherlockglobe`, `watsonfloorlamp`, `watsondesklamp` and `watsonledmap`.
Analysis Columns
`inaccuracy`	float	Watt	As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.

6. Data Postprocessing Pipeline

The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

6.1. Main Meter (`main`) Postprocessing

The aggregate power data required several cleaning steps to ensure accuracy.

Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align those to the whole seconds, it was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.

6.2. Sub-metered Devices (`shellies`) Postprocessing

The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

Grouping: Data was grouped by the unique device identifier.
Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
This method was chosen to firstly, capture the last known state of the device within each second, handling rapid on/off events. Secondly, to forward-fill the last state across periods of no new data, modeling that the device's consumption remained constant until a new reading was sent.

6.3. Merging and Finalization

Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

7. Manual Corrections and Known Data Issues

During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

8. Appliance Details and Multipurpose Plugs

The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

n
Genome-wide SNP datasets for the non-native pink salmon in Norway
data.niaid.nih.gov
dataone.org
+1more
zip
Updated Feb 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simo Njabulo Maduna; Paul Eric Aspholm; Ane-Sofie Bednarczyk Hansen; Cornelya Klütsch; Snorre Hagen (2024). Genome-wide SNP datasets for the non-native pink salmon in Norway [Dataset]. http://doi.org/10.5061/dryad.zw3r228f2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.zw3r228f2
Dataset updated
Feb 5, 2024
Dataset provided by
Norwegian Institute of Bioeconomy Research
Authors
Simo Njabulo Maduna; Paul Eric Aspholm; Ane-Sofie Bednarczyk Hansen; Cornelya Klütsch; Snorre Hagen
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Area covered
Norway
Description
Effective management of non-indigenous species requires knowledge of their dispersal factors and founder events. We aim to identify the main environmental drivers favouring dispersal events along the invasion gradient and to characterize the spatial patterns of genetic diversity in feral populations of the non-native pink salmon within its epicentre of invasion in Norway. We first conducted SDM using four modelling techniques with varying levels of complexity, which encompassed both regression-based and tree-based machine-learning algorithms, using climatic data from the present to 2050. Then we used the triple-enzyme restriction-site associated DNA sequencing (3RADseq) approach to genotype over 30,000 high-quality single-nucleotide polymorphisms to elucidate patterns of genetic diversity and gene flow within the pink salmon putative invasion hotspot. We discovered temperature- and precipitation-related variables drove pink salmon distributional shifts across its non-native ranges, and that climate-induced favourable areas will remain stable for the next 30 years. In addition, all SDMs identified north-eastern Norway as the epicentre of the pink salmon invasion, and genomic data revealed that there was minimal variation in genetic diversity across the sampled populations at a genome-wide level in this region. While, upon utilizing a specific group of ‘diagnostic’ SNPs, we observed a significant degree of genetic differentiation, ranging from moderate to substantial, and detected four hierarchical genetic clusters concordant with geography. Our findings suggest that fluctuations of climate extreme events associated with ongoing climate change will likely maintain environmental favourability for the pink salmon outside its ‘native’/introduced ranges. Local invaded rivers are themselves a potential source population of invaders in the ongoing secondary spread of pink salmon in Northern Norway. Our study shows that SDMs and genomic data can reveal species distribution determinants and provide indicators to aid in post-control measures and potential inferences of their success. Methods 3RAD library preparation and sequencing: We prepared RADseq libraries using the Adapterama III library preparation protocol of Bayona-Vásquez et al., (2019; their Supplemental File SI). For each sample, ~40-100 ng of genomic DNA were digested for 1 h at 37 °C in a solution with 1.5 µl of 10x Cutsmart® buﬀer, 0.25 µl (NEB®) of Read 1 enzyme (MspI) at 20 U/µl, 0.25 µl of Read 2 enzyme (BamHI-HF) at 20 U/µl, 0.25 µl of Read 1 adapter dimer-cutting enzyme (ClaI) at 20 U/ µl, 1 µl of i5Tru adapter at 2.5 µM, 1 µl of i7Tru adapter at 2.5 µM and 0.75 µl of dH2O. After digestion/ligation, samples were pooled and cleaned with 1.2x Sera-Mag SpeedBeads (Fisher Scientiifc™) in a 1.2:1 (SpeedBeads:DNA) ratio, and we eluted cleaned DNA in 60 µL of TLE. An enrichment PCR of each sample was carried with 10 µl of 5x Kapa Long Range Buﬀer (Kapa Biosystems, Inc.), 0.25 µl of KAPA LongRange DNA Polymerase at 5 U/µl, 1.5 µl of dNTPs mix (10 mM each dNTP), 3.5 µl of MgCl2 at 25 mM, 2.5 µl of iTru5 primer at 5 µM, 2.5 µl of iTru7 primer at 5 µM and 5 µl of pooled DNA. The i5 and i7 adapters ligated to each sample using a unique combination (2 i5 X 1 i7 indexes). The temperature conditions for PCR enrichment were 94 °C for 2 min of initial denaturation, followed by 10 cycles of 94 °C for 20 sec, 57 °C for 15 sec and 72° for 30 sec, and a final cycle of 72 °C for 5 min. The enriched samples were each cleaned and quantified with a Quantus™ Fluorometer. Cleaned, indexed and quantified library pools were pooled to equimolar concentrations and were sent to the Norwegian Sequencing Centre (NSC) for quality control and subsequent final size selection using a one-sided bead clean-up (0.7:1 ratio) to capture 550 bp +/- 10% fragments, and the final paired-end (PE) 150 bp sequencing on one lane each of the Illumina HiSeq 4000 platform. Data filtering: We filtered genotype data and characterized singleton SNP loci and multi-site variants (MSVs) using filtering procedures and custom scripts available in scripts available in STACKS Workflow v.2 (https://github.com/enormandeau/stacks_workflow). First, we filtered the ‘raw’ VCF file keeping only SNPs that (i) showed a minimum depth of four (-m 4), (ii) were called in at least 80% of the samples in each site (-p 80) and (iii) and for which at least two samples had the rare allele i.e., Minor Allele Sample (MAS; -S 2), using the python script 05_filter_vcf_fast.py. Second, we exclude those samples with more than 20% missing genotypes from the data set. Third, we calculated pairwise relatedness between samples with the Yang et al., (2010) algorithm and individual-level heterozygosity in vcftools v.0.1.17 (Danecek et al., 2010). Additionally, we calculated pairwise kinship coefficients among individuals using the KING-robust method (Manichaikul et al., 2010) with the R package SNPRelate v.1.28.0 (Zheng et al., 2012). Then, we estimated genotyping error rates between technical replicates using the software tiger v1.0 (Bresadola et al., 2020). Finally, we removed one of the pair of closely related individuals exhibiting the higher level of missing data along with samples that showed extremely low heterozygosity (< -0.2) from graphical observation of individual-level heterozygosity per sampling population. Fourth, we conducted a secondary dataset filtering step using 05_filter_vcf_fast.py, keeping the above-mentioned data filtering cut-off parameters (i.e., -m = 4; -p = 80; -S = 3). Fifth, we calculated a suit of four summary statistics to discriminate high-confidence SNPs (singleton SNPs) from SNPs exhibiting a duplication pattern (duplicated SNPs; MSVs): (i) median of allele ratio in heterozygotes (MedRatio), (ii) proportion of heterozygotes (PropHet), (iii) proportion of rare homozygotes (PropHomRare) and (iv) inbreeding coefficient (FIS). We calculated each parameter from the filtered VCF file using the python script 08_extract_snp_duplication_info.py. The four parameters calculated for each locus were plotted against each other to visualize their distribution across all loci using the R script 09_classify_snps.R. Based on the methodology of McKinney et al. (2017) and by plotting different combinations of each parameter, we graphically fixed cut-offs for each parameter. Sixth, we then used the python script 10_split_vcf_in_categories.py for classify SNPs to generate two separate datasets: the “SNP dataset,” based on SNP singletons only, and the “MSV dataset,” based on duplicated SNPs only, which we excluded from further analyses. Seventh, we postfiltered the SNP dataset by keeping all unlinked SNPs within each 3RAD locus using the 11_extract_unlinked_snps.py script with a minimum difference of 0.5 (-diff_threshold 0.5) and a maximum distance 1,000 bp (-max_distance 1,000). Then, for the SNP dataset, we filtered out SNPs that were located in unplaced scaffolds i.e., contigs that were not part of the 26 chromosomes of the pink salmon genome.
MLB Batting Data (2015-2024)
kaggle.com
zip
Updated Sep 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Josue FernandezC (2025). MLB Batting Data (2015-2024) [Dataset]. https://www.kaggle.com/datasets/josuefernandezc/mlb-hitting-data-2015-2024
Explore at:
zip(272240 bytes)Available download formats
Dataset updated
Sep 29, 2025
Authors
Josue FernandezC
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
MLB Batting Stats (2015–2024)

📝Description

This dataset contains scraped Major League Baseball (MLB) batting statistics from Baseball Reference for the seasons 2015 through 2024. It was collected using a custom Python scraping script and then cleaned and processed in SQL for use in analytics and machine learning workflows.

The data provides a rich view of offensive player performance across a decade of MLB history. Each row represents a player’s season, with key batting metrics such as Batting Average (BA), On-Base Percentage (OBP), Slugging (SLG), OPS, RBI, and Games Played (G). This dataset is ideal for sports analytics, predictive modeling, and trend analysis.

⚙️Data Collection (Python)

Data was scraped directly from Baseball Reference using a Python script that:

Sent HTTP requests with browser-like headers to avoid request blocking.

Parsed HTML tables with pandas.read_html().

Added a Year column for each season.

Cleaned player names by removing symbols (#, *).

Kept summary rows for players who appeared on multiple teams/leagues.

Converted numeric fields and filled missing values with zeros.

Exported both raw and cleaned CSVs for each year.

🧹Data Cleaning (SQL)

After scraping, the raw batting tables were uploaded into BigQuery and further cleaned:

Null values removed – Rows missing key fields (Player, BA, OBP, SLG, OPS, Pos) were excluded.

Duplicate records handled – Identified duplicate player–year–league entries and kept only one instance.

Minimum playing threshold applied – Players with fewer than 100 at-bats were removed to focus on meaningful season-long contributions.

The final cleaned table (cleaned_batting_stats) provides consistent, duplicate-free player summaries suitable for analytics.

📊Dataset Structure

Columns include: - Player – Name of the player - Year – Season year - Age – Age during the season - Team – Team code (2TM for multiple teams) - Lg – League (AL, NL, or 2LG) - G – Games played - AB, H, 2B, 3B, HR, RBI – Core batting stats - BA, OBP, SLG, OPS – Rate statistics - Pos – Primary fielding position

🚀Potential Uses

League Trends: Compare batting averages and OPS across seasons.

Top Performer Analysis: Identify the best hitters in different eras.

Predictive Modeling: Forecast future player stats using regression or ML.

Clustering: Group players into offensive archetypes.# ## ## ##

Sports Dashboards: Build interactive Tableau/Plotly dashboards for fans and analysts.

📌Acknowledgments

Raw data sourced from Baseball Reference .

Inspired by open baseball datasets and community-driven sports analytics.
f
Socio-demographic and economic characteristics of respondents.
plos.figshare.com
figshare.com
xls
Updated Oct 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle (2023). Socio-demographic and economic characteristics of respondents. [Dataset]. http://doi.org/10.1371/journal.pdig.0000345.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pdig.0000345.t001
Dataset updated
Oct 17, 2023
Dataset provided by
PLOS Digital Health
Authors
Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Socio-demographic and economic characteristics of respondents.
Missing and Unaccounted-for People in Mexico (1960s–2025)
figshare.com
txt
Updated Oct 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Montserrat Mora (2025). Missing and Unaccounted-for People in Mexico (1960s–2025) [Dataset]. http://doi.org/10.6084/m9.figshare.28283000.v6
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28283000.v6
Dataset updated
Oct 2, 2025
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Montserrat Mora
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Mexico
Description
This project provides a comprehensive dataset of over 130,000 missing and unaccounted-for people in Mexico from the 1960s to 2025. The dataset is sourced from the publicly available records on the RNPDO website and represents individuals who were actively missing as of the date of collection (October 1, 2025). To protect individual identities, personal identifiers, such as names, have been removed.Dataset Features:The data has been cleaned and translated to facilitate analysis by a global audience.Fields include:SexDate of birthDate of incidenceState and municipality of the incidentData spans over six decades, offering insights into trends and regional disparities.Additional Materials:Python Script: A Python script to generate customizable visualizations based on the dataset. Users can specify the state to generate tailored charts.Sample Chart: An example chart showcasing the evolution of missing persons per 100,000 inhabitants in Mexico between 2006 and 2025.Requirements File: A requirements.txt file listing the necessary Python libraries to run the script seamlessly.This dataset and accompanying tools aim to support researchers, policymakers, and journalists in analyzing and addressing the issue of missing persons in Mexico.
Cleaned Pathogen Detection Dataset
kaggle.com
zip
Updated Sep 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Syed Sajeel Haider (2024). Cleaned Pathogen Detection Dataset [Dataset]. https://www.kaggle.com/datasets/sajkazmi/cleaned-pathogen-detection-dataset
Explore at:
zip(15434327 bytes)Available download formats
Dataset updated
Sep 14, 2024
Authors
Syed Sajeel Haider
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset is a cleaned and processed version of the original Pathogen Detection - Salmonella enterica dataset, which contains over half a million records of Salmonella enterica isolates from various sources. The data details pathogen strain information, geographical locations, antimicrobial resistance genotypes, and more. This ready-to-analyze version has been cleaned to address missing values, redundant data, and formatting inconsistencies, allowing future users to jump straight into analysis without worrying about preprocessing.

Changes Made: - Handling Missing Values: Columns like Strain, Location, Isolation type, and SNP cluster were cleaned by removing rows where critical information was missing, ensuring data completeness for meaningful analysis.

Redundant Data Removal: Columns with excessive missing values, such as Serovar and Isolation source, were removed to streamline the dataset and reduce noise. Additionally, duplicate rows were eliminated based on unique isolate identifiers.

Data Type Consistency: The Create date column was converted to proper datetime format, facilitating easier manipulation for time-based analyses. Numerical columns like Min-same and Min-diff were forward-filled to preserve continuity in the dataset.

Data Integrity Checks: Ensured that the key identifiers like Isolate, BioSample, and AMR genotypes remained consistent, improving the overall reliability and usability of the dataset.

Use Cases: The cleaned dataset is ideal for a range of studies, such as:

Epidemiological Studies: Understanding the geographical distribution and movement of Salmonella enterica strains, and identifying outbreak sources based on SNP cluster data.

Antimicrobial Resistance (AMR) Research: Investigating the prevalence of antimicrobial resistance genotypes across different strains, enabling insights into global AMR trends.

Public Health Surveillance: Tracking and monitoring pathogen diversity and trends over time, assisting in timely interventions and policy-making.

Environmental and Food Safety Analysis: Studying isolates from various sources (e.g., food, environmental samples) to improve food safety protocols and environmental health.

Format: The dataset is provided in CSV format, ready for immediate use in Python, R, or other data analysis tools. Users can quickly apply this dataset for pathogen surveillance, AMR analysis, or any other research requiring high-quality microbial data.
t
ESA CCI SM MODELFREE Long-term Climate Data Record of Surface Soil Moisture...
researchdata.tuwien.ac.at
researchdata.tuwien.at
zip
Updated Jan 27, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Diane Duchemin; Nemesio Rodriguez-Fernandez; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo; Diane Duchemin; Nemesio Rodriguez-Fernandez; Diane Duchemin; Nemesio Rodriguez-Fernandez; Diane Duchemin; Nemesio Rodriguez-Fernandez (2025). ESA CCI SM MODELFREE Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/svr1r-27j77
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.48436/svr1r-27j77
Dataset updated
Jan 27, 2025
Dataset provided by
TU Wien
Authors
Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Diane Duchemin; Nemesio Rodriguez-Fernandez; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo; Diane Duchemin; Nemesio Rodriguez-Fernandez; Diane Duchemin; Nemesio Rodriguez-Fernandez; Diane Duchemin; Nemesio Rodriguez-Fernandez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

Abstract

The MODELFREE product of the ESA CCI SM v9.1 science data suite provides - similar to the COMBINED product - global, harmonized daily satellite soil moisture measurements from both radar and radiometer observations. This product contains soil moisture estimates at 0.25-degree spatial resolution, and covers the period from 2002-2023. Soil moisture is derived from observations of 13 different active and passive satellites operating across various frequency bands (K, C, X, and L-band). Unlike the COMBINED product, for which soil moisture fields from the GLDAS Noah model dataset are used to harmonize individual satellite sensor measurements, the MODELFREE product utilizes a satellite-only scaling reference dataset (Madelon et al., 2022). This reference incorporates gap-filled soil moisture derived from AMSR-E (2002-2010) and from intercalibrated SMAP/SMOS brightness temperature data (2010-2023). The merging algorithm employed is consistent with that of the v9.1 COMBINED product. The new scaling reference leads to significantly different absolute soil moisture values, especially in latitudes above 60 °N. Data from the SMMR, SSMI and ERS missions are not included in this product.

This product is in its early development stage and should be used with caution, as it may contain incomplete or unvalidated data.

Summary

First version of a model-independent version of the ESA CCI SM COMBINED product

2002-2023, global, 0.25 deg. resolution

GLDAS Noah (model) is replaced with a purely satellite-based scaling reference

Different absolute value range compared to the COMBINED product is expected due to the different scaling reference used

Known issues

A temporal inconsistency is observed between the AMSR-E and SMOS period (at 01-2010). This can affect long-term trends in the data

In the period from 01-2002 to 06-2002 no data are available above 37 °N and below 37 °S respectively (all measurements in this period are from the TRMM Microwave Imager)

Technical Details

The dataset provides global daily estimates for the 2002-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_MODELFREE-YYYYMMDD000000-fv09.1.nc

Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).

sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of satellite observations. Derived using triple collocation analysis.

dn_flag: (int) Indicator for satellite orbit(s) used in the retrieval (day/nighttime). 1=day, 2=night, 3=both

flag: (int) Indicator for data quality / missing data indicator. For more details, see netcdf attributes.

freqbandID: (int) Indicator for frequency band(s) used in the retrieval. For more details, see netcdf attributes.

mode: (int) Indicator for satellite orbit(s) used in the retrieval (ascending, descending)

sensor: (int) Indicator for satellite sensor(s) used in the retrieval. For more details, see netcdf attributes.

t0: (float) Representative time stamp, based on overpass times of all merged satellites.

Additional information for each variable is given in the netCDF attributes.

Software to open netCDF files

These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

https://github.com/pydata/xarray" target="_blank" rel="noopener">Xarray (python)

https://unidata.github.io/netcdf4-python/" target="_blank" rel="noopener">netCDF4 (python)

https://github.com/TUW-GEO/esa_cci_sm">esa_cci_sm (python)

Similar tools exists for other programming languages (Matlab, R, etc.)

Software packages and GIS tools can open netCDF files, e.g. CDO, NCO, QGIS, ArCGIS

You can also use the GUI software Panoply to view the contents of each file

References

R. Madelon et al., “Toward the Removal of Model Dependency in Soil Moisture Climate Data Records by Using an L-Band Scaling Reference," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 831-848, 2022, doi: 10.1109/JSTARS.2021.3137008.

Related Records

The following records are all part of the Soil Moisture Climate Data Records from satellites community

1
ESA CCI SM RZSM Root-Zone Soil Moisture Record
10.48436/v8cwj-jk556
2
ESA CCI SM GAPFILLED Surface Soil Moisture Record
10.48436/hcm6n-t4m35
Overwatch 2 statistics
kaggle.com
zip
Updated Jun 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mykhailo Kachan (2023). Overwatch 2 statistics [Dataset]. https://www.kaggle.com/datasets/mykhailokachan/overwatch-2-statistics/code
Explore at:
zip(67546 bytes)Available download formats
Dataset updated
Jun 27, 2023
Authors
Mykhailo Kachan
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is built on data from Overbuff with the help of python and selenium. Development environment - Jupyter Notebook.

The tables contain the data for competitive seasons 1-4 and for quick play for each hero and rank along with the standard statistics (common to each hero as well as information belonging to a specific hero).

Note: data for some columns are missing on Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. 'Self Healing' column for Bastion was dropped too as Bastion doesn't have this property anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.

Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change a skill tier, the data isn't changed). If you know a site where it's possible to get this data, please, leave a comment. Thank you!

The code on GitHub .

All procedure is done in 5 stages:

Stage 1:

Data is retrieved directly from HTML elements on the page with the selenium tool on python.

Stage 2:

After scraping, data was cleansed: 1) Deleted comma separator on thousands (e.g. 1,009 => 1009). 2) Translated time representation (e.g. '01:23') to seconds (1*60 + 23 => 83). 3) Lúcio has become Lucio, Torbjörn - Torbjorn.

Stage 3:

Data were arranged into a table and saved to CSV.

Stage 4:

Columns which are supposed to have only numeric values are checked. All non-numeric values are dropped. This stage helps to find missing values which contain '—' instead and delete them.

Stage 5:

Additional missing values are searched for and dealt with. It's either column rename that happens (as the program cannot infer the correct column name for missing values) or a column drop. This stage ensures all wrong data are truly fixed.

The procedure to fetch the data takes 7 minutes on average.

This project and code were born from this GitHub code.
LCK Spring 2024 Players Statistics
kaggle.com
zip
Updated Dec 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lukas Rozado (2024). LCK Spring 2024 Players Statistics [Dataset]. https://www.kaggle.com/datasets/lukasrozado/lck-spring-2024-players-statistics/code
Explore at:
zip(156203 bytes)Available download formats
Dataset updated
Dec 1, 2024
Authors
Lukas Rozado
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset provides an in-depth look at the League of Legends Champions Korea (LCK) Spring 2024 season. It includes detailed metrics for players, champions, and matches, meticulously cleaned and organized for easy analysis and modeling.

Data Collection

The data was collected using a combination of manual efforts and automated web scraping tools. Specifically:

Source: Data was gathered from Gol.gg, a well-known platform for League of Legends statistics. Automation: Web scraping was performed using Python libraries like BeautifulSoup and Selenium to extract information on players, matches, and champions efficiently. Focus: The scripts were designed to capture relevant performance metrics for each player and champion used during the Spring 2024 split.

Data Cleaning and Processing

The raw data obtained from web scraping required significant preprocessing to ensure its usability. The following steps were taken:

Handling Raw Data:

Extracted key performance indicators like KDA, Win Rate, Games Played, and Match Durations from the source. Normalized inconsistent formats for metrics such as win rates (e.g., removing %) and durations (e.g., converting MM:SS to total seconds).

Data Cleaning:

Removed duplicate rows and ensured no missing values. Fixed inconsistencies in player and champion names to maintain uniformity. Checked for outliers in numerical metrics (e.g., unrealistically high KDA values).

Data Organization:

Created three separate tables for better data management:

Player Statistics: General player performance metrics like KDA, win rates, and average kills. Champion Statistics: Data on games played, win rates, and KDA for each champion. Match List: Details of each match, including players, champions, and results. Added sequential Player IDs to connect the three datasets, facilitating relational analysis. Date Formatting: Converted all date fields to the DD/MM/YYYY format for consistency. Removed irrelevant time data to focus solely on match dates.

Tools and Libraries Used

The following tools were used throughout the project:

Python: Libraries: Pandas, NumPy for data manipulation; BeautifulSoup, Selenium for web scraping. Visualization: Matplotlib, Seaborn, Plotly for potential analysis. Excel: Consolidated final datasets into a structured Excel file with multiple sheets. Data Validation: Used Python scripts to check for missing data, validate numerical columns, and ensure data consistency. Kaggle Integration: Cleaned datasets and a comprehensive README file were prepared for direct upload to Kaggle.

Applications

This dataset is ready for use in: Exploratory Data Analysis (EDA): Visualize player and champion performance trends across matches. Machine Learning: Develop models to predict match outcomes based on player and champion statistics. Sports Analytics: Gain insights into champion picks, win rates, and individual player strategies.

Acknowledgments

This dataset was made possible by the extensive statistics available on Gol.gg and the use of Python-based web scraping and data cleaning methodologies. It is shared under the CC BY 4.0 License to encourage reuse and collaboration.
Netflix Movies and TV Shows Dataset Cleaned(excel)
kaggle.com
Updated Apr 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 8, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gaurav Tawri
Description
This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

🎯 What’s Included: - Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values) - A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.) - Columns like 'date_added' have been properly formatted into DMY structure - Multi-valued columns like 'listed_in' are split for better analysis - Null values replaced with “Unknown” for clarity - Duration field broken into numeric + unit components

🔍 Dataset Purpose: Ideal for beginners and analysts who want to: - Practice data cleaning in Excel - Explore Netflix content trends - Analyze content by type, country, genre, or date added

📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
Wireless Sensor Network Dataset
kaggle.com
zip
Updated Jun 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rehan Adil Abbasi (2024). Wireless Sensor Network Dataset [Dataset]. https://www.kaggle.com/datasets/rehanadilabbasi/wireless-sensor-network-dataset/code
Explore at:
zip(258458 bytes)Available download formats
Dataset updated
Jun 19, 2024
Authors
Rehan Adil Abbasi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Basic Information:

Number of entries: 374,661 Number of features: 19 Data Types:

15 integer columns 3 float columns 1 object column (label) Column Names:

id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label Explore the Dataset First Five Rows:

id Time Is_CH who CH Dist_To_CH ADV_S ADV_R JOIN_S JOIN_R SCH_S SCH_R Rank DATA_S DATA_R Data_Sent_To_BS dist_CH_To_BS send_code Consumed Energy label 0 101000 50 1 101000 0.00000 1 0 0 25 1 0 0 0 1200 48 0.00000 1 0.00000 Attack 1 101001 50 0 101044 75.32345 0 4 1 0 0 1 2 38 0 0 0.00000 1 0.09797 Normal 2 101002 50 0 101010 46.95453 0 4 1 0 0 1 19 41 0 0 0.00000 1 0.09797 Normal 3 101003 50 0 101044 64.85231 0 4 1 0 0 1 16 38 0 0 0.00000 1 0.09797 Normal 4 101004 50 0 101010 4.83341 0 4 1 0 0 1 0 41 0 0 0.00000 1 0.09797 Normal Missing Values: No missing values detected in the dataset.

Statistical Summary:

The dataset includes various features related to network operations such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable label contains categorical values such as "Attack" and "Normal". Analyze Class Distribution Let's analyze the distribution of the classes within the dataset.

python Copy code class_distribution = dataset['label'].value_counts() class_distribution Handle Class Imbalance If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.

Next Steps:

Identify the class distribution. Apply balancing techniques if necessary. Continue with data preprocessing and feature engineering. We will perform the class distribution analysis and balancing in the subsequent step.

i have found some duplicate values and drop them dataset.duplicated().sum() dataset.drop_duplicates(inplace=True)

ChatGPT Duplicate Handling Initial Duplicate Count Duplicates found: 8,873 Action Taken Duplicates dropped: The dataset has been cleaned by removing all duplicate entries. Verification Duplicates after cleaning: 0 The dataset is now free of duplicate entries. We can proceed with the analysis of the class distribution and any necessary steps to handle class imbalance.

Analyze Class Distribution Let's analyze the distribution of the label column to understand the balance between the classes.

python Copy code class_distribution = dataset['label'].value_counts() class_distribution I'll compute this now.

Class Distribution Analysis The distribution of the classes within the dataset is as follows:

Normal: 332,040 Grayhole: 13,909 Blackhole: 10,049 TDMA: 6,633 Flooding: 3,157 Observations There is a significant class imbalance, with the majority class being "Normal" (332,040 instances). Minority classes include "Grayhole", "Blackhole", "TDMA", and "Flooding".
Police Data
kaggle.com
zip
Updated Aug 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Science Lovers (2025). Police Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/police-data
Explore at:
zip(543415 bytes)Available download formats
Dataset updated
Aug 2, 2025
Authors
Data Science Lovers
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
📹Project Video available on YouTube - https://youtu.be/GyUbo45mVSE

🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

Police Check-posts Data

This dataset contains detailed records of police traffic stops. Each row represents a single stop, with information about the date, time, driver demographics, the reason for the stop, whether a search was conducted, and the outcome. It can be useful for analysing traffic stop patterns, demographic trends, law enforcement behaviour, and correlations with violations or arrests.

Using this dataset, we answered multiple questions with Python in our Project.

Q.1) Instruction ( For Data Cleaning ) - Remove the column that only contains missing values

Q.2) For Speeding , were Men or Women stopped more often ?

Q.3) Does gender affect who gets searched during a stop ?

Q.4) What is the mean stop_duration ?

Q.5) Compare the age distributions for each violation

These are the main Features/Columns available in the dataset :

1) stop_date – The date on which the traffic stop occurred.

2) stop_time – The exact time when the stop took place.

3) driver_gender – Gender of the driver (M for male, F for female).

4) driver_age_raw – Raw recorded birth year of the driver.

5) driver_age – Calculated or cleaned driver’s age at the time of the stop.

6) driver_race – Race or ethnicity of the driver (e.g., White, Black, Asian, Hispanic).

7) violation_raw – Original recorded reason for the stop.

8) violation – Categorized reason for the stop (e.g., Speeding, Other).

9) search_conducted – Boolean value indicating whether a search was performed (True/False).

10) search_type – Type of search conducted, if any (e.g., vehicle search, driver search).

11) stop_outcome – The result of the stop (e.g., Citation, Arrest, Warning).

12) is_arrested – Boolean value indicating if the driver was arrested (True/False).

13) stop_duration – Approximate length of the stop (e.g., 0-15 Min, 16-30 Min).

14) drugs_related_stop – Boolean value indicating if the stop was related to drugs (True/False).

Facebook

Twitter

Click to copy link

Link copied

Cite

Zuhair khan (2025). Cleaning Practice with Errors & Missing Values [Dataset]. https://www.kaggle.com/datasets/zuhairkhan13/cleaning-practice-with-errors-and-missing-values

Cleaning Practice with Errors & Missing Values

Data Cleaning Practice Dataset (with Errors & Missing Values)

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 5, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Zuhair khan

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.

It includes 500 rows of simulated employee data with intentional errors such as:

Missing values in Age and Salary

Typos in email addresses (@gamil.com)

Inconsistent city name casing (e.g., lahore, Karachi)

Extra spaces in department names (e.g., " HR ")

✅ Skills You Can Practice:

Detecting and handling missing data

String cleaning and formatting

Removing duplicates

Validating email formats

Standardizing categorical data

You can use this dataset to build your own data cleaning notebook, or use it in interviews, assessments, and tutorials.

Clear search

Close search

Google apps

Main menu

1	ESA CCI SM RZSM Root-Zone Soil Moisture Record	10.48436/v8cwj-jk556
2	ESA CCI SM GAPFILLED Surface Soil Moisture Record	10.48436/hcm6n-t4m35

Cleaning Practice with Errors & Missing Values

Medical Clean Dataset

Science Education Research Topic Modeling Dataset

Student Performance and Learning Behavior Dataset for Educational Analytics

Compilation of surface water diversion sites and daily withdrawals in the...

Tour Recommendation Model

Dataset Description for Tour Recommendation Model

Context and Methodology:

Technical Details:

Further Details:

Cleaned Netflix Dataset for EDA

Data from: Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric...

The Device Activity Report with Complete Knowledge (DARCK) for NILM

1. Abstract

2. Dataset Overview

3. Download and Usage

4. Measurement Setup

5. File Format (DARCK.csv)

Column Descriptions

Column Name

Data Type

Unit

Description

6. Data Postprocessing Pipeline

6.1. Main Meter (main) Postprocessing

6.2. Sub-metered Devices (shellies) Postprocessing

6.3. Merging and Finalization

7. Manual Corrections and Known Data Issues

8. Appliance Details and Multipurpose Plugs

Genome-wide SNP datasets for the non-native pink salmon in Norway

MLB Batting Data (2015-2024)

MLB Batting Stats (2015–2024)

📝Description

⚙️Data Collection (Python)

🧹Data Cleaning (SQL)

📊Dataset Structure

🚀Potential Uses

📌Acknowledgments

Socio-demographic and economic characteristics of respondents.

Missing and Unaccounted-for People in Mexico (1960s–2025)

Cleaned Pathogen Detection Dataset

ESA CCI SM MODELFREE Long-term Climate Data Record of Surface Soil Moisture...

Abstract

Summary

Known issues

Technical Details

Software to open netCDF files

References

Related Records

Overwatch 2 statistics

Stage 1:

Stage 2:

Stage 3:

Stage 4:

Stage 5:

LCK Spring 2024 Players Statistics

Data Collection

Data Cleaning and Processing

Handling Raw Data:

Data Cleaning:

Data Organization:

Tools and Libraries Used

Applications

Acknowledgments

Netflix Movies and TV Shows Dataset Cleaned(excel)

Wireless Sensor Network Dataset

Police Data

📹Project Video available on YouTube - https://youtu.be/GyUbo45mVSE

🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

Police Check-posts Data

Using this dataset, we answered multiple questions with Python in our Project.

These are the main Features/Columns available in the dataset :

Cleaning Practice with Errors & Missing Values

Data Cleaning Practice Dataset (with Errors & Missing Values)

5. File Format (`DARCK.csv`)

6.1. Main Meter (`main`) Postprocessing

6.2. Sub-metered Devices (`shellies`) Postprocessing