100+ datasets found

d
Python Script for Cleaning Alum Dataset
search.dataone.org
hydroshare.org
Updated Oct 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e
Explore at:
Dataset updated
Oct 18, 2025
Dataset provided by
Hydroshare
Authors
saikumar payyavula; Jeff Sadler
Description
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
zip(1262 bytes)Available download formats
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
Z
Data Cleaning, Translation & Split of the Dataset for the Automatic...
data.niaid.nih.gov
zenodo.org
Updated Aug 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
Explore at:
Dataset updated
Aug 8, 2022
Authors
Köhler, Juliane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
h
codeparrot-clean
huggingface.co
Updated Dec 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CodeParrot (2021). codeparrot-clean [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-clean
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 7, 2021
Dataset provided by
Good Engineering, Inc
Authors
CodeParrot
Description
CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from Github. This is the deduplicated version of the codeparrot.

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

Deduplication Remove exact matches

Filtering Average line length < 100 Maximum line length < 1000 Alpha numeric characters fraction > 0.25 Remove auto-generated files (keyword search)

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
Netflix Movies and TV Shows Dataset Cleaned(excel)
kaggle.com
Updated Apr 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 8, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gaurav Tawri
Description
This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

🎯 What’s Included: - Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values) - A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.) - Columns like 'date_added' have been properly formatted into DMY structure - Multi-valued columns like 'listed_in' are split for better analysis - Null values replaced with “Unknown” for clarity - Duration field broken into numeric + unit components

🔍 Dataset Purpose: Ideal for beginners and analysts who want to: - Practice data cleaning in Excel - Explore Netflix content trends - Analyze content by type, country, genre, or date added

📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
Pandas Practice Dataset
kaggle.com
zip
Updated Jan 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset/discussion
Explore at:
zip(493 bytes)Available download formats
Dataset updated
Jan 27, 2023
Authors
Mrityunjay Pathak
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and make conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

What Can Pandas Do?

Pandas gives you answers about the data. Like:

Is there a correlation between two or more columns?

What is average value?

Max value?

Min value?
Z
NoCORA - Northern Cameroon Observed Rainfall Archive
data.niaid.nih.gov
zenodo.org
Updated Jul 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel (2024). NoCORA - Northern Cameroon Observed Rainfall Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10156437
Explore at:
Dataset updated
Jul 10, 2024
Dataset provided by
Center for International Forestry Research
Centre de Coopération Internationale en Recherche Agronomique pour le Développement
Authors
Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel
Area covered
Cameroon, North Region
Description
Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, overing more than 1 million observations across 418 rainfall stations on a temporal range going from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it a hopefully robust and reliable resource for understanding climatic dynamics in Northern Cameroon. Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates. Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process should ensured the accuracy and consistency of the dataset. Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications. Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community. Authors Contributions:

Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne. Funding: This project was funded by the DESIRA INNOVACC project. Changelog:

v1.0.2 : corrected interversion in column names in the coordinates dataset v1.0.1 : dataset specification file has been updated with complementary information regarding station locations v1.0.0 : initial submission
Dirty Dataset to practice Data Cleaning
kaggle.com
zip
Updated May 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Kanju (2024). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/martinkanju/dirty-dataset-to-practice-data-cleaning
Explore at:
zip(1235 bytes)Available download formats
Dataset updated
May 20, 2024
Authors
Martin Kanju
Description
Dataset

This dataset was created by Martin Kanju

Released under Other (specified in description)

Contents
H
Python Codes for Data Analysis of The Impact of COVID-19 on Technical...
dataverse.harvard.edu
figshare.com
Updated Mar 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/SXMSDZ
Dataset updated
Mar 21, 2022
Dataset provided by
Harvard Dataverse
Authors
Elizabeth Szkirpan
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
Saccade data cleaning
figshare.com
txt
Updated Mar 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Annie Campbell (2022). Saccade data cleaning [Dataset]. http://doi.org/10.6084/m9.figshare.4810471.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4810471.v1
Dataset updated
Mar 26, 2022
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Annie Campbell
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
python scripts and functions needed to view and clean saccade data
E
A Replication Dataset for Fundamental Frequency Estimation
live.european-language-grid.eu
data.niaid.nih.gov
json
Updated Oct 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7808
Explore at:
jsonAvailable download formats
Dataset updated
Oct 19, 2023
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.© 2020, Bastian Bechtold. All rights reserved. Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimatimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.Included Code and Data
ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
CMU-ARCTIC (consensus truth) [1]FDA (corpus truth and consensus truth) [2]KEELE (corpus truth and consensus truth) [3]MOCHA-TIMIT (consensus truth) [4]PTDB-TUG (corpus truth and consensus truth) [5]TIMIT (consensus truth) [6]
noisy speech data.zip is a JBOF datasets of fundamental frequency estimates of speech files mixed with noise from the following corpora:NOISEX [7]QUT-NOISE [8]
synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:AUTOC [9]AMDF [10]BANA [11]CEP [12]CREPE [13]DIO [14]DNN [15]KALDI [16]MAPSMBSC [17]NLS [18]PEFAC [19]PRAAT [20]RAPT [21]SACC [22]SAFE [23]SHR [24]SIFT [25]SRH [26]STRAIGHT [27]SWIPE [28]YAAPT [29]YIN [30]
noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.Fine Pitch Error (FPE), the mean error of grossly correct estimates.High/Low Octave Pitch Error (OPE), the percentage pitches that are GPEs and happens to be at an integer multiple of the true pitch.Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.Fine Remaining Bias (FRB), the median error of GREs.True Positive Rate (TPR), the percentage of true positive voicing estimates.False Positive Rate (FPR), the percentage of false positive voicing estimates.False Negative Rate (FNR), the percentage of false negative voicing estimates.F₁, the harmonic mean of precision and recall of the voicing decision.
Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.
The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 Gb of memory.References:
John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recog- nition systems. Speech Communication, 12(3):247–251, July 1993.David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262—266, 1968.Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353—362, 1974.Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, Decem- ber 2014.Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518—530, February 2014.Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences, volume 17, page 97—110. Amsterdam, 1993.David Talkin. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis, 495:518, 1995.Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I—333. IEEE, 2002.Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367—377, December 1972.Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, page 1973—1976, 2011.Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acous- tics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics Speech and Signal Processing, pages I–361–I–364, Orlando, FL, USA, May 2002. IEEE.Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.
Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...
data.europa.eu
unknown
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5244636?locale=lv
Explore at:
unknown(1052407809)Available download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is gathered on Sep. 17th 2020 from GitHub. It has clean and complete versions (from v0.7): The clean version has 5.1K type-checked Python repositories and 1.2M type annotations. The complete version has 5.2K Python repositories and 3.3M type annotations. The dataset's source files are type-checked using mypy (clean version). The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
o
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
explore.openaire.eu
data.europa.eu
Updated Apr 26, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4044635
Dataset updated
Apr 26, 2021
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
Description
The dataset is gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.

BI intro to data cleaning eda and machine learning

kaggle.com

zip

Updated Nov 17, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions

Explore at:

zip(9961 bytes)Available download formats

Dataset updated

Nov 17, 2025

Authors

Walekhwa Tambiti Leo Philip

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Real-World Data Science Challenge

Business Intelligence Program Strategy — Student Success Optimization

Hosted by: Walsoft Computer Institute 📁 Download dataset 👤 Kaggle profile

Background

Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

Admissions decision-making
Academic support strategies
Overall program impact and ROI

Your Mission

Answer this central question:

“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

Key Strategic Areas

You are required to analyze and provide actionable insights for the following three areas:

1. Admissions Optimization

Should entry exams remain the primary admissions filter?

Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

✅ Deliverables:

Feature importance ranking for predicting Python and DB scores
Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
Business rationale and risk analysis

2. Curriculum Support Strategy

Are there at-risk student groups who need extra support?

Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

✅ Deliverables:

At-risk segment identification
Support program design (e.g., prep course, mentoring)
Expected outcomes, costs, and KPIs

3. Resource Allocation & Program ROI

How can we allocate resources for maximum student success?

Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

✅ Deliverables:

Performance drivers
Student segmentation
Resource allocation plan and ROI projection

🛠️ Dataset Overview

Column	Description
`fNAME`, `lNAME`	Student first and last name
`Age`	Student age (21–71 years)
`gender`	Gender (standardized as "Male"/"Female")
`country`	Student’s country of origin
`residence`	Student housing/residence type
`entryEXAM`	Entry test score (28–98)
`prevEducation`	Prior education (High School, Diploma, etc.)
`studyHOURS`	Total study hours logged
`Python`	Final Python exam score
`DB`	Final Database exam score

📊 Dataset

You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

Raw Dataset (Recommended for Full Project)

Download: bi.csv

This dataset includes common data quality challenges:

Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using column mean or suitable imputation strategy

Participants using this dataset are expected to apply data cleaning techniques such as: - String standardization - Null value imputation - Type correction (e.g., scores as float) - Validation and visual verification

✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

Cleaned Dataset (Optional Shortcut)

Download: cleaned_bi.csv

This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...

London 'Data' Job Posts, Raw and Clean.
kaggle.com
zip
Updated Dec 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
EdRenton (2022). London 'Data' Job Posts, Raw and Clean. [Dataset]. https://www.kaggle.com/datasets/edrenton/job-post-data-raw-cleaned-using-sql/discussion
Explore at:
zip(260018 bytes)Available download formats
Dataset updated
Dec 15, 2022
Authors
EdRenton
Area covered
London
Description
This is a dataset I extracted from UK job posts in London, from reed.co.uk. These jobs have the keyword 'data' in them. I extracted the data using python. I created a loop to extract over 400 pages, allowing me to scrape over 10,000 job posts.
BBC-News Dataset
kaggle.com
zip
Updated Aug 11, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahil Kirpekar (2020). BBC-News Dataset [Dataset]. https://www.kaggle.com/datasets/sahilkirpekar/bbcnews-dataset/code
Explore at:
zip(2177968 bytes)Available download formats
Dataset updated
Aug 11, 2020
Authors
Sahil Kirpekar
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Hello data people ! 😄

This is the BBC news dataset (cleaned version) which I have uploaded after my previous dataset post. The original dataset downloaded from the UCI Machine Learning Repository was unclean. The dataset was cleaned by extracting the keywords from the description column into the noisy 'keys' column data.

About the Dataset 🔢

The BBC news dataset consists of the following data 1. # - News ID. 2. descr - description/detail of the news provided. 3. tags - the tags/keywords related to the corresponding news in the 'descr' label.
Z
#PraCegoVer dataset
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Explore at:
Dataset updated
Jan 19, 2023
Dataset provided by
Institute of Computing, University of Campinas
Authors
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
Description
Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

containing the images. The file dataset.json comprehends a list of json objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
Indian Housing Datasets -ML-ready data for - Citys
kaggle.com
zip
Updated Nov 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vishal Baghel (2025). Indian Housing Datasets -ML-ready data for - Citys [Dataset]. https://www.kaggle.com/datasets/vishalbaghel28/indian-housing-datasets-ml-ready-data-for-citys
Explore at:
zip(153559 bytes)Available download formats
Dataset updated
Nov 8, 2025
Authors
Vishal Baghel
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
🇮🇳 Indian Housing Datasets — Ready-to-use for ML & Data Analysis

This dataset provides clean, ready-to-use Indian housing data for: - 🏙️ Ahmedabad - 🏙️ Gurugram - 🏙️ Mumbai

Each dataset includes features like: - Property size (sqft) - Location & locality - Price - Number of bedrooms - Furnishing details - Property type (apartment, villa, etc.) - Age of property

All datasets are formatted in CSV for quick loading and analysis in Python, Pandas, or any ML pipeline.

📦 Python Package (for easy access)

You can directly load these datasets using my PyPI library:

pip install india-housing-datasets
D
CompuCrawl: Full database and code
dataverse.nl
Updated Sep 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Richard Haans; Richard Haans (2025). CompuCrawl: Full database and code [Dataset]. http://doi.org/10.34894/OBVAOY
Explore at:
Unique identifier
https://doi.org/10.34894/OBVAOY
Dataset updated
Sep 23, 2025
Dataset provided by
DataverseNL
Authors
Richard Haans; Richard Haans
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020\u2014representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.The files are ordered by moment of use in the work flow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.The full set of files, in order of use, is as follows:Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.URLs_1_deeper.csv: List of URLs one page deeper on the main domains.02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.HTML.zip: Archived version of the set of individual HTML files.03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.04 GPT application.py: Python script using OpenAI\u2019s API to classify selected pages according to their HTML title and URL.categorization_applied.csv: Output file containing classification of selected pages.exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with setting and decisions described at the top of the script. This script also combined individual pages into one combined observation per GVKEY/year.metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.06 Topic model.R: R script that loads up the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.TM_125.RData: RData file containing the results of the 125-topic model.loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.125_topprob.xlsx: Overview of top-loading terms for the 125 topic model.07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms \u201csustainability\u201d and \u201cprofitability\u201d over time.99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and combined and cleaned texts at the GVKEY/year level, respectively.The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found here: https://haans-mertens.github.io/ and the following article: \u201cThe Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data\u201d, by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
h
codeparrot-train-near-deduplication
huggingface.co
Updated Dec 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CodeParrot (2022). codeparrot-train-near-deduplication [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 1, 2022
Dataset provided by
Good Engineering, Inc
Authors
CodeParrot
Description
CodeParrot 🦜 Dataset after near deduplication (train)

Dataset Description

A dataset of Python files from Github. We performed near deduplication of this dataset split codeparrot-clean-train from codeparrot-clean. Exact deduplication can miss a fair amount of nearly identical files. We used MinHash with a Jaccard threshold (default=0.85) to create duplicate clusters. Then these clusters are reduced to unique files based on the exact Jaccard similarity. Fore more details… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication.

Facebook

Twitter

Click to copy link

Link copied

Cite

saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e

Python Script for Cleaning Alum Dataset

Explore at:

Dataset updated

Oct 18, 2025

Dataset provided by

Hydroshare

Authors

saikumar payyavula; Jeff Sadler

Description

This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.

Clear search

Close search

Google apps

Main menu

Python Script for Cleaning Alum Dataset

Medical Clean Dataset

Data Cleaning, Translation & Split of the Dataset for the Automatic...

codeparrot-clean

Netflix Movies and TV Shows Dataset Cleaned(excel)

Pandas Practice Dataset

NoCORA - Northern Cameroon Observed Rainfall Archive

Dirty Dataset to practice Data Cleaning

Dataset

Contents

Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

Saccade data cleaning

A Replication Dataset for Fundamental Frequency Estimation

Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...

Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

BI intro to data cleaning eda and machine learning

Real-World Data Science Challenge

Business Intelligence Program Strategy — Student Success Optimization

Background

Your Mission

Key Strategic Areas

1. Admissions Optimization

2. Curriculum Support Strategy

3. Resource Allocation & Program ROI

🛠️ Dataset Overview

📊 Dataset

Raw Dataset (Recommended for Full Project)

Cleaned Dataset (Optional Shortcut)

London 'Data' Job Posts, Raw and Clean.

BBC-News Dataset

Hello data people ! 😄

About the Dataset 🔢

#PraCegoVer dataset

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

Indian Housing Datasets -ML-ready data for - Citys

🇮🇳 Indian Housing Datasets — Ready-to-use for ML & Data Analysis

📦 Python Package (for easy access)

CompuCrawl: Full database and code

codeparrot-train-near-deduplication

Python Script for Cleaning Alum Dataset