100+ datasets found

Python Script for Cleaning Alum Dataset
search.dataone.org
hydroshare.org
Updated Oct 18, 2025
Cite
saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e
Explore at:
Dataset updated
Oct 18, 2025
Dataset provided by
Hydroshare
Authors
saikumar payyavula; Jeff Sadler
Description
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
Data Cleaning - Feature Imputation
kaggle.com
zip
Updated Aug 13, 2022
Cite
Mr.Machine (2022). Data Cleaning - Feature Imputation [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-cleaning-feature-imputation
Explore at:
Available download formats: zip (116097 bytes)
Dataset updated
Aug 13, 2022
Authors
Mr.Machine
Description
Data cleaning (or data cleansing) means cleaning the data by imputing missing values, smoothing noisy data, and identifying or removing outliers. Missing values generally arise from collection errors or data corruption.

More detail: Feature Engineering - Handling Missing Values

The Wine_Quality.csv dataset has missing numerical data, and the students_Performance.mv.csv dataset has both numerical and categorical missing data.
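
As a minimal sketch of the two kinds of imputation mentioned above, the following fills numerical gaps with the column median and categorical gaps with the most frequent value. The column names and rows are made up for illustration and are not taken from Wine_Quality.csv or students_Performance.mv.csv. The choice of median and mode is an assumption, since the description does not say which strategy is used.

```python
from statistics import median, mode

def impute(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None replaced by the column median
    (numeric columns) or the column mode (categorical columns)."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical rows: "ph" is numerical, "grade" is categorical.
rows = [
    {"ph": 3.2, "grade": "B"},
    {"ph": None, "grade": "A"},
    {"ph": 3.6, "grade": None},
    {"ph": 3.4, "grade": "A"},
]
cleaned = impute(rows, numeric_cols=["ph"], categorical_cols=["grade"])
```

The function builds new dictionaries rather than editing the input, so the raw rows remain available for comparison after imputation.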
Data from: Data Cleaning and AutoML: Would an Optimizer Choose to Clean?
resodate.org
Updated Aug 5, 2022
Cite
Felix Neutatz; Binger Chen; Yazan Alkhatib; Jingwen Ye; Ziawasch Abedjan (2022). Data Cleaning and AutoML: Would an Optimizer Choose to Clean? [Dataset]. http://doi.org/10.14279/depositonce-15981
Explore at:
Unique identifier
https://doi.org/10.14279/depositonce-15981
Dataset updated
Aug 5, 2022
Dataset provided by
DepositOnce
Technische Universität Berlin
Authors
Felix Neutatz; Binger Chen; Yazan Alkhatib; Jingwen Ye; Ziawasch Abedjan
Description
Data cleaning is widely acknowledged as an important yet tedious task when dealing with large amounts of data. Thus, there is always a cost-benefit trade-off to consider. In particular, it is important to assess this trade-off when not every data point and data error is equally important for a task. This is often the case when statistical analysis or machine learning (ML) models derive knowledge about data. If we only care about maximizing the utility score of the applications, such as accuracy or F1 scores, many tasks can afford some degree of data quality problems. Recent studies analyzed the impact of various data error types on vanilla ML tasks, showing that missing values and outliers significantly impact the outcome of such models. In this paper, we expand the setting to one where data cleaning is not considered in isolation but as an equal parameter among many other hyper-parameters that influence feature selection, regularization, and model selection. In particular, we use state-of-the-art AutoML frameworks to automatically learn the parameters that benefit a particular ML binary classification task. In our study, we see that specific cleaning routines still play a significant role but can also be entirely avoided if the choice of a specific model or the filtering of specific features diminishes the overall impact.
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
Available download formats: zip (1262 bytes)
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
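
A few of the steps listed above can be sketched in plain Python. The dataset itself was prepared with Pandas; this sketch uses only the standard library so that it stays self-contained. The field names (`patient_id`, `gender`, `smoker`, `age`), the age-group boundaries, the 1.5 multiplier on the IQR, and the sample records are all hypothetical and not taken from the dataset.

```python
from statistics import quantiles

GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
YES_NO_MAP = {"y": "Yes", "yes": "Yes", "1": "Yes", "n": "No", "no": "No", "0": "No"}

def standardize(record):
    """Normalize gender and yes/no fields to one spelling each."""
    out = dict(record)
    out["gender"] = GENDER_MAP.get(str(record["gender"]).strip().lower(), "Unknown")
    out["smoker"] = YES_NO_MAP.get(str(record["smoker"]).strip().lower(), "Unknown")
    return out

def drop_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def age_group(age):
    """Derive a coarse age bucket (boundaries are an assumption)."""
    if age < 18:
        return "child"
    return "adult" if age < 65 else "senior"

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] for manual review."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

# Hypothetical raw records, including one exact duplicate.
raw = [
    {"patient_id": 1, "gender": " M ", "smoker": "yes", "age": 34},
    {"patient_id": 1, "gender": " M ", "smoker": "yes", "age": 34},
    {"patient_id": 2, "gender": "female", "smoker": "N", "age": 71},
    {"patient_id": 3, "gender": "F", "smoker": "0", "age": 12},
]
clean = [
    dict(r, age_group=age_group(r["age"]))
    for r in map(standardize, drop_duplicates(raw))
]
flagged = iqr_outliers([34, 36, 35, 33, 37, 36, 34, 120])
```

Outliers are returned for review rather than dropped, which mirrors the "detecting and reviewing" wording in the list above.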
Autonomous Data Cleaning With AI Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). Autonomous Data Cleaning With AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/autonomous-data-cleaning-with-ai-market
Explore at:
Available download formats: pptx, pdf, csv
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Autonomous Data Cleaning with AI Market Outlook

According to our latest research, the global Autonomous Data Cleaning with AI market size in 2024 reached USD 1.82 billion, reflecting a robust expansion driven by rapid digital transformation across industries. The market is experiencing a CAGR of 25.7% from 2025 to 2033, with forecasts indicating that the market will reach USD 14.4 billion by 2033. This remarkable growth is primarily attributed to the increasing demand for high-quality, reliable data to power advanced analytics and artificial intelligence initiatives, as well as the escalating complexity and volume of data in modern enterprises.

The surge in the adoption of artificial intelligence and machine learning technologies is a critical growth factor propelling the Autonomous Data Cleaning with AI market. Organizations are increasingly recognizing the importance of clean, accurate data as a foundational asset for digital transformation, predictive analytics, and data-driven decision-making. As data volumes continue to explode, manual data cleaning processes have become unsustainable, leading enterprises to seek autonomous solutions powered by AI algorithms. These solutions not only automate error detection and correction but also enhance data consistency, integrity, and usability across disparate systems, reducing operational costs and improving business agility.

Another significant driver for the Autonomous Data Cleaning with AI market is the rising regulatory pressure around data governance and compliance. Industries such as banking, finance, and healthcare are subject to stringent data quality requirements, necessitating robust mechanisms to ensure data accuracy and traceability. AI-powered autonomous data cleaning tools are increasingly being integrated into enterprise data management strategies to address these regulatory challenges. These tools help organizations maintain compliance, minimize the risk of data breaches, and avoid costly penalties, further fueling market growth as regulatory frameworks become more complex and widespread across global markets.

The proliferation of cloud computing and the shift towards hybrid and multi-cloud environments are also accelerating the adoption of Autonomous Data Cleaning with AI solutions. As organizations migrate workloads and data assets to the cloud, ensuring data quality across distributed environments becomes paramount. Cloud-based autonomous data cleaning platforms offer scalability, flexibility, and integration capabilities that are well-suited to dynamic enterprise needs. The growing ecosystem of cloud-native AI tools, combined with the increasing sophistication of data integration and orchestration platforms, is enabling businesses to deploy autonomous data cleaning at scale, driving substantial market expansion.

From a regional perspective, North America continues to dominate the Autonomous Data Cleaning with AI market, accounting for the largest revenue share in 2024. The region’s advanced technological infrastructure, high concentration of AI innovators, and early adoption by large enterprises are key factors supporting its leadership position. However, Asia Pacific is emerging as the fastest-growing regional market, fueled by rapid digitalization, expanding IT investments, and strong government initiatives supporting AI and data-driven innovation. Europe also remains a significant contributor, with increasing adoption in sectors such as banking, healthcare, and manufacturing. Overall, the global market exhibits a broadening geographic footprint, with opportunities emerging across both developed and developing economies.

Component Analysis

The Autonomous Data Cleaning with AI market is segmented by component into Software and Services. The software segment currently holds the largest share of the market, driven by the rapid advancement and deployment of AI-powered data cleaning platforms. These software solutions leverage sophisticated algorithms for anomaly detection, deduplication, data enrichment, and validation, providing organizations with automated tools to ensure data quality at scale. The increasing integration of machine learning and natural language processing (NLP) capabilities further enhances the effectiveness of these platforms, enabling them to address a wide range of data quality issues across structured and unstructured datasets.

The
Prediction data from: Machine learning predicts which rivers, streams, and...
datadryad.org
dataone.org
+1 more
zip
Updated Dec 10, 2023
Cite
Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro (2023). Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates [Dataset]. http://doi.org/10.5061/dryad.z34tmpgm7
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.z34tmpgm7
Dataset updated
Dec 10, 2023
Dataset provided by
Dryad
Authors
Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro
Time period covered
Sep 27, 2023
Description
This dataset contains model outputs that were analyzed to produce the main results of the paper.
Autonomous Data Cleaning with AI Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Oct 4, 2025
Cite
Growth Market Reports (2025). Autonomous Data Cleaning with AI Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/autonomous-data-cleaning-with-ai-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Oct 4, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Autonomous Data Cleaning with AI Market Outlook

According to our latest research, the global Autonomous Data Cleaning with AI market size reached USD 1.68 billion in 2024, with a robust year-on-year growth driven by the surge in enterprise data volumes and the mounting demand for high-quality, actionable insights. The market is projected to expand at a CAGR of 24.2% from 2025 to 2033, which will take the overall market value to approximately USD 13.1 billion by 2033. This rapid growth is fueled by the increasing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, aiming to automate and optimize the data cleaning process for improved operational efficiency and decision-making.

The primary growth driver for the Autonomous Data Cleaning with AI market is the exponential increase in data generation across various industries such as BFSI, healthcare, retail, and manufacturing. Organizations are grappling with massive amounts of structured and unstructured data, much of which is riddled with inconsistencies, duplicates, and inaccuracies. Manual data cleaning is both time-consuming and error-prone, leading businesses to seek automated AI-driven solutions that can intelligently detect, correct, and prevent data quality issues. The integration of AI not only accelerates the data cleaning process but also ensures higher accuracy, enabling organizations to leverage clean, reliable data for analytics, compliance, and digital transformation initiatives. This, in turn, translates into enhanced business agility and competitive advantage.

Another significant factor propelling the market is the increasing regulatory scrutiny and compliance requirements in sectors such as banking, healthcare, and government. Regulations such as GDPR, HIPAA, and others mandate strict data governance and quality standards. Autonomous Data Cleaning with AI solutions help organizations maintain compliance by ensuring data integrity, traceability, and auditability. Additionally, the evolution of cloud computing and the proliferation of big data analytics platforms have made it easier for organizations of all sizes to deploy and scale AI-powered data cleaning tools. These advancements are making autonomous data cleaning more accessible, cost-effective, and scalable, further driving market adoption.

The growing emphasis on digital transformation and real-time decision-making is also a crucial growth factor for the Autonomous Data Cleaning with AI market. As enterprises increasingly rely on analytics, machine learning, and artificial intelligence for business insights, the quality of input data becomes paramount. Automated, AI-driven data cleaning solutions enable organizations to process, cleanse, and prepare data in real-time, ensuring that downstream analytics and AI models are fed with high-quality inputs. This not only improves the accuracy of business predictions but also reduces the time-to-insight, helping organizations stay ahead in highly competitive markets.

From a regional perspective, North America currently dominates the Autonomous Data Cleaning with AI market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology companies, early adopters of AI, and a mature regulatory environment are key factors contributing to North America’s leadership. However, Asia Pacific is expected to witness the highest CAGR over the forecast period, driven by rapid digitalization, expanding IT infrastructure, and increasing investments in AI and data analytics, particularly in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also gradually emerging as promising markets, supported by growing awareness and adoption of AI-driven data management solutions.

Component Analysis

The Autonomous Data Cleaning with AI market is segmented by component into Software and Services. The software segment currently holds the largest market share, driven
Data pre-processing and clean-up
paper.erudition.co.in
html
Updated Dec 3, 2025
Cite
Einetic (2025). Data pre-processing and clean-up [Dataset]. https://paper.erudition.co.in/makaut/btech-in-computer-science-and-engineering-artificial-intelligence-and-machine-learning/6/data-mining
Explore at:
Available download formats: html
Dataset updated
Dec 3, 2025
Dataset authored and provided by
Einetic
License
https://paper.erudition.co.in/terms
Description
Question paper solutions for the chapter "Data pre-processing and clean-up" of Data Mining, 6th Semester, B.Tech in Computer Science & Engineering (Artificial Intelligence and Machine Learning).
Credit Card Approvals (Clean Data)
kaggle.com
zip
Updated Apr 25, 2022
Cite
Samuel Cortinhas (2022). Credit Card Approvals (Clean Data) [Dataset]. https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data
Explore at:
Available download formats: zip (19448 bytes)
Dataset updated
Apr 25, 2022
Authors
Samuel Cortinhas
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains a cleaned version of the credit card approvals dataset from the UCI Machine Learning Repository.

Missing values have been filled, and feature names and category names have been inferred, which adds context and makes the data easier to use.

Your task is to predict which people in the dataset are successful in applying for a credit card.
Credit Score Classification Cleaned Dataset
kaggle.com
zip
Updated Nov 26, 2024
Cite
irem nur tokuroglu (2024). Credit Score Classification Cleaned Dataset [Dataset]. https://www.kaggle.com/datasets/iremnurtokuroglu/credit-score-classification-cleaned-dataset
Explore at:
Available download formats: zip (4159334 bytes)
Dataset updated
Nov 26, 2024
Authors
irem nur tokuroglu
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This notebook is a cleaned version of the Credit Score classification dataset, with comprehensive EDA (Exploratory Data Analysis) already performed. You can apply preprocessing steps to any columns as needed and use it as a training dataset for machine learning and deep learning models.

The original dataset is available at: https://www.kaggle.com/datasets/parisrohan/credit-score-classification
Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...
data.europa.eu
unknown
Updated Jul 3, 2025
Cite
Zenodo (2025). ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5244636?locale=lv
Explore at:
Available download formats: unknown (1052407809)
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was gathered from GitHub on Sep. 17, 2020. From v0.7 onward it comes in clean and complete versions: the clean version has 5.1K type-checked Python repositories and 1.2M type annotations, and the complete version has 5.2K Python repositories and 3.3M type annotations. The source files of the clean version are type-checked using mypy. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Training data from: Machine learning predicts which rivers, streams, and...
search.dataone.org
data.niaid.nih.gov
+1 more
Updated Jun 21, 2024
+ more versions
Cite
Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro (2024). Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates [Dataset]. http://doi.org/10.5061/dryad.m63xsj47s
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.m63xsj47s
Dataset updated
Jun 21, 2024
Dataset provided by
Dryad Digital Repository
Authors
Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro
Time period covered
Jan 1, 2023
Description
We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.

This dataset contains data used to train the models.

# Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

This dataset contains data used to train the models in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to the full set of pre-processed inputs for model training via this repository. We are also providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper, in another Dryad repository: . All code written for the project is available at .

Description of the data and file structure

The files here include:

Trained models, saved in PyTorch Checkpoint format: wotus_model.pth.tar, resource_type_model.pth.tar, cowardin_code_model.pth.tar, ajd_model.pth.tar.

...
Data Clean Room For AI Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Cite
Dataintelo (2025). Data Clean Room For AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-clean-room-for-ai-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Clean Room for AI Market Outlook

According to our latest research, the global Data Clean Room for AI market size is valued at USD 1.42 billion in 2024, with a robust compound annual growth rate (CAGR) of 27.6% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach USD 13.34 billion. The primary growth factor driving this market is the surge in privacy-centric data collaboration solutions, fueled by regulatory tightening and the need for secure, compliant AI-driven analytics across industries.

The exponential expansion of digital data, coupled with increasing consumer privacy concerns and regulatory frameworks such as GDPR and CCPA, are propelling the adoption of Data Clean Rooms for AI. Organizations are seeking innovative ways to harness data-driven insights without compromising user privacy. Data Clean Rooms address this imperative by enabling secure, privacy-preserving collaboration between multiple parties, facilitating advanced analytics and machine learning without exposing raw data. As a result, enterprises across sectors, including advertising, healthcare, and finance, are rapidly integrating Data Clean Room solutions to unlock the full potential of AI while ensuring compliance and trust.

Another significant growth driver for the Data Clean Room for AI market is the escalating complexity and volume of data generated by digital transformation initiatives. As businesses transition to omnichannel customer engagement and leverage AI for hyper-personalization, the need for secure environments to aggregate and analyze sensitive datasets becomes paramount. Data Clean Rooms enable seamless data interoperability between partners, advertisers, and platforms, which is especially critical for industries like retail and e-commerce where first-party data collaboration can yield substantial competitive advantages. This secure data sharing model not only mitigates risks associated with data breaches but also enhances the accuracy and effectiveness of AI models, further fueling market expansion.

The proliferation of cloud-based solutions and the integration of advanced AI and machine learning capabilities within Data Clean Rooms are catalyzing market growth. Cloud deployment offers scalability, flexibility, and cost-efficiency, making Data Clean Rooms accessible to organizations of all sizes. Moreover, the rapid evolution of AI algorithms and privacy-enhancing technologies, such as federated learning and differential privacy, are enhancing the utility and adoption of Data Clean Rooms. This technological convergence is enabling organizations to extract actionable insights from siloed data sources while maintaining stringent privacy controls, thus accelerating the market's upward trajectory.

Regionally, North America dominates the Data Clean Room for AI market, accounting for the largest revenue share in 2024. This leadership is attributed to the early adoption of privacy-centric technologies, a mature digital ecosystem, and the presence of major technology providers. Europe follows closely, driven by strict data protection regulations and a strong emphasis on ethical AI practices. The Asia Pacific region is expected to exhibit the fastest growth during the forecast period, propelled by rapid digitalization, burgeoning e-commerce, and increasing investments in AI infrastructure. Latin America and the Middle East & Africa are also witnessing steady adoption, particularly in sectors like banking, retail, and healthcare, as organizations seek secure data collaboration frameworks to drive innovation and operational efficiency.

Component Analysis

The Data Clean Room for AI market is segmented by component into Software and Services, each playing a pivotal role in shaping the industry landscape. Software solutions form the backbone of Data Clean Room offerings, providing secure environments for data collaboration, privacy-preserving analytics, and AI model training. These platforms are increasingly leveraging sophisticated cryptographic techniques, access controls, and machine learning algorithms to ensure data security and compliance. The growing demand for customizable, scalable, and interoperable software solutions is driving continuous innovation, with vendors focusing on enhancing user experience, integration capabilities, and automation features to cater to diverse industry requirements.

Data from: Leveraging Supervised Machine Learning Algorithms for System...
acs.figshare.com
zip
Updated Sep 3, 2024
Cite
Russell R. Kibbe; Alexandria L. Sohn; David C. Muddiman (2024). Leveraging Supervised Machine Learning Algorithms for System Suitability Testing of Mass Spectrometry Imaging Platforms [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00360.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jproteome.4c00360.s001
Dataset updated
Sep 3, 2024
Dataset provided by
ACS Publications
Authors
Russell R. Kibbe; Alexandria L. Sohn; David C. Muddiman
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Quality control and system suitability testing are vital protocols implemented to ensure the repeatability and reproducibility of data in mass spectrometry investigations. However, mass spectrometry imaging (MSI) analyses present added complexity since both chemical and spatial information are measured. Herein, we employ various machine learning algorithms and a novel quality control mixture to classify the working conditions of an MSI platform. Each algorithm was evaluated in terms of its performance on unseen data, validated with negative control data sets to rule out confounding variables or chance agreement, and utilized to determine the necessary sample size to achieve a high level of accurate classifications. In this work, a robust machine learning workflow was established where models could accurately classify the instrument condition as clean or compromised based on data metrics extracted from the analyzed quality control sample. This work highlights the power of machine learning to recognize complex patterns in MSI data and use those relationships to perform a system suitability test for MSI platforms.
(Cleaned) Credit Score Dataset for Classification
kaggle.com
zip
Updated Dec 8, 2023
Cite
muhammed (2023). (Cleaned) Credit Score Dataset for Classification [Dataset]. https://www.kaggle.com/clkmuhammed/creditscoreclassification
Explore at:
Available download formats: zip (8917124 bytes)
Dataset updated
Dec 8, 2023
Authors
muhammed
Description
Original Dataset: https://www.kaggle.com/datasets/parisrohan/credit-score-classification

Data Cleaning: https://www.kaggle.com/code/clkmuhammed/credit-score-classification-data-cleaning-project

Data cleaned and made ready for machine learning.

Task: Given a person's credit-related information, build a machine learning model that can classify the credit score.
A Dataset with Adversarial Attacks on Deep Learning in Wireless Modulation...
ieee-dataport.org
Updated Sep 23, 2023
+ more versions
Cite
Antonios Argyriou (2023). A Dataset with Adversarial Attacks on Deep Learning in Wireless Modulation Classification [Dataset]. https://ieee-dataport.org/documents/dataset-adversarial-attacks-deep-learning-wireless-modulation-classification
Explore at:
Dataset updated
Sep 23, 2023
Authors
Antonios Argyriou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains adversarial attacks on Deep Learning (DL) when it is employed for the classification of wireless modulated communication signals. The attack is executed with an obfuscating waveform that is embedded in the transmitted signal in such a way that prevents the extraction of clean data for training from a wireless eavesdropper. At the same time it allows a legitimate receiver (LRx) to demodulate the data.
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
explore.openaire.eu
data.europa.eu
Updated Apr 26, 2021
+ more versions
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4044635
Dataset updated
Apr 26, 2021
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
Description
The dataset was gathered from GitHub on Sep. 17, 2020. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Data Wrangling Market Size, Share, Growth, Forecast, By Component...
verifiedmarketresearch.com
Updated Jun 18, 2025
Cite
VERIFIED MARKET RESEARCH (2025). Data Wrangling Market Size, Share, Growth, Forecast, By Component (Solutions, Services), By Deployment Mode (On-premises, Cloud-based), By End-user Industry (Banking, Financial Services, and Insurance (BFSI), Healthcare & Life Sciences, Retail & E-commerce, IT & Telecom, Government & Public Sector, Manufacturing) [Dataset]. https://www.verifiedmarketresearch.com/product/data-wrangling-market/
Dataset updated
Jun 18, 2025
Dataset provided by
Verified Market Research https://www.verifiedmarketresearch.com/
Authors
VERIFIED MARKET RESEARCH
License
https://www.verifiedmarketresearch.com/privacy-policy/
Time period covered
2026 - 2032
Area covered
Global
Description
Data Wrangling Market size was valued at USD 1.99 Billion in 2024 and is projected to reach USD 4.07 Billion by 2032, growing at a CAGR of 9.4% during the forecast period 2026-2032.
• Big Data Analytics Growth: Organizations are generating massive volumes of unstructured and semi-structured data from diverse sources including social media, IoT devices, and digital transactions. Data wrangling tools become essential for cleaning, transforming, and preparing this complex data for meaningful analytics and business intelligence applications.
• Machine Learning and AI Adoption: The rapid expansion of artificial intelligence and machine learning initiatives requires high-quality, properly formatted training datasets. Data wrangling solutions enable data scientists to efficiently prepare, clean, and structure raw data for model training, driving sustained market demand across AI-focused organizations.
Salt and Pepper Noise Dataset: Clean vs Noisy Image
gts.ai
json
Updated Jan 15, 2025
Cite
GTS (2025). Salt and Pepper Noise Dataset: Clean vs Noisy Image [Dataset]. https://gts.ai/dataset-download/salt-and-pepper-noise-datasetclean-vs-noisy-image/
Available download formats: json
Dataset updated
Jan 15, 2025
Dataset provided by
Globose Technology Solutions Private Limited
Authors
GTS
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Explore the Salt and Pepper Noise Dataset with clean and noisy images for image processing and computer vision research.
FileMarket | Dataset for Face Anti-Spoofing (Videos) in Computer Vision...
datarade.ai
Updated Jul 10, 2024
Cite
FileMarket (2024). FileMarket | Dataset for Face Anti-Spoofing (Videos) in Computer Vision Applications | Machine Learning (ML) Data | Deep Learning (DL) Data [Dataset]. https://datarade.ai/data-products/filemarket-dataset-for-face-anti-spoofing-videos-in-compu-filemarket
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
Dataset updated
Jul 10, 2024
Dataset authored and provided by
FileMarket
Area covered
Mali, Malawi, Belarus, Zimbabwe, Sierra Leone, South Africa, Central African Republic, Ukraine, Chad, Congo (Democratic Republic of the)
Description
Live Face Anti-Spoof Dataset

A live face dataset is crucial for advancing computer vision tasks such as face detection, anti-spoofing detection, and face recognition. The Live Face Anti-Spoof Dataset offered by Ainnotate is specifically designed to train algorithms for anti-spoofing purposes, ensuring that AI systems can accurately differentiate between real and fake faces in various scenarios.

Key Features:

Comprehensive Video Collection: The dataset features thousands of videos showcasing a diverse range of individuals, including males and females, with and without glasses. It also includes men with beards, mustaches, and clean-shaven faces.
Lighting Conditions: Videos are captured in both indoor and outdoor environments, ensuring that the data covers a wide range of lighting conditions, making it highly applicable for real-world use.
Data Collection Method: Our datasets are gathered through a community-driven approach, leveraging our extensive network of over 700k users across various Telegram apps. This method ensures that the data is not only diverse but also ethically sourced with full consent from participants, providing reliable and real-world applicable data for training AI models.
Versatility: This dataset is ideal for training models in face detection, anti-spoofing, and face recognition tasks, offering robust support for these essential computer vision applications.

In addition to the Live Face Anti-Spoof Dataset, FileMarket provides specialized datasets across various categories to support a wide range of AI and machine learning projects:

Object Detection Data: Perfect for training AI in image and video analysis.
Machine Learning (ML) Data: Offers a broad spectrum of applications, from predictive analytics to natural language processing (NLP).
Large Language Model (LLM) Data: Designed to support text generation, chatbots, and machine translation models.
Deep Learning (DL) Data: Essential for developing complex neural networks and deep learning models.
Biometric Data: Includes diverse datasets for facial recognition, fingerprint analysis, and other biometric applications.

This live face dataset, alongside our other specialized data categories, empowers your AI projects by providing high-quality, diverse, and comprehensive datasets. Whether your focus is on anti-spoofing detection, face recognition, or other biometric and machine learning tasks, our data offerings are tailored to meet your specific needs.

