81 datasets found

D
Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-cleaning-tools-market
Explore at:
pptx, pdf, csvAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Cleaning Tools Market Outlook

As of 2023, the global market size for data cleaning tools is estimated at $2.5 billion, with projections indicating that it will reach approximately $7.1 billion by 2032, reflecting a robust CAGR of 12.1% during the forecast period. This growth is primarily driven by the increasing importance of data quality in business intelligence and analytics workflows across various industries.

The growth of the data cleaning tools market can be attributed to several critical factors. Firstly, the exponential increase in data generation across industries necessitates efficient tools to manage data quality. Poor data quality can result in significant financial losses, inefficient business processes, and faulty decision-making. Organizations recognize the value of clean, accurate data in driving business insights and operational efficiency, thereby propelling the adoption of data cleaning tools. Additionally, regulatory requirements and compliance standards also push companies to maintain high data quality standards, further driving market growth.

Another significant growth factor is the rising adoption of AI and machine learning technologies. These advanced technologies rely heavily on high-quality data to deliver accurate results. Data cleaning tools play a crucial role in preparing datasets for AI and machine learning models, ensuring that the data is free from errors, inconsistencies, and redundancies. This surge in the use of AI and machine learning across various sectors like healthcare, finance, and retail is driving the demand for efficient data cleaning solutions.

The proliferation of big data analytics is another critical factor contributing to market growth. Big data analytics enables organizations to uncover hidden patterns, correlations, and insights from large datasets. However, the effectiveness of big data analytics is contingent upon the quality of the data being analyzed. Data cleaning tools help in sanitizing large datasets, making them suitable for analysis and thus enhancing the accuracy and reliability of analytics outcomes. This trend is expected to continue, fueling the demand for data cleaning tools.

In terms of regional growth, North America holds a dominant position in the data cleaning tools market. The region's strong technological infrastructure, coupled with the presence of major market players and a high adoption rate of advanced data management solutions, contributes to its leadership. However, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period. The rapid digitization of businesses, increasing investments in IT infrastructure, and a growing focus on data-driven decision-making are key factors driving the market in this region.

As organizations strive to maintain high data quality standards, the role of an Email List Cleaning Service becomes increasingly vital. These services ensure that email databases are free from invalid addresses, duplicates, and outdated information, thereby enhancing the effectiveness of marketing campaigns and communications. By leveraging sophisticated algorithms and validation techniques, email list cleaning services help businesses improve their email deliverability rates and reduce the risk of being flagged as spam. This not only optimizes marketing efforts but also protects the reputation of the sender. As a result, the demand for such services is expected to grow alongside the broader data cleaning tools market, as companies recognize the importance of maintaining clean and accurate contact lists.

Component Analysis

The data cleaning tools market can be segmented by component into software and services. The software segment encompasses various tools and platforms designed for data cleaning, while the services segment includes consultancy, implementation, and maintenance services provided by vendors.

The software segment holds the largest market share and is expected to continue leading during the forecast period. This dominance can be attributed to the increasing adoption of automated data cleaning solutions that offer high efficiency and accuracy. These software solutions are equipped with advanced algorithms and functionalities that can handle large volumes of data, identify errors, and correct them without manual intervention. The rising adoption of cloud-based data cleaning software further bolsters this segment, as it offers scalability and ease of

Restaurant Sales-Dirty Data for Cleaning Training

kaggle.com

Updated Jan 25, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Restaurant Sales-Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/restaurant-sales-dirty-data-for-cleaning-training

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 25, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Restaurant Sales Dataset with Dirt Documentation

Overview

The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.

Dataset Use Cases

This dataset is suitable for: - Practicing data cleaning tasks, such as handling missing values and deducing missing information. - Conducting exploratory data analysis (EDA) to study restaurant sales patterns. - Feature engineering to create new variables for machine learning tasks.

Columns Description

Column Name	Description	Example Values
`Order ID`	A unique identifier for each order.	`ORD_123456`
`Customer ID`	A unique identifier for each customer.	`CUST_001`
`Category`	The category of the purchased item.	`Main Dishes`, `Drinks`
`Item`	The name of the purchased item. May contain missing values due to data dirt.	`Grilled Chicken`, `None`
`Price`	The static price of the item. May contain missing values.	`15.0`, `None`
`Quantity`	The quantity of the purchased item. May contain missing values.	`1`, `None`
`Order Total`	The total price for the order (`Price * Quantity`). May contain missing values.	`45.0`, `None`
`Order Date`	The date when the order was placed. Always present.	`2022-01-15`
`Payment Method`	The payment method used for the transaction. May contain missing values due to data dirt.	`Cash`, `None`

Key Characteristics

Data Dirtiness:
- Missing values in key columns (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
- At least one of the following conditions is ensured for each record to identify an item:
  - Item is present.
  - Price is present.
  - Both Quantity and Order Total are present.
- If Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).
Menu Categories and Items:
- Items are divided into five categories:
  - Starters: E.g., Chicken Melt, French Fries.
  - Main Dishes: E.g., Grilled Chicken, Steak.
  - Desserts: E.g., Chocolate Cake, Ice Cream.
  - Drinks: E.g., Coca Cola, Water.
  - Side Dishes: E.g., Mashed Potatoes, Garlic Bread.

3 Time Range: - Orders span from January 1, 2022, to December 31, 2023.

Cleaning Suggestions

Handle Missing Values:
- Fill missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
- Deduce missing Price from Order Total / Quantity if both are available.
Validate Data Consistency:
- Ensure that calculated values (Order Total = Price * Quantity) match.
Analyze Missing Patterns:
- Study the distribution of missing values across categories and payment methods.

Menu Map with Prices and Categories

Category	Item	Price
Starters	Chicken Melt	8.0
Starters	French Fries	4.0
Starters	Cheese Fries	5.0
Starters	Sweet Potato Fries	5.0
Starters	Beef Chili	7.0
Starters	Nachos Grande	10.0
Main Dishes	Grilled Chicken	15.0
Main Dishes	Steak	20.0
Main Dishes	Pasta Alfredo	12.0
Main Dishes	Salmon	18.0
Main Dishes	Vegetarian Platter	14.0
Desserts	Chocolate Cake	6.0
Desserts	Ice Cream	5.0
Desserts	Fruit Salad	4.0
Desserts	Cheesecake	7.0
Desserts	Brownie	6.0
Drinks	Coca Cola	2.5
Drinks	Orange Juice	3.0
Drinks ...

d
Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...
datarade.ai
.json, .csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
Explore at:
.json, .csvAvailable download formats
Dataset provided by
Xverum LLC
Authors
Xverum
Area covered
Dominican Republic, India, United Kingdom, Western Sahara, Norway, Barbados, Jordan, Oman, Sint Maarten (Dutch part), Cook Islands
Description
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

What Makes Our Data Unique?

Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

Primary Use Cases and Verticals

Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

Contact us for sample datasets or to discuss your specific needs.
D
Data Cleansing Tools Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Data Cleansing Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-cleansing-tools-market
Explore at:
pdf, csv, pptxAvailable download formats
Dataset updated
Jan 7, 2025
Authors
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Cleansing Tools Market Outlook

The global data cleansing tools market size was valued at approximately USD 1.5 billion in 2023 and is projected to reach USD 4.2 billion by 2032, growing at a CAGR of 12.1% from 2024 to 2032. One of the primary growth factors driving the market is the increasing need for high-quality data in various business operations and decision-making processes.

The surge in big data and the subsequent increased reliance on data analytics are significant factors propelling the growth of the data cleansing tools market. Organizations increasingly recognize the value of high-quality data in driving strategic initiatives, customer relationship management, and operational efficiency. The proliferation of data generated across different sectors such as healthcare, finance, retail, and telecommunications necessitates the adoption of tools that can clean, standardize, and enrich data to ensure its reliability and accuracy.

Furthermore, the rising adoption of Machine Learning (ML) and Artificial Intelligence (AI) technologies has underscored the importance of clean data. These technologies rely heavily on large datasets to provide accurate and reliable insights. Any errors or inconsistencies in data can lead to erroneous outcomes, making data cleansing tools indispensable. Additionally, regulatory and compliance requirements across various industries necessitate the maintenance of clean and accurate data, further driving the market for data cleansing tools.

The growing trend of digital transformation across industries is another critical growth factor. As businesses increasingly transition from traditional methods to digital platforms, the volume of data generated has skyrocketed. However, this data often comes from disparate sources and in various formats, leading to inconsistencies and errors. Data cleansing tools are essential in such scenarios to integrate data from multiple sources and ensure its quality, thus enabling organizations to derive actionable insights and maintain a competitive edge.

In the context of ensuring data reliability and accuracy, Data Quality Software and Solutions play a pivotal role. These solutions are designed to address the challenges associated with managing large volumes of data from diverse sources. By implementing robust data quality frameworks, organizations can enhance their data governance strategies, ensuring that data is not only clean but also consistent and compliant with industry standards. This is particularly crucial in sectors where data-driven decision-making is integral to business success, such as finance and healthcare. The integration of advanced data quality solutions helps businesses mitigate risks associated with poor data quality, thereby enhancing operational efficiency and strategic planning.

Regionally, North America is expected to hold the largest market share due to the early adoption of advanced technologies, robust IT infrastructure, and the presence of key market players. Europe is also anticipated to witness substantial growth due to stringent data protection regulations and the increasing adoption of data-driven decision-making processes. Meanwhile, the Asia Pacific region is projected to experience the highest growth rate, driven by the rapid digitalization of emerging economies, the expansion of the IT and telecommunications sector, and increasing investments in data management solutions.

Component Analysis

The data cleansing tools market is segmented into software and services based on components. The software segment is anticipated to dominate the market due to its extensive use in automating the data cleansing process. The software solutions are designed to identify, rectify, and remove errors in data sets, ensuring data accuracy and consistency. They offer various functionalities such as data profiling, validation, enrichment, and standardization, which are critical in maintaining high data quality. The high demand for these functionalities across various industries is driving the growth of the software segment.

On the other hand, the services segment, which includes professional services and managed services, is also expected to witness significant growth. Professional services such as consulting, implementation, and training are crucial for organizations to effectively deploy and utilize data cleansing tools. As businesses increasingly realize the importance of clean data, the demand for expert
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
British Indian Ocean Territory, Moldova (Republic of), Bangladesh, Isle of Man, Northern Mariana Islands, Taiwan, Canada, Nepal, Tunisia, Andorra
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Data from: COVID-19 Case Surveillance Public Use Data with Geography
data.cdc.gov
data.virginia.gov
+5more
application/rdfxml +5
Updated Jul 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data with Geography [Dataset]. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4
Explore at:
application/rssxml, csv, tsv, application/rdfxml, xml, jsonAvailable download formats
Dataset updated
Jul 9, 2024
Dataset provided by
Centers for Disease Control and Preventionhttp://www.cdc.gov/
Authors
CDC Data, Analytics and Visualization Task Force
License
https://www.usa.gov/government-workshttps://www.usa.gov/government-works
Description
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

This case surveillance public use dataset has 19 elements for all COVID-19 cases shared with CDC and includes demographics, geography (county and state of residence), any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors.

Currently, CDC provides the public with three versions of COVID-19 case surveillance line-listed data: this 19 data element dataset with geography, a 12 data element public use dataset, and a 33 data element restricted access dataset.

The following apply to the public use datasets and the restricted access dataset:
Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers.
Some data are suppressed to protect individual privacy.
Datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the current datasets. This 14-day lag allows case reporting to be stabilized and ensure that time-dependent outcome data are accurately captured.
Datasets are updated monthly.
Datasets are created using CDC’s Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy.
For more information about data collection and reporting, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/about-us-cases-deaths.html.
For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html

Overview

The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

For more information: NNDSS Supports the COVID-19 Response | CDC.

COVID-19 Case Reports COVID-19 case reports are routinely submitted to CDC by public health jurisdictions using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19. Current versions of these case definitions are available at: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/. All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for lab-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. States and territories continue to use this form.

Data are Considered Provisional

The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.

Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

Data Limitations

To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

Data Quality Assurance Procedures

CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question "Was the individual hospitalized?" where the possible answer choices include "Yes," "No," or "Unknown," the blank value is recoded to "Missing" because the case report form did not include a response to the question.
Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
Additional data quality processing to recode free text data is ongoing. Data on symptoms, race, ethnicity, and healthcare worker status have been prioritized.

Data Suppression

To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<11 COVID-19 case records with a given values). Suppression includes low frequency combinations of case month, geographic characteristics (county and state of residence), and demographic characteristics (sex, age group, race, and ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.

Additional COVID-19 Data

COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These and other COVID-19 data are available from multiple public locations: COVID Data Tracker; United States COVID-19 Cases and Deaths by State; COVID-19 Vaccination Reporting Data Systems; and COVID-19 Death Data and Resources.

Notes:

March 1, 2022: The "COVID-19 Case Surveillance Public Use Data with Geography" will be updated on a monthly basis.

April 7, 2022: An adjustment was made to CDC’s cleaning algorithm for COVID-19 line level case notification data. An assumption in CDC's algorithm led to misclassifying deaths that were not COVID-19 related. The algorithm has since been revised, and this dataset update reflects corrected individual level information about death status for all cases collected to date.

June 25, 2024: An adjustment
l
LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
Explore at:
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v2
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)

April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk) Supervised by Prof Alexander Gorban and Dr Evgeny MirkesThe data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of cleaning procedure are explained in Step 6.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.Getting StartedThis text provides the information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the meaning of research texts and make it available for use in Natural Language Processing projects.LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:1. Authors: The list of authors of the paper2. Title: The title of the paper 3. Abstract: The abstract of the paper 4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’. 5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’. 6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4] 7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.Data ProcessingStep 1: Downloading of the Data Online

The dataset is collected manually by exporting documents as Tab-delimitated files online. All documents are available online.Step 2: Importing the Dataset to R

The LSC was collected as TXT files. All documents are extracted to R.Step 3: Cleaning the Data from Documents with Empty Abstract or without CategoryAs our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.Step 4: Identification and Correction of Concatenate Words in AbstractsEspecially medicine-related publications use ‘structured abstracts’. Such type of abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. Used tool for extracting abstracts leads concatenate words of section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. The detection and identification of such words is done by sampling of medicine-related publications with human intervention. Detected concatenate words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.The section headings in such abstracts are listed below:

Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material (s) Rationale(s) Implications for health and nursing policyStep 5: Extracting (Sub-setting) the Data Based on Lengths of AbstractsAfter correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].According to APA style manual [6], an abstract should contain between 150 to 250 words. In LSC, we decided to limit length of abstracts from 30 to 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of the length to the analysis.

Step 6: [Version 2] Cleaning Copyright Notices, Permission polices, Journal Names and Conference Names from LSC Abstracts in Version 1Publications can include a footer of copyright notice, permission policy, journal name, licence, author’s right or conference name below the text of abstract by conferences and journals. Used tool for extracting and processing abstracts in WoS database leads to attached such footers to the text. For example, our casual observation yields that copyright notices such as ‘Published by Elsevier ltd.’ is placed in many texts. To avoid abnormal appearances of words in further analysis of words such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licenses and permission policies identiﬁed by sampling of abstracts.Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of AbstractsThe cleaning procedure described in previous step leaded to some abstracts having less than our minimum length criteria (30 words). 474 texts were removed.Step 8: Saving the Dataset into CSV FormatDocuments are saved into 34 CSV files. In CSV files, the information is organised with one record on each line and parts of abstract, title, list of authors, list of categories, list of research areas, and times cited is recorded in fields.To access the LSC for research purposes, please email to ns433@le.ac.uk.References[1]Web of Science. (15 July). Available: https://apps.webofknowledge.com/ [2]WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [3]Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html [4]Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US [5]Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3 [6]A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
f
Data and tools for studying isograms
figshare.com
Updated Jul 31, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5245810.v1
Dataset updated
Jul 31, 2017
Dataset provided by
figshare
Authors
Florian Breit
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).Below follows a brief description, first, of the included datasets and, second, of the included scripts.1. DatasetsThe data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.1.1 CSV formatThe CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.The CSV files contain one row per data point, with the colums separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

Label Data type Description

isogramy int The order of isogramy, e.g. "2" is a second order isogram

length int The length of the word in letters

word text The actual word/isogram in ASCII

source_pos text The Part of Speech tag from the original corpus

count int Token count (total number of occurences)

vol_count int Volume count (number of different sources which contain the word)

count_per_million int Token count per million words

vol_count_as_percent int Volume count as percentage of the total number of volumes

is_palindrome bool Whether the word is a palindrome (1) or not (0)

is_tautonym bool Whether the word is a tautonym (1) or not (0)

The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

Label

Data type

Description

!total_1grams

int

The total number of words in the corpus

!total_volumes

int

The total number of volumes (individual sources) in the corpus

!total_isograms

int

The total number of isograms found in the corpus (before compacting)

!total_palindromes

int

How many of the isograms found are palindromes

!total_tautonyms

int

How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.1.2 SQLite database formatOn the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:• Compacted versions of each dataset, where identical headwords are combined into a single entry.• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.The intersected dataset is by far the least noisy, but is missing some real isograms, too.The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.To get an idea of the various ways the database can be queried for various bits of data see the R script described below, which computes statistics based on the SQLite database.2. ScriptsThere are three scripts: one for tiding Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).2.1 Source dataThe scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html], (download all.al.gz).For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.2.2 Data preparationBefore processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.Tidying and reformatting can be done by running one of the following commands:python isograms.py --ngrams --indir=INDIR --outfile=OUTFILEpython isograms.py --bnc --indir=INFILE --outfile=OUTFILEReplace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.2.3 Isogram ExtractionAfter preparing the data as above, isograms can be extracted from by running the following command on the reformatted and tidied files:python isograms.py --batch --infile=INFILE --outfile=OUTFILEHere INFILE should refer the the output from the previosu data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.2.4 Creating a SQLite3 databaseThe output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them, if you only want to load one, just create an empty file for the other one).2. Copy the “create-database.sql” script into the same directory as the two data files.3. On the command line, go to the directory where the files and the SQL script are. 4. Type: sqlite3 isograms.db 5. This will create a database called “isograms.db”.See the section 1 for a basic descript of the output data and how to work with the database.2.5 Statistical processingThe repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
Alpaca Cleaned
kaggle.com
huggingface.co
Updated Nov 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 26, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Alpaca Cleaned

Improving Pretrained Language Model Understanding

By Huggingface Hub [source]

About this dataset

Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, capable of taking you beyond standard Natural Language Processing (NLP) abilities! This curated, cleaned dataset provides you with over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine - all in English (BCP-47 en). Improve the quality of your language models with fields such as instruction, output, and input which have been designed to enhance every aspect of their comprehension. The data here has gone through rigorous cleaning to ensure there are no errors or biases present; allowing you to trust that this data will result in improved performance for any language model that uses it! Get ready to see what Alpaca can do for your NLP needs

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

To make the most out of this dataset it is recommended to:

Familiarize yourself with the instructions in the instruction column as these provide guidance on how to use the other two columns; input and output.

Once comfortable with understanding the instructions columns move onto exploring what you are provided within each 14 sets of triplets – instruction, output and input – that are included in this clean version of Alpaca.

Read through many examples paying attention to any areas where you feel more clarification could be added or could be further improved upon for better understanding of language models however bear in mind that these examples have been cleaned from any errors or biases found from original dataset

Get inspired! As mentioned earlier there are more than 52k sets provided meaning having much flexibility for varying training strategies or unique approaches when creating your own language model!

Finally while not essential it may be helpful to have familiarity with OpenAI's text-davinci engine as well as enjoy playing around with different parameters/options depending on what type of outcomes you wish achieve

Research Ideas

Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.

Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.

Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description | |:----------------|:-------------------------------------------------------------------------| | instruction | This column contains the instructions for the language model. (Text) | | output | This column contains the expected output from the language model. (Text) | | input | This column contains the input given to the language model. (Text) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
Q&A for Admission of Higher Education Institution
kaggle.com
Updated Oct 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jocelyn Dumlao (2024). Q&A for Admission of Higher Education Institution [Dataset]. https://www.kaggle.com/datasets/jocelyndumlao/q-and-a-for-admission-of-higher-education-institution/discussion?sort=undefined
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 14, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Jocelyn Dumlao
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset Question Answering for Admission of Higher Education Institution

Description

The data collection process commenced with web scraping of a selected higher education institution's website, collecting any data that relates to the admission topic of higher education institutions, during the period from July to September 2023. This resulted in a raw dataset primarily cantered around admission-related content. Subsequently, meticulous data cleaning and organization procedures were implemented to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. Following this, a comprehensive annotation process was conducted to enrich the dataset with specific admission-related information, transforming it into secondary data. Both primary and secondary data predominantly remained in the Indonesian language. To enhance data quality, we added filters to remove or exclude: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This meticulous curation has culminated in the creation of a finalized dataset, meticulously prepared and now readily available for research and analysis in the domain of higher education admission.

Categories:

Computer Science, Education, Marketing, Natural Language Processing

Acknowledgements & Source

Emny Yossy,Derwin Suhartono,Agung Trisetyarso,Widodo Budiharto

Data Source: Mendeley Data

A dataset of Data Subject Access Request Packages

zenodo.org

Updated Jul 5, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Nicola Leschke; Nicola Leschke; Daniela Pöhn; Daniela Pöhn; Frank Pallas; Frank Pallas (2024). A dataset of Data Subject Access Request Packages [Dataset]. http://doi.org/10.5281/zenodo.11634938

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.11634938

Dataset updated

Jul 5, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Nicola Leschke; Nicola Leschke; Daniela Pöhn; Daniela Pöhn; Frank Pallas; Frank Pallas

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Overview

This dataset is a minimal example of Data Subject Access Request Packages (SARPs), as they can be retrieved under data protection laws, specifically the GDPR. It includes data from two data subjects, each with accounts for five major sevices, namely Amazon, Apple, Facebook, Google, and Linkedin.

Purpose and Usage

This dataset is meant to be an initial dataset that allows for manual exploration of structures and contents found in SARPs. Hence, the number of controllers and user profiles should be minimal but sufficient to allow cross-subject and cross-controller analysis. This dataset can be used to explore structures, formats and data types found in real-world SARPs. Thereby, the planning of future SARP-based research projects and studies shall be facilitated.

We invite other researchers to use this dataset to explore the structure of SARPs. The envisioned primary usage includes the development of user-centric privacy interfaces and other technical contributions in the area of data access rights. Moreover, these packages can also be used for examplified data analyses, although no substantive research questions can be answered using this data. In particular, this data does not reflect how data subjects behave in real world. However, it is representative enough to give a first impression on the types of data analysis possible when using real world data.

Data Generation

In order to allow cross-subject analysis, while keeping the re-identification risk minimal, we used research-only accounts for the data generation. A detailed explanation of the data generation method can be found in the paper corresponding to the dataset, accepted for the Annual Privacy Forum 2024.

In short, two user profiles were designed and corresponding accounts were created for each of the five services. Then, those accounts were used for two to four month. During the usage period, we minimized the amount of identifying data and also avoided interactions with data subjects not part of this research. Afterwards, we performed a data access request via the controller's web interface. Finally, the data was cleansed as described in detail in the acconpanying paper and in brief within the following section.

Data Cleansing

Before publication, both possibly identifying information and security relevant attributes need to be obfuscated or deleted. Moreover, multi-party data (especially messages with external entities) must be deleted. If data is obfuscated, we made sure to substitute multiple occurances of the same information with the same replacement.
We provide a list of deleted and obfuscated items, the obfuscation scheme and, if applicable, the replacement.

The list of obfuscated items looks like the following example:

path	filetype	filename	attribute	scheme	replacement
linkedin\Linkedin_Basic	csv	messages.csv	TO	semantic description	Firstname Lastname
gooogle\Meine Aktivitäten\Datenexport	html	MeineAktivitäten.html	IP Address	loopback	127.142.201.194
facebook\personal_information	json	profile_information.json	emails	semantic description	firstname.lastname@gmail.com

Data Characterization

To give you an overview of the dataset, we publicly provide some meta-data about the usage time and SARP characteristics of exports from subject A/ subject B.

provider	usage time (in month)	export options	file types	# subfolders	# files	export size
Amazon	2/4	all categories	CSV (32/49) EML (2/5) JPEG (1/2) JSON (3/3) PDF (9/10) TXT (4/4)	41/49	51/73	1.2 MB / 1.4 MB
Apple	2/4	all data max. 1 GB/ max. 4 GB	CSV (8/3)	20/1	8/3	71.8 KB / 294.8 KB
Facebook	2/4	all data JSON/HTML on my computer	JSON (39/0) HTML (0/63) TXT (29/28) JPG (0/4) PNG (1/15) GIF (7/7)	45/76	76/117	12.3 MB / 13.5 MB
Google	2/4	all data frequency once ZIP max. 4 GB	HTML (8/11) CSV (10/13) JSON (27/28) TXT (14/14) PDF (1/1) MBOX (1/1) VCF (1/0) ICS (1/0) README (1/1) JPG (0/2)	44/51	64/71	1.54 MB /1.2 MB
LinkedIn	2/4	all data	CSV (18/21)	0/0 (part 1/2) 0/0 (part 1/2)	13/18 19/21	3.9 KB / 6.0 KB 6.2 KB / 9.2 KB

Authors

This data collection was performed by Daniela Pöhn (Universität der Bundeswehr München, Germany), Frank Pallas and Nicola Leschke (Paris Lodron Universität Salzburg, Austria). For questions, please contact nicola.leschke@plus.ac.at.

Accompanying Paper

The dataset was collected according to the method presented in:
Leschke, Pöhn, and Pallas (2024). "How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages". Accepted for Annual Privacy Forum 2024.

f
Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s001
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
Vehicle CO2 Emissions Dataset
kaggle.com
Updated Nov 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Batuhan Şahan (2024). Vehicle CO2 Emissions Dataset [Dataset]. https://www.kaggle.com/datasets/brsahan/vehicle-co2-emissions-dataset/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 20, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Batuhan Şahan
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains information on vehicle specifications, fuel consumption, and CO2 emissions, collected to analyze the environmental impact of vehicles and predict their CO2 emissions using regression models. The dataset is structured to support both Simple Linear Regression (SLR) and Multiple Linear Regression (MLR) approaches for machine learning projects.

Key Features Brand: The brand or manufacturer of the vehicle (e.g., Toyota, Ford, BMW). Vehicle Type: Classification of vehicles based on size and usage (e.g., SUV, Sedan). Engine Size (L): Engine displacement volume in liters. Cylinders: Number of cylinders in the engine. Transmission: Type of transmission (e.g., Automatic, Manual). Fuel Type: Type of fuel used by the vehicle (e.g., Gasoline, Diesel, Hybrid). Fuel Consumption (City, Hwy, and Combined): Fuel efficiency measured in liters per 100 kilometers (L/100 km). CO2 Emissions (g/km): Carbon dioxide emissions per kilometer (target variable for prediction). Use Cases Exploratory Data Analysis (EDA): Clean and analyze the dataset to understand trends in CO2 emissions based on engine size, fuel type, and vehicle class. Simple Linear Regression (SLR): Use Engine Size (L) to predict CO2 Emissions (g/km). Multiple Linear Regression (MLR): Incorporate additional features like Fuel Consumption and Cylinders to create more accurate prediction models. Objective This dataset is designed to:

Help understand the relationship between vehicle specifications and their environmental impact. Enable the application of regression models for predicting CO2 emissions. Explore the impact of fuel efficiency and engine size on carbon emissions. Why This Dataset? Environmental concerns are a critical issue, and analyzing CO2 emissions is essential to mitigate climate change. It offers a practical application for machine learning students and professionals to develop predictive models. Provides an opportunity to practice EDA, data cleaning, and regression modeling techniques. Dataset Highlights Suitable for both beginners and advanced practitioners in data science. Provides a hands-on opportunity to work on real-world data. Perfect for showcasing machine learning skills in regression analysis. Acknowledgements This dataset is provided for educational purposes and is not intended for commercial use. If used in research or publications, please provide proper citation.
R
Sam2.1_l Yolo11x United Cleaning Data Version 2 Dataset
universe.roboflow.com
zip
Updated Jun 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Final Project (2025). Sam2.1_l Yolo11x United Cleaning Data Version 2 Dataset [Dataset]. https://universe.roboflow.com/final-project-mn2p5/sam2.1_l-yolo11x-united-cleaning-data-version-2
Explore at:
zipAvailable download formats
Dataset updated
Jun 2, 2025
Dataset authored and provided by
Final Project
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Car Person Bus Cyclist CGmu 3CqK QRcR Polygons
Description
Sam2.1_l Yolo11x United Cleaning Data Version 2

## Overview Sam2.1_l Yolo11x United Cleaning Data Version 2 is a dataset for instance segmentation tasks - it contains Car Person Bus Cyclist CGmu 3CqK QRcR annotations for 856 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
h
NSText2SQL
huggingface.co
opendatalab.com
Updated Feb 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NumbersStation (2024). NSText2SQL [Dataset]. https://huggingface.co/datasets/NumbersStation/NSText2SQL
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 23, 2024
Dataset authored and provided by
NumbersStation
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Summary

NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissable licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs. For more… See the full description on the dataset page: https://huggingface.co/datasets/NumbersStation/NSText2SQL.
R
Data_cleaning_val1 Dataset
universe.roboflow.com
zip
Updated Jul 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
prohibitedItemsdataCleainingval (2022). Data_cleaning_val1 Dataset [Dataset]. https://universe.roboflow.com/prohibiteditemsdatacleainingval/data_cleaning_val1
Explore at:
zipAvailable download formats
Dataset updated
Jul 12, 2022
Dataset authored and provided by
prohibitedItemsdataCleainingval
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Knife Bounding Boxes
Description
Data_cleaning_val1

## Overview Data_cleaning_val1 is a dataset for object detection tasks - it contains Knife annotations for 2,347 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
d
TagX Data collection for AI/ ML training | LLM data | Data collection for AI...
datarade.ai
.json, .csv, .xls
Updated Jun 18, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
Explore at:
.json, .csv, .xlsAvailable download formats
Dataset updated
Jun 18, 2021
Dataset authored and provided by
TagX
Area covered
Qatar, Belize, Russian Federation, Benin, Djibouti, Iceland, Saudi Arabia, Antigua and Barbuda, Equatorial Guinea, Colombia
Description
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
c
Rotten Tomatoes Movie Dataset – Clean Movie Metadata
crawlfeeds.com
csv, zip
Updated Jul 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Rotten Tomatoes Movie Dataset – Clean Movie Metadata [Dataset]. https://crawlfeeds.com/datasets/rotten-tomatoes-movie-dataset-clean-movie-metadata
Explore at:
csv, zipAvailable download formats
Dataset updated
Jul 17, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
We provide a high-quality Rotten Tomatoes movie dataset that includes key metadata for thousands of movies. This dataset is ideal for anyone working with movie-related platforms, entertainment analytics, content curation, or movie discovery tools.

Our collection is structured, clean, and designed to support real-time apps, dashboards, and research use cases.

What the Dataset Includes

Each record in the dataset contains core information pulled directly from Rotten Tomatoes, including:

Movie Name – The official title of the movie.

Poster URL – High-resolution image link to the movie poster.

Trailer URL – Direct link to the official trailer (when available).

Genre – One or more genres associated with the movie, such as Action, Drama, Comedy, or Horror.

Release Date – The date the movie was released to the public.

Actors – Main cast members listed on Rotten Tomatoes.

Directors – Director(s) responsible for the movie.

Rating – Audience or critic scores, where available.

Broad Coverage

This dataset spans a wide range of movies across all major genres and decades. From modern releases to timeless classics, from Hollywood blockbusters to independent films — we’ve included movies of all types with relevant data points.

You can expect data on:

U.S. theatrical releases

Netflix, Amazon, and other streaming exclusives

Festival films and limited releases

Animated and documentary films

Use Cases

Here are just a few ways this dataset can be useful:

Movie Recommendation Engines – Use metadata and genre info to power personalized movie suggestions.

Entertainment Search Tools – Build searchable movie listings with visual poster previews and trailer links.

Data Visualization Projects – Create dashboards showing trends by genre, release periods, or actor participation.

AI/ML Training – Use metadata to train classification models or sentiment prediction tools.

Research & Academic Use – Analyze patterns in movie releases, cast dynamics, and genre evolution.

Why Use Our Dataset?

Clean & ready-to-use: No raw HTML, just clean structured data.

Minimal but meaningful fields: Focused on useful movie attributes without clutter.

Updated info: Covers both classic and current titles.

Simple integration: Easy to use for developers, analysts, and product teams.

If you're working on a movie-based product or looking for reliable film metadata for your project, this dataset offers an ideal foundation.

Let us know if you’d like to explore it further.
d
Job Postings Dataset for Labour Market Research and Insights
datarade.ai
Updated Sep 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2023). Job Postings Dataset for Labour Market Research and Insights [Dataset]. https://datarade.ai/data-products/job-postings-dataset-for-labour-market-research-and-insights-oxylabs
Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Sep 20, 2023
Dataset authored and provided by
Oxylabs
Area covered
Togo, Jamaica, Kyrgyzstan, Sierra Leone, Luxembourg, Zambia, Tajikistan, Anguilla, Switzerland, British Indian Ocean Territory
Description
Introducing Job Posting Datasets: Uncover labor market insights!

Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.

Job Posting Datasets Source:

Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.

Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.

StackShare: Access StackShare datasets to make data-driven technology decisions.

Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.

Choose your preferred dataset delivery options for convenience:

Receive datasets in various formats, including CSV, JSON, and more. Opt for storage solutions such as AWS S3, Google Cloud Storage, and more. Customize data delivery frequencies, whether one-time or per your agreed schedule.

Why Choose Oxylabs Job Posting Datasets:

Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.

Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.

Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.

Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.

Facebook

Twitter

Click to copy link

Link copied

Cite

Dataintelo (2025). Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-cleaning-tools-market

Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033

Explore at:

pptx, pdf, csvAvailable download formats

Dataset updated

Jan 7, 2025

Dataset authored and provided by

Dataintelo

License

https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

Time period covered

2024 - 2032

Area covered

Global

Description

Data Cleaning Tools Market Outlook

As of 2023, the global market size for data cleaning tools is estimated at $2.5 billion, with projections indicating that it will reach approximately $7.1 billion by 2032, reflecting a robust CAGR of 12.1% during the forecast period. This growth is primarily driven by the increasing importance of data quality in business intelligence and analytics workflows across various industries.

The growth of the data cleaning tools market can be attributed to several critical factors. Firstly, the exponential increase in data generation across industries necessitates efficient tools to manage data quality. Poor data quality can result in significant financial losses, inefficient business processes, and faulty decision-making. Organizations recognize the value of clean, accurate data in driving business insights and operational efficiency, thereby propelling the adoption of data cleaning tools. Additionally, regulatory requirements and compliance standards also push companies to maintain high data quality standards, further driving market growth.

Another significant growth factor is the rising adoption of AI and machine learning technologies. These advanced technologies rely heavily on high-quality data to deliver accurate results. Data cleaning tools play a crucial role in preparing datasets for AI and machine learning models, ensuring that the data is free from errors, inconsistencies, and redundancies. This surge in the use of AI and machine learning across various sectors like healthcare, finance, and retail is driving the demand for efficient data cleaning solutions.

The proliferation of big data analytics is another critical factor contributing to market growth. Big data analytics enables organizations to uncover hidden patterns, correlations, and insights from large datasets. However, the effectiveness of big data analytics is contingent upon the quality of the data being analyzed. Data cleaning tools help in sanitizing large datasets, making them suitable for analysis and thus enhancing the accuracy and reliability of analytics outcomes. This trend is expected to continue, fueling the demand for data cleaning tools.

In terms of regional growth, North America holds a dominant position in the data cleaning tools market. The region's strong technological infrastructure, coupled with the presence of major market players and a high adoption rate of advanced data management solutions, contributes to its leadership. However, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period. The rapid digitization of businesses, increasing investments in IT infrastructure, and a growing focus on data-driven decision-making are key factors driving the market in this region.

As organizations strive to maintain high data quality standards, the role of an Email List Cleaning Service becomes increasingly vital. These services ensure that email databases are free from invalid addresses, duplicates, and outdated information, thereby enhancing the effectiveness of marketing campaigns and communications. By leveraging sophisticated algorithms and validation techniques, email list cleaning services help businesses improve their email deliverability rates and reduce the risk of being flagged as spam. This not only optimizes marketing efforts but also protects the reputation of the sender. As a result, the demand for such services is expected to grow alongside the broader data cleaning tools market, as companies recognize the importance of maintaining clean and accurate contact lists.

Component Analysis

The data cleaning tools market can be segmented by component into software and services. The software segment encompasses various tools and platforms designed for data cleaning, while the services segment includes consultancy, implementation, and maintenance services provided by vendors.

The software segment holds the largest market share and is expected to continue leading during the forecast period. This dominance can be attributed to the increasing adoption of automated data cleaning solutions that offer high efficiency and accuracy. These software solutions are equipped with advanced algorithms and functionalities that can handle large volumes of data, identify errors, and correct them without manual intervention. The rising adoption of cloud-based data cleaning software further bolsters this segment, as it offers scalability and ease of

Clear search

Close search

Google apps

Main menu

Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033

Data Cleaning Tools Market Outlook

Component Analysis

Restaurant Sales-Dirty Data for Cleaning Training

Restaurant Sales Dataset with Dirt Documentation

Overview

Dataset Use Cases

Columns Description

Key Characteristics

Cleaning Suggestions

Menu Map with Prices and Categories

Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

Data Cleansing Tools Market Report | Global Forecast From 2025 To 2033

Data Cleansing Tools Market Outlook

Component Analysis

Company Datasets for Business Profiling

Data from: COVID-19 Case Surveillance Public Use Data with Geography

Data are Considered Provisional

Data Limitations

Data Quality Assurance Procedures

Data Suppression

Additional COVID-19 Data

LSC (Leicester Scientific Corpus)

Reddit r/AskScience Flair Dataset

Data and tools for studying isograms

Alpaca Cleaned

Alpaca Cleaned

Improving Pretrained Language Model Understanding

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

Q&A for Admission of Higher Education Institution

Dataset Question Answering for Admission of Higher Education Institution

Description

Categories:

Acknowledgements & Source

A dataset of Data Subject Access Request Packages

Overview

Purpose and Usage

Data Generation

Data Cleansing

Data Characterization

Authors

Accompanying Paper

Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...

Vehicle CO2 Emissions Dataset

Sam2.1_l Yolo11x United Cleaning Data Version 2 Dataset

Sam2.1_l Yolo11x United Cleaning Data Version 2

NSText2SQL

Data_cleaning_val1 Dataset

Data_cleaning_val1

TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

Rotten Tomatoes Movie Dataset – Clean Movie Metadata

What the Dataset Includes

Broad Coverage

Use Cases

Why Use Our Dataset?

Job Postings Dataset for Labour Market Research and Insights

Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033

Data Cleaning Tools Market Outlook

Component Analysis