63 datasets found
  1. Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data [Dataset]. http://doi.org/10.1371/journal.pone.0228154
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard: returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
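    The paper's algorithm combines pre-defined cut-offs with logic rules that correct rather than delete values where possible. As a hedged illustration of that idea only (not the published algorithm), a minimal pandas sketch might look like the following; the column names (id, age_weeks, weight_kg) and the cut-off value are assumptions.

    ```python
    import pandas as pd

    MAX_ABS_CHANGE_PER_WEEK = 2.0  # illustrative cut-off in kg/week, not the paper's value

    def clean_growth(df: pd.DataFrame) -> pd.DataFrame:
        """Correct likely decimal-shift errors, then drop implausible jumps."""
        df = df.sort_values(["id", "age_weeks"]).copy()

        def per_animal(g: pd.DataFrame) -> pd.DataFrame:
            g = g.copy()
            median_w = g["weight_kg"].median()
            # Logic rule: a value roughly 10x the animal's median is treated as a
            # decimal-shift entry error and corrected rather than deleted.
            tenfold = (g["weight_kg"] / median_w).between(8, 12)
            g.loc[tenfold, "weight_kg"] = g.loc[tenfold, "weight_kg"] / 10
            # Cut-off rule: drop measurements whose change per week relative to the
            # previous record exceeds the pre-defined cut-off.
            dw = g["weight_kg"].diff().abs()
            dt = g["age_weeks"].diff()
            rate = dw / dt.where(dt > 0)  # NaN for the first record or zero gaps
            return g[rate.isna() | (rate <= MAX_ABS_CHANGE_PER_WEEK)]

        return df.groupby("id", group_keys=False).apply(per_animal)
    ```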

  2. Household Survey on Information and Communications Technology 2023 - West Bank and Gaza

    • pcbs.gov.ps
    Updated Feb 19, 2025
    + more versions
    Palestinian Central Bureau of Statistics (2025). Household Survey on Information and Communications Technology 2023 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/733
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (https://pcbs.gov/)
    Time period covered
    2023 - 2024
    Area covered
    West Bank, Gaza Strip, Gaza
    Description

    Abstract

    Palestinian society's access to information and communication technology tools is one of the main inputs for achieving social development and economic change, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics in providing official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey on information and communications technology for the year 2023. The main objective of this report is to present the trends of accessing and using information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.

    Geographic coverage

    Palestine, West Bank, Gaza strip

    Analysis unit

    Household, Individual

    Universe

    All Palestinian households and individuals (10 years and above) whose usual place of residence in 2023 was in the state of Palestine.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling Frame: The sampling frame consists of the master sample, which was enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.

    Sample Size: The sample size is 8,040 households.

    Sampling Design: The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages: Stage 1: selection of a stratified sample of 536 enumeration areas using the PPS method. Stage 2: selection of a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage 3: selection of one person aged 10 years and above at random using Kish tables.

    Sample Strata: The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas); 2- Type of Locality (urban, rural, camps).
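    A minimal sketch of how a first-stage systematic PPS (probability proportional to size) selection might look is given below. The frame columns and the systematic implementation are assumptions for illustration, not PCBS's actual selection procedure; in practice the selection would be run stratum by stratum.

    ```python
    import numpy as np
    import pandas as pd

    def pps_systematic(frame: pd.DataFrame, n_psu: int,
                       size_col: str = "households", seed: int = 1) -> pd.DataFrame:
        """Systematic PPS selection of primary sampling units from a listed frame."""
        rng = np.random.default_rng(seed)
        cum = frame[size_col].cumsum().to_numpy()
        interval = cum[-1] / n_psu          # sampling interval in cumulative household counts
        points = rng.uniform(0, interval) + interval * np.arange(n_psu)
        # For each selection point, take the first enumeration area whose cumulative
        # size reaches that point (assumes no single EA exceeds the interval).
        return frame.iloc[np.searchsorted(cum, points)]

    # e.g. first_stage = pps_systematic(stratum_frame, n_psu=...)  # 536 EAs over all strata
    ```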

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Questionnaire: The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.

    Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

    Section III: Data on Individuals (10 years and above) about computer use, access to the Internet, possession of a mobile phone, information threats, and E-commerce.

    Cleaning operations

    Field Editing and Supervising

    • Data collection and coordination were carried out in the field according to the pre-prepared plan, where instructions, models and tools were available for fieldwork.
    • The audit process on the PC-tablet was carried out by building all of the automated and office rules into the program, to cover all the required controls according to the specified criteria.
    • Because of the privacy of Jerusalem (J1), data there were collected on a paper questionnaire; the supervisor then verified each questionnaire in a formal and technical manner according to the pre-prepared audit rules.
    • Fieldwork visits were carried out by the project coordinator, supervisors and project management to check edited questionnaires and the performance of fieldworkers.

    Data Processing

    Programming Consistency Check: The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once by project management before the training course was conducted, and the notes and modifications were reflected in the program by the Data Processing Department, ensuring that it was free of errors before going to the field.

    Using PC-tablet devices reduced the data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.

    In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.

    Data Cleaning: After the completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS to identify and correct errors and discrepancies, producing clean and accurate data ready for tabulation and publishing.
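    A minimal sketch of the kind of range and consistency rules described above, written in pandas rather than SPSS; the field names and rules are hypothetical illustrations, not the actual audit rules used by PCBS.

    ```python
    import pandas as pd

    def audit(df: pd.DataFrame) -> pd.DataFrame:
        """Flag records that violate simple range and cross-field consistency rules."""
        flags = pd.DataFrame(index=df.index)
        # Range check: individual-level questions apply to persons aged 10 years and above.
        flags["age_out_of_range"] = df["age"] < 10
        # Consistency check: someone who reports never using the Internet should not
        # also report making an online purchase.
        flags["ecommerce_without_internet"] = (df["used_internet"] == "No") & (
            df["made_online_purchase"] == "Yes"
        )
        out = df.copy()
        out["needs_review"] = flags.any(axis=1)
        return out
    ```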

    Response rate

    The response rate reached 83.7%.

    Sampling error estimates

    Sampling Errors: Data from this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators; there is no problem in disseminating results at the national level and at the level of the West Bank and Gaza Strip.

    Non-Sampling Errors: Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, with practical and theoretical training during the training course.

    The survey encountered non-response; the case of a household not being present at home during the fieldwork visit accounted for the highest percentage of non-response cases. The total non-response rate reached 16.3%.

  3. AI in Data Cleaning Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Jul 24, 2025
    Research Intelo (2025). AI in Data Cleaning Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-data-cleaning-market
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jul 24, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    AI in Data Cleaning Market Outlook



    According to our latest research, the global AI in Data Cleaning market size reached USD 1.82 billion in 2024, demonstrating remarkable momentum driven by the exponential growth of data-driven enterprises. The market is projected to grow at a CAGR of 28.1% from 2025 to 2033, reaching an estimated USD 17.73 billion by 2033. This exceptional growth trajectory is primarily fueled by increasing data volumes, the urgent need for high-quality datasets, and the adoption of artificial intelligence technologies across diverse industries.



    The surging demand for automated data management solutions remains a key growth driver for the AI in Data Cleaning market. As organizations generate and collect massive volumes of structured and unstructured data, manual data cleaning processes have become insufficient, error-prone, and costly. AI-powered data cleaning tools address these challenges by leveraging machine learning algorithms, natural language processing, and pattern recognition to efficiently identify, correct, and eliminate inconsistencies, duplicates, and inaccuracies. This automation not only enhances data quality but also significantly reduces operational costs and improves decision-making capabilities, making AI-based solutions indispensable for enterprises aiming to achieve digital transformation and maintain a competitive edge.



    Another crucial factor propelling market expansion is the growing emphasis on regulatory compliance and data governance. Sectors such as BFSI, healthcare, and government are subject to stringent data privacy and accuracy regulations, including GDPR, HIPAA, and CCPA. AI in data cleaning enables these industries to ensure data integrity, minimize compliance risks, and maintain audit trails, thereby safeguarding sensitive information and building stakeholder trust. Furthermore, the proliferation of cloud computing and advanced analytics platforms has made AI-powered data cleaning solutions more accessible, scalable, and cost-effective, further accelerating adoption across small, medium, and large enterprises.



    The increasing integration of AI in data cleaning with other emerging technologies such as big data analytics, IoT, and robotic process automation (RPA) is unlocking new avenues for market growth. By embedding AI-driven data cleaning processes into end-to-end data pipelines, organizations can streamline data preparation, enable real-time analytics, and support advanced use cases like predictive modeling and personalized customer experiences. Strategic partnerships, investments in R&D, and the rise of specialized AI startups are also catalyzing innovation in this space, making AI in data cleaning a cornerstone of the broader data management ecosystem.



    From a regional perspective, North America continues to lead the global AI in Data Cleaning market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The region’s dominance is attributed to the presence of major technology vendors, robust digital infrastructure, and high adoption rates of AI and cloud technologies. Meanwhile, Asia Pacific is witnessing the fastest growth, propelled by rapid digitalization, expanding IT sectors, and increasing investments in AI-driven solutions by enterprises in China, India, and Southeast Asia. Europe remains a significant market, supported by strict data protection regulations and a mature enterprise landscape. Latin America and the Middle East & Africa are emerging as promising markets, albeit at a relatively nascent stage, with growing awareness and gradual adoption of AI-powered data cleaning solutions.



    Component Analysis



    The AI in Data Cleaning market is broadly segmented by component into software and services, with each segment playing a pivotal role in shaping the industry’s evolution. The software segment dominates the market, driven by the rapid adoption of advanced AI-based data cleaning platforms that automate complex data preparation tasks. These platforms leverage sophisticated algorithms to detect anomalies, standardize formats, and enrich datasets, thereby enabling organizations to maintain high-quality data repositories. The increasing demand for self-service data cleaning software, which empowers business users to cleanse data without extensive IT intervention, is further fueling growth in this segment. Vendors are continuously enhancing their offerings with intuitive interfaces, integration capabilities, and support for diverse data sources to cater to a wide r

  4. Household Income and Expenditure 2010 - Tuvalu

    • catalog.ihsn.org
    Updated Mar 29, 2019
    Central Statistics Division (2019). Household Income and Expenditure 2010 - Tuvalu [Dataset]. http://catalog.ihsn.org/catalog/3203
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Central Statistics Division
    Time period covered
    2010
    Area covered
    Tuvalu
    Description

    Abstract

    The main objectives of the survey were:
    • To obtain weights for the revision of the Consumer Price Index (CPI) for Funafuti;
    • To provide information on the nature and distribution of household income, expenditure and food consumption patterns;
    • To provide data on the household sector's contribution to the National Accounts;
    • To provide information on economic activity of men and women to study gender issues;
    • To undertake some poverty analysis.

    Geographic coverage

    National, including Funafuti and Outer islands

    Analysis unit

    • Household
    • individual

    Universe

    All private households are included in the sampling frame. In each selected household, the current residents are surveyed, as well as people who are usual residents but are currently away (for work, health or holiday reasons, or boarding students, for example). If the household has been residing in Tuvalu for less than one year:
    • if they intend to reside more than 12 months => the household is included
    • if they do not intend to reside more than 12 months => out of scope

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    It was decided that a 33% (one-third) sample was sufficient to achieve suitable levels of accuracy for key estimates in the survey, so the sample selection was spread proportionally across all the islands except Niulakita, which was considered too small. For selection purposes, each island was treated as a separate stratum and independent samples were selected from each. The strategy used was to list each dwelling on the island by its geographical position and run a systematic skip through the list to achieve the 33% sample. This approach ensured that the sample would be spread out across each island as much as possible and thus be more representative.

    For details please refer to Table 1.1 of the Report.

    Sampling deviation

    Only the island of Niulakita was not included in the sampling frame, as it was considered too small.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    There were three main survey forms used to collect data for the survey. Each question is written in English and translated into Tuvaluan on the same version of the questionnaire. The questionnaires were designed based on the 2004 survey questionnaire.

    HOUSEHOLD FORM
    • composition of the household and demographic profile of each member
    • dwelling information
    • dwelling expenditure
    • transport expenditure
    • education expenditure
    • health expenditure
    • land and property expenditure
    • household furnishings
    • home appliances
    • cultural and social payments
    • holidays/travel costs
    • loans and savings
    • clothing
    • other major expenditure items

    INDIVIDUAL FORM
    • health and education
    • labour force (individuals aged 15 and above)
    • employment activity and income (individuals aged 15 and above): wages and salaries, working own business, agriculture and livestock, fishing, income from handicraft, income from gambling, small-scale activities, jobs in the last 12 months, other income, children's income, tobacco and alcohol use, other activities, and seafarer

    DIARY (one diary per week over a 2-week period; 2 diaries per household were required)
    • all kinds of expenses
    • home production
    • food and drink (eaten by the household, given away, sold)
    • goods taken from own business (consumed, given away)
    • monetary gifts (given away, received, winnings from gambling)
    • non-monetary gifts (given away, received, winnings from gambling)

    Questionnaire Design Flaws Questionnaire design flaws address any problems with the way questions were worded which will result in an incorrect answer provided by the respondent. Despite every effort to minimize this problem during the design of the respective survey questionnaires and the diaries, problems were still identified during the analysis of the data. Some examples are provided below:

    Gifts, Remittances & Donations: Collecting information on the receipt and provision of gifts, the receipt and provision of remittances, and the provision of donations to the church, other communities and family occasions is a very difficult task in a HIES. The extent of these activities in Tuvalu is very high, so every effort should be made to address these activities as well as possible. A key problem lies in identifying the best form (questionnaire or diary) for covering such activities. A general rule of thumb for a HIES is that if the activity occurs on a regular basis, and involves the exchange of small monetary amounts or in-kind gifts, the diary is more appropriate. On the other hand, if the activity is less frequent, and involves larger sums of money, the questionnaire with a recall approach is preferred. It is not always easy to distinguish between the two for the different activities, and as such, both the diary and questionnaire were used to collect this information. Unfortunately it probably wasn't made clear enough as to what types of transactions were being collected from the different sources, and as such some transactions might have been missed, and others counted twice. The effects of these problems are hopefully minimal overall.

    Defining Remittances: Because people have different interpretations of what constitutes remittances, the questionnaire needs to be very clear as to how this concept is defined in the survey. Unfortunately this wasn't explained clearly enough, so it was difficult to distinguish between a remittance, which should be of a more regular nature, and a one-off monetary gift transferred between two households.

    Business Expenses Still Recorded The aim of the survey is to measure "household" expenditure, and as such, any expenditure made by a household for an item or service which was primarily used for a business activity should be excluded. It was not always clear in the questionnaire that this was the case, and as such some business expenses were included. Efforts were made during data cleaning to remove any such business expenses which would impact significantly on survey results.

    Purchased goods given away as a gift: When a household makes a gift donation of an item it has purchased, this is recorded in section 5 of the diary. Unfortunately it was difficult to know how to treat these items, as it was not clear whether the item had already been recorded in section 1 of the diary, which covers purchases. The decision was made to exclude all information on gifts given that were considered to be purchases, as these items were assumed to have already been recorded in section 1. Ideally these items should be treated as a purchased gift given away, which in turn is not household consumption expenditure, but this was not possible.

    Some key items missed in the Questionnaire: Although not a big issue, some key expenditure items were omitted from the questionnaire when it would have been best to collect them via this schedule. A key example is electric fans, which many households in Tuvalu own.

    Cleaning operations

    Consistency of the data:
    • each questionnaire was checked by the supervisor during and after the collection
    • before data entry, all the questionnaires were coded
    • the CSPro data entry system included inconsistency checks which allowed the NSO staff to identify some errors and to correct them with imputation estimates from their own knowledge (no time for double entry); there were 4 data entry operators
    • after data entry, outliers were identified in order to check their consistency

    All data entry, including editing, edit checks and queries, was done using CSPro (Census and Survey Processing System), with additional data editing and cleaning taking place in Excel.

    The staff from the CSD was responsible for undertaking the coding and data entry, with assistance from an additional four temporary staff to help produce results in a more timely manner.

    Although enumeration was not completed until mid-June, the coding and data entry commenced as soon as forms were available from Funafuti, which was towards the end of March. The coding and data entry were then completed around the middle of July.

    A visit from an SPC consultant then took place to undertake initial cleaning of the data, primarily addressing missing data items and missing schedules. Once the initial data cleaning was undertaken in CSPro, data was transferred to Excel where it was closely scrutinized to check that all responses were sensible. In the cases where unusual values were identified, original forms were consulted for these households and modifications made to the data if required.
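    The report does not state which rule was used to shortlist unusual values; as a hedged illustration only, an interquartile-range screen of the kind commonly used for this step might look like the sketch below, with the column names assumed.

    ```python
    import pandas as pd

    def flag_unusual_amounts(diary: pd.DataFrame, group_col: str = "item_code",
                             value_col: str = "amount") -> pd.DataFrame:
        """Flag diary amounts far outside the interquartile range for their item."""
        def per_item(g: pd.DataFrame) -> pd.DataFrame:
            q1, q3 = g[value_col].quantile([0.25, 0.75])
            iqr = q3 - q1
            g = g.copy()
            g["unusual"] = (g[value_col] < q1 - 3 * iqr) | (g[value_col] > q3 + 3 * iqr)
            return g
        return diary.groupby(group_col, group_keys=False).apply(per_item)

    # Rows with unusual == True would then be checked against the original paper forms.
    ```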

    Despite the best efforts being made to clean the data file in preparation for the analysis, no doubt errors will still exist in the data, due to its size and complexity. Having said this, they are not expected to have significant impacts on the survey results.

    Under-Reporting and Incorrect Reporting as a Result of Poor Fieldwork Procedures: The most crucial stage of any survey activity, whether it be a population census or a survey such as a HIES, is the fieldwork. It is crucial for intense checking to take place in the field before survey forms are returned to the office for data processing. Unfortunately, it became evident during the cleaning of the data that fieldwork wasn't checked as thoroughly as required, and as such some unexpected values appeared in the questionnaires, as well as unusual results in the diaries. Efforts were made to identify the main issues which would have the greatest impact on final results, and this information was modified to a more reasonable answer, using local knowledge, when required.

    Data Entry Errors Data entry errors are always expected, but can be kept to a minimum with

  5. Nashville Housing Data Cleaning Project

    • kaggle.com
    zip
    Updated Aug 20, 2024
    Ahmed Elhelbawy (2024). Nashville Housing Data Cleaning Project [Dataset]. https://www.kaggle.com/datasets/elhelbawylogin/nashville-housing-data-cleaning-project/discussion
    Available download formats: zip (1282 bytes)
    Dataset updated
    Aug 20, 2024
    Authors
    Ahmed Elhelbawy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Nashville
    Description

    Project Overview : This project demonstrates a thorough data cleaning process for the Nashville Housing dataset using SQL. The script performs various data cleaning and transformation operations to improve the quality and usability of the data for further analysis.

    Technologies Used : SQL Server T-SQL

    Dataset: The project uses the Nashville Housing dataset, which contains information about property sales in Nashville, Tennessee. The original dataset includes various fields such as property addresses, sale dates, sale prices, and other relevant real estate information.

    Data Cleaning Operations: The script performs the following data cleaning operations:

    • Date Standardization: Converts the SaleDate column to a standard Date format for consistency and easier manipulation.
    • Populating Missing Property Addresses: Fills in NULL values in the PropertyAddress field using data from other records with the same ParcelID.
    • Breaking Down Address Components: Separates the PropertyAddress and OwnerAddress fields into individual columns for Address, City, and State, improving data granularity and queryability.
    • Standardizing Values: Converts 'Y' and 'N' values to 'Yes' and 'No' in the SoldAsVacant field for clarity and consistency.
    • Removing Duplicates: Identifies and removes duplicate records based on specific criteria to ensure data integrity.
    • Dropping Unused Columns: Removes unnecessary columns to streamline the dataset.
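    The project itself is written in T-SQL; as a rough pandas analogue of the steps listed above (not the author's script), the cleaning might look like the sketch below. Column names follow the description (SaleDate, ParcelID, PropertyAddress, SoldAsVacant); the duplicate key and the split logic are assumptions.

    ```python
    import pandas as pd

    def clean_nashville(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        # Date standardization
        df["SaleDate"] = pd.to_datetime(df["SaleDate"]).dt.date
        # Populate missing property addresses from other rows sharing the same ParcelID
        df["PropertyAddress"] = df.groupby("ParcelID")["PropertyAddress"].transform(
            lambda s: s.ffill().bfill()
        )
        # Break the property address into street and city components
        parts = df["PropertyAddress"].str.split(",", n=1, expand=True).reindex(columns=[0, 1])
        df["PropertySplitAddress"] = parts[0].str.strip()
        df["PropertySplitCity"] = parts[1].str.strip()
        # Standardize Y/N to Yes/No
        df["SoldAsVacant"] = df["SoldAsVacant"].replace({"Y": "Yes", "N": "No"})
        # Remove duplicates (the real script uses ROW_NUMBER() over a specific partition;
        # the key used here is assumed)
        return df.drop_duplicates(subset=["ParcelID", "PropertyAddress", "SaleDate"])
    ```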

    Key SQL Techniques Demonstrated :

    • Data type conversion
    • Self joins for data population
    • String manipulation (SUBSTRING, CHARINDEX, PARSENAME)
    • CASE statements
    • Window functions (ROW_NUMBER)
    • Common Table Expressions (CTEs)
    • Data deletion
    • Table alterations (adding and dropping columns)

    Important Notes :

    • The script includes cautionary comments about data deletion and column dropping, emphasizing the importance of careful consideration in a production environment.
    • This project showcases various SQL data cleaning techniques and can serve as a template for similar data cleaning tasks.

    Potential Improvements :

    • Implement error handling and transaction management for more robust execution.
    • Add data validation steps to ensure the cleaned data meets specific criteria.
    • Consider creating indexes on frequently queried columns for performance optimization.

  6. Vendor Master Data Cleansing Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Dataintelo (2025). Vendor Master Data Cleansing Market Research Report 2033 [Dataset]. https://dataintelo.com/report/vendor-master-data-cleansing-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Vendor Master Data Cleansing Market Outlook



    According to our latest research, the global Vendor Master Data Cleansing market size reached USD 1.42 billion in 2024, with a robust compound annual growth rate (CAGR) of 13.2% projected through the forecast period. By 2033, the market is expected to expand significantly, achieving a value of USD 4.13 billion. This growth is primarily fueled by the increasing need for accurate, consistent, and reliable vendor data across enterprises to support digital transformation and regulatory compliance initiatives. The rapid digitalization of procurement and supply chain processes, coupled with the mounting pressure to eliminate data redundancies and errors, is further propelling the adoption of vendor master data cleansing solutions worldwide.




    A key growth factor for the Vendor Master Data Cleansing market is the accelerating pace of digital transformation across industries. Organizations are increasingly investing in advanced data management solutions to enhance the quality of their vendor databases, which are critical for procurement efficiency, risk mitigation, and regulatory compliance. As businesses expand their supplier networks globally, maintaining accurate and up-to-date vendor information has become a strategic priority. Poor data quality can lead to duplicate payments, compliance risks, and operational inefficiencies, making data cleansing solutions indispensable. Furthermore, the proliferation of cloud-based Enterprise Resource Planning (ERP) and procurement platforms is amplifying the demand for seamless integration and automated data hygiene processes, contributing to the market’s sustained growth.




    Another significant driver is the evolving regulatory landscape, particularly in sectors such as BFSI, healthcare, and government, where stringent data governance and audit requirements prevail. Regulatory mandates like GDPR, SOX, and industry-specific compliance frameworks necessitate organizations to maintain clean, accurate, and auditable vendor records. Failure to comply can result in hefty penalties and reputational damage. Consequently, enterprises are prioritizing investments in vendor master data cleansing tools and services that offer automated validation, deduplication, and enrichment capabilities. These solutions not only ensure compliance but also empower organizations to derive actionable insights from their vendor data, optimize supplier relationships, and negotiate better terms.




    The rise of advanced technologies such as artificial intelligence (AI), machine learning (ML), and robotic process automation (RPA) is also reshaping the vendor master data cleansing landscape. Modern solutions leverage AI and ML algorithms to identify anomalies, detect duplicates, and standardize vendor data at scale. Automation is reducing manual intervention, minimizing errors, and accelerating the cleansing process, thereby delivering higher accuracy and cost efficiency. Moreover, the integration of data cleansing with analytics platforms enables organizations to unlock deeper insights into vendor performance, risk exposure, and procurement trends. As enterprises strive to become more data-driven, the adoption of intelligent vendor master data cleansing solutions is expected to surge, further fueling market expansion.




    From a regional perspective, North America currently dominates the Vendor Master Data Cleansing market, driven by early technology adoption, a mature enterprise landscape, and stringent regulatory requirements. Europe follows closely, with strong demand from industries such as manufacturing, healthcare, and finance. The Asia Pacific region is emerging as a high-growth market, fueled by rapid industrialization, expanding SME sector, and increasing investments in digital infrastructure. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as organizations in these regions recognize the value of data quality in enhancing operational efficiency and competitiveness. Overall, the global outlook for the vendor master data cleansing market remains highly positive, with strong growth prospects across all major regions.



    Component Analysis



    The Component segment of the Vendor Master Data Cleansing market is bifurcated into software and services, each playing a pivotal role in meeting the diverse needs of enterprises. The software segment is witnessing robust growth, driven by the increasing a

  7. Additional Data [Predict Students Performance]

    • kaggle.com
    zip
    Updated May 24, 2023
    Gleb Kazakov (2023). Additional Data [Predict Students Performance] [Dataset]. https://www.kaggle.com/datasets/glipko/additional-data-predict-students-performance
    Available download formats: zip (704709399 bytes)
    Dataset updated
    May 24, 2023
    Authors
    Gleb Kazakov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created for the competition "Predict Student Performance from Game Play", which aims to predict student performance during game-based learning in real time based on their game logs. The dataset's source raw data is available on the developers' site, which can be used as supplemental data. The idea for this dataset was discovered in this discussion.

    Generating Script

    To extract the data, I used my notebook.

    Content

    The dataset consists of two file types for each non-empty monthly dataset and its ID:
    1. Files with train data (_train suffix)
    2. Files with labels (_labels suffix)
    There are 20 monthly datasets available on the mentioned site.

    I tried to replicate the competition's data format as closely as possible, which involved:

    1. Creating only necessary columns
    2. Removing irrelevant data. For example, navigate_hover events and quiz logs, which are not present in the competition, were removed. However, if you find any inconsistencies in the dataset or in the generating script, please do share!

    I also added save codes, so you can find out whether players started from one of the saves. As far as I know, in the competition's dataset all players started from the beginning, so you may want to ignore players who use save codes.
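    A small, heavily hedged sketch of the filtering described above, in pandas; the column names (event_name, session_id, used_save_code) are hypothetical and would need to match the actual files.

    ```python
    import pandas as pd

    def prepare_supplement(logs: pd.DataFrame, competition_sessions: set) -> pd.DataFrame:
        """Drop event types absent from the competition and sessions that overlap with it."""
        keep = logs[logs["event_name"] != "navigate_hover"]          # hypothetical column name
        keep = keep[~keep["session_id"].isin(competition_sessions)]  # sessions already in the competition
        # Optionally ignore players who started from a save code rather than the beginning.
        if "used_save_code" in keep.columns:                         # hypothetical flag
            keep = keep[~keep["used_save_code"].astype(bool)]
        return keep
    ```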

    Game Quits

    One interesting aspect of the raw data is that it includes users who quit the game before it ended and may have stopped playing before completing a quiz. I only included users who passed at least the first quiz, which opens up possibilities to supplement data for the first level group, which has the fewest features.

    Implementing all the new logic with this dataset into pipelines may be difficult, and increasing train size may lead to memory errors. Additionally, some sessions are already present in the competition and must be ignored.

    Motivation

    I am sharing this dataset with the Kaggle community because I have university exams and do not have enough time to make the implementation myself. However, I believe that supplemental data with proper data cleaning techniques will greatly boost performance. Good luck!

  8. Error_Checking_Imported_Data

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 29, 2022
    Timothy Ebert (2022). Error_Checking_Imported_Data [Dataset]. http://doi.org/10.7910/DVN/IYYU1P
    Available download format: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Timothy Ebert
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The SAS command file checks EPG data for errors. It will always run. However, you will only get the correct output if there are no errors. The data sets "simple psyllid" and "simple aphid" have no errors. Errorchecker will return two tables. The first is a record of the waveforms. Check to make sure that all waveforms are correct. A number of common errors will show here. PLEASE note that Np is a different waveform from NP or nP or np. Also, "NP" is different from " NP" and "NP " or " NP ". I wrote the code to be insensitive to these conditions using the condense() function to eliminate spaces and the upcase() function to make all letters capitals. However, it is safer to correct the problem rather than relying on the program. The second table is a frequency table showing all the transitions and transitional probabilities. Check to make sure that all transitions are possible. Your data is clean if you get these two tables and there are no problems evident in the tables.
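    A rough Python analogue of what the SAS error checker produces (not the SAS code itself): normalize the waveform labels the way the space-removal and upcasing described above do, then build the two review tables. The column name is an assumption.

    ```python
    import pandas as pd

    def check_waveforms(df: pd.DataFrame, col: str = "waveform"):
        """Return waveform counts and a transition frequency table for manual review."""
        # Remove spaces and upcase, so " np" and "Np " both become "NP".
        labels = df[col].astype(str).str.replace(" ", "", regex=False).str.upper()
        counts = labels.value_counts()                       # table 1: record of the waveforms
        transitions = pd.crosstab(labels, labels.shift(-1))  # table 2: transition frequencies
        return counts, transitions
    ```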

  9. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 3 release notes: Adds data in the following formats: Excel. Changes project name to avoid confusing this data for the ones done by NACJD.

    Version 2 release notes: Adds data for 2017. Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto". To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data entry error values (e.g. they are larger than the agency's population, or much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value".

    For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) codes and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
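    The cleaning described above was done in R; a pandas sketch of the same replacement rule (not the author's code) might look like this.

    ```python
    import numpy as np
    import pandas as pd

    SUSPECT_VALUES = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000,
                      20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000,
                      100000, 99942]

    def blank_suspect_counts(df: pd.DataFrame) -> pd.DataFrame:
        """Set common data-entry error values to NA in the offense/auto columns only."""
        df = df.copy()
        cols = [c for c in df.columns if c.startswith(("offenses", "auto"))]
        df[cols] = df[cols].where(~df[cols].isin(SUSPECT_VALUES), np.nan)
        return df
    ```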

  10. Chicago Uber/Lyft Vehicles

    • kaggle.com
    zip
    Updated Mar 20, 2021
    Ye Joo Park (2021). Chicago Uber/Lyft Vehicles [Dataset]. https://www.kaggle.com/subwaymatch/chicago-uberlyft-vehicles
    Available download formats: zip (9871897 bytes)
    Dataset updated
    Mar 20, 2021
    Authors
    Ye Joo Park
    License

    https://www.usa.gov/government-works/

    Area covered
    Chicago
    Description

    Context

    There have been four Transportation Network Providers (often called rideshare companies 🚗) licensed to operate in Chicago. These rideshare companies are required to routinely report vehicles, drivers, and trips information to the City of Chicago, which are published to the Chicago Data Portal. The registered vehicles dataset is a great source to hone your data analytics skills!

    Content

    The reporting is done on a monthly basis, as indicated in the REPORTED_YEAR and REPORTED_MONTH columns. For each registered vehicle at a given month, the following information is provided:

    1. REPORTED_YEAR: The year in which the vehicle was reported
    2. REPORTED_MONTH: The month in which the vehicle was reported
    3. STATE: The state of the license plate
    4. MAKE: The make of the vehicle
    5. MODEL: The model of the vehicle
    6. COLOR: The color of the vehicle
    7. MODEL_YEAR: The model year of the vehicle
    8. NUMBER_OF_TRIPS: Number of trips provided in this month. Due to the complexities of matching, errors are possible in both directions. Values over 999 are converted to null as suspected error values that interfere with easy data visualization
    9. MULTIPLE_TNPS: Whether the vehicle was reported by multiple TNPs in this month. Matching is imperfect so some vehicle records that should have been combined may be separate.

    Changes from the original dataset

    The original dataset has been transformed and filtered to be more usable. I've summarized the cleaning process in this blog post. The number of rows has been reduced from ~7.5 million rows to ~1.4 million rows.

    • Remove rows with missing values in MAKE, MODEL, COLOR, YEAR columns
    • Filter vehicles with 100 or more trips
    • Filter vehicles with valid State 2-letter codes
    • Extract Year/Month Reported
    • Remove make & model combinations that are extremely rare
    • Remove LAST_INSPECTION_MONTH column

    Note that the dataset still contains areas for cleaning. Examples include:

    • Handle variations of a single vehicle make: Mercedesbenz, Merceds-benz, etc.
    • Colors contain some rows with double quotes: black vs "black".
    • Invalid color strings: " (just a double quote)
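    A hedged sketch of how the remaining issues listed above could be handled in pandas; the variant spellings in the mapping are only the ones named in the description.

    ```python
    import pandas as pd

    # Variant spellings taken from the examples above; a fuller mapping would be needed in practice.
    MAKE_FIXES = {"MERCEDESBENZ": "MERCEDES-BENZ", "MERCEDS-BENZ": "MERCEDES-BENZ"}

    def tidy_vehicles(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        # Strip stray double quotes from colors ('"black"' -> 'black') and blank out
        # values that were nothing but a quote character.
        df["COLOR"] = df["COLOR"].str.strip().str.strip('"').replace("", pd.NA)
        # Collapse variant spellings of the same make.
        df["MAKE"] = df["MAKE"].str.upper().replace(MAKE_FIXES)
        return df
    ```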

    Acknowledgements

    Chicago Data Portal

    Inspiration

    • Is there a trend in popular vehicle models over the years 2015-2019?
    • If you are a manager at a car sales company in Chicago, which make/model/color combinations should you focus more on promoting?

  11. Southern and Eastern Africa Consortium for Monitoring Educational Quality 2000 - Kingdom of Eswatini

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Mar 29, 2019
    Southern and Eastern Africa Consortium for Monitoring Educational Quality (2019). Southern and Eastern Africa Consortium for Monitoring Educational Quality 2000 - Kingdom of Eswatini [Dataset]. https://datacatalog.ihsn.org/catalog/4715
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Southern and Eastern Africa Consortium for Monitoring Educational Quality
    Time period covered
    2000
    Area covered
    Eswatini
    Description

    Abstract

    In 1991 the International Institute for Educational Planning (IIEP) and a number of Ministries of Education in Southern and Eastern Africa began to work together in order to address training and research needs in Education. The focus for this work was on establishing long-term strategies for building the capacity of educational planners to monitor and evaluate the quality of their basic education systems. The first two educational policy research projects undertaken by SACMEQ (widely known as "SACMEQ I" and "SACMEQ II") were designed to provide detailed information that could be used to guide planning decisions aimed at improving the quality of education in primary school systems.

    During 1995-1998 seven Ministries of Education participated in the SACMEQ I Project. The SACMEQ II Project commenced in 1998 and the surveys of schools, involving 14 Ministries of Education, took place between 2000 and 2004. The survey was undertaken in schools in Botswana, Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, Swaziland, Tanzania, Uganda, Zambia and Zanzibar.

    Moving from the SACMEQ I Project (covering around 1100 schools and 20,000 pupils) to the SACMEQ II Project (covering around 2500 schools and 45,000 pupils) resulted in a major increase in the scale and complexity of SACMEQ's research and training programmes.

    SACMEQ's mission is to: a) Expand opportunities for educational planners to gain the technical skills required to monitor and evaluate the quality of their education systems; and b) Generate information that can be used by decision-makers to plan and improve the quality of education.

    Geographic coverage

    National coverage

    Analysis unit

    • Pupils
    • Teachers
    • Schools

    Universe

    The target population for SACMEQ's Initial Project was defined as "all pupils at the Grade 6 level in 1995 who were attending registered government or non-government schools". Grade 6 was chosen because it was the grade level where the basics of reading literacy were expected to have been acquired.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample designs used in the SACMEQ II Project were selected so as to meet the standards set down by the International Association for the Evaluation of Educational Achievement. These standards required that sample estimates of important pupil population parameters should have sampling accuracy that was at least equivalent to a simple random sample of 400 pupils (thereby guaranteeing 95 percent confidence limits for sample means of plus or minus one tenth of a pupil standard deviation unit).

    Some Constraints on Sample Design Sample designs in the field of education are usually prepared amid a network of competing constraints. These designs need to adhere to established survey sampling theory and, at the same time, give due recognition to the financial, administrative, and socio-political settings in which they are to be applied. The "best" sample design for a particular project is one that provides levels of sampling accuracy that are acceptable in terms of the main aims of the project, while simultaneously limiting cost, logistic, and procedural demands to manageable levels. The major constraints that were established prior to the preparation of the sample designs for the SACMEQ II Project have been listed below.

    Target Population: The target population definitions should focus on Grade 6 pupils attending registered mainstream government or non-government schools. In addition, the defined target population should be constructed by excluding no more than 5 percent of pupils from the desired target population.

    Bias Control: The sampling should conform to the accepted rules of scientific probability sampling. That is, the members of the defined target population should have a known and non-zero probability of selection into the sample so that any potential for bias in sample estimates due to variations from "epsem sampling" (equal probability of selection method) may be addressed through the use of appropriate sampling weights (Kish, 1965).

    Sampling Errors: The sample estimates for the main criterion variables should conform to the sampling accuracy requirements set down by the International Association for the Evaluation of Educational Achievement (Ross, 1991). That is, the standard error of sampling for the pupil tests should be of a magnitude that is equal to, or smaller than, what would be achieved by employing a simple random sample of 400 pupils (Ross, 1985).

    Response Rates: Each SACMEQ country should aim to achieve an overall response rate for pupils of 80 percent. This figure was based on the wish to achieve or exceed a response rate of 90 percent for schools and a response rate of 90 percent for pupils within schools.

    Administrative and Financial Costs: The number of schools selected in each country should recognize limitations in the administrative and financial resources available for data collection.

    Other Constraints: The number of pupils selected to participate in the data collection in each selected school should be set at a level that will maximize validity of the within-school data collection for the pupil reading and mathematics tests.

    Note: Detailed descriptions of the sample design, sample selection, and sample evaluation procedures have been presented in the "Swaziland Working Report".

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The data collection for SACMEQ’s Initial Project took place in October 1995 and involved the administration of questionnaires to pupils, teachers, and school heads. The pupil questionnaire contained questions about the pupils’ home backgrounds and their school life; the teacher questionnaire asked about classrooms, teaching practices, working conditions, and teacher housing; and the school head questionnaire collected information about teachers, enrolments, buildings, facilities, and management. A reading literacy test was also given to the pupils. The test was based on items that were selected after a trial-testing programme had been completed.

    Cleaning operations

    Data Checking and Data Entry: Data preparation commenced soon after the main data collection was completed. The NRCs had to organize the safe return of all materials to the Ministry of Education where the data collection instruments could be checked, entered into computers, and then "cleaned" to remove errors prior to data analysis. The data-checking involved the "hand editing" of data collection instruments by a team of trained staff. They were required to check that: (i) all questionnaires, tests, and forms had arrived back from the sample schools, (ii) the identification numbers on all instruments were complete and accurate, and (iii) certain logical linkages between questions made sense (for example, the two questions to school heads concerning "Do you have a school library?" and "How many books do you have in your school library?").

    Data Cleaning: The NRCs received written instructions and follow-up support from IIEP staff in the basic steps of data cleaning using the WINDEM software. This permitted the NRCs to (i) identify major errors in the sequence of identification numbers, (ii) cross-check identification numbers across files (for example, to ensure that all pupils were linked with their own reading and mathematics teachers), (iii) ensure that all schools listed on the original sampling frame also had valid data collection instruments and vice-versa, (iv) check for "wild codes" that occurred when some variables had values that fell outside pre-specified reasonable limits, and (v) validate that variables used as linkage devices in later file merges were available and accurate.
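    A minimal sketch of a "wild code" check of the kind described in step (iv), assuming a simple dictionary of valid ranges per variable; this is an illustration, not the WINDEM procedure, and the variable names and limits are invented for the example.

    ```python
    import pandas as pd

    # Pre-specified "reasonable limits" per variable (illustrative values only).
    VALID_RANGES = {"pupil_age_years": (9, 20), "class_size": (1, 120)}

    def find_wild_codes(df: pd.DataFrame) -> pd.DataFrame:
        """List the rows and variables whose values fall outside credible limits."""
        problems = []
        for var, (lo, hi) in VALID_RANGES.items():
            bad = df.index[(df[var] < lo) | (df[var] > hi)]
            problems += [{"row": i, "variable": var, "value": df.loc[i, var]} for i in bad]
        return pd.DataFrame(problems)
    ```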

    A second phase of data preparation directed efforts towards the identification and correction of "wild codes" (which refer to data values that fall outside credible limits) and "inconsistencies" (which refer to different responses to the same, or related, questions). There were also some errors in the identification codes for teachers that needed to be corrected before data could be merged.

    During 2002 a supplementary training programme was prepared and delivered to all countries via the Internet. This training led each SACMEQ Research Team step-by-step through the required data cleaning procedures - with the NRCs supervising "hands-on" data cleaning activities and IIEP staff occasionally using advanced software systems to validate the quality of the work involved in each data-cleaning step.

    This resulted in a "cyclical" process whereby data files were cleaned by the NRC and then emailed to the IIEP for checking and then emailed back to the NRC for further cleaning.

    Response rate

    Response rates for pupils and schools were 92% and 99% respectively.

    Sampling error estimates

    The sample designs employed in the SACMEQ Projects departed markedly from the usual "textbook model" of simple random sampling. This departure demanded that special steps be taken in order to calculate "sampling errors" (that is, measures of the stability of sample estimates of population characteristics).

    In the report (Swaziland Working Report) a brief overview of various aspects of the general concept of "sampling error" has been presented. This has included a discussion of notions of "design effect", "the effective sample size", and the "Jackknife procedure" for estimating sampling errors.
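    For reference, the design effect and effective sample size mentioned above are conventionally defined as follows (general definitions, not the report's exact computation):

    ```latex
    \mathrm{DEFF} = \frac{\operatorname{Var}_{\text{complex}}(\hat\theta)}{\operatorname{Var}_{\text{SRS}}(\hat\theta)},
    \qquad
    n_{\text{eff}} = \frac{n}{\mathrm{DEFF}}
    ```

    Under these definitions, the requirement that estimates be at least as accurate as a simple random sample of 400 pupils amounts to requiring an effective sample size of at least 400, with the Jackknife procedure used to estimate the complex-design variance.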

  12. RUS_ORYX_tanks

    • kaggle.com
    zip
    Updated Nov 11, 2025
    Olivier Hubert (2025). RUS_ORYX_tanks [Dataset]. https://www.kaggle.com/datasets/ol4ubert/rus-oryx-tanks
    Available download formats: zip (275463 bytes)
    Dataset updated
    Nov 11, 2025
    Authors
    Olivier Hubert
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Detailed graphically-documented daily losses of Russian tanks according to ORYX

    About the dataset

    Detailed graphically-documented daily losses of Russian tanks according to ORYX, broken down by tank model, generation, and series-year.

    Contents

    The Excel file contains 3 sheets:
    • Model_Level: the cumulative number of tanks lost (destroyed, damaged, abandoned and captured), broken down by the precise model of the tank
    • Year_Level: the cumulative number of tanks lost (destroyed, damaged, abandoned and captured), broken down by the decade in which the tank entered production
    • SeriesYear_Level: the cumulative number of tanks lost (destroyed, damaged, abandoned and captured), broken down by the decade in which the tank entered production and the series the tank belongs to

    The column headers depend on the sheet:
    - Model_Level: the headers contain the name of the model. White spaces, dashes and other punctuation have been removed
    - Year_Level: the headers contain the decade in which the tank entered production, preceded by an 'x'. Unknown tanks are grouped under the 'xUnknown' header
    - SeriesYear_Level: the headers contain the concatenation of the decade (yyyy) and the series of the tank (up to four letters of the series name). Unknown tanks do not have an attributed decade.

    What sets this dataset apart?

    • Extensive data quality checks are performed to ensure internal consistency among the various levels of disaggregation
    • The dataset is not subject to later downward revisions of the values (the series is non-decreasing over time)

    Method of collection

    The data was collected from the ORYX website on a daily basis. Since ORYX does not provide a real-time dataset, I obtain the real-time data by using the Wayback machine and then save the snapshot as HTML code for each date.

    Upon loading the HTML file for each day, I filter each level of aggregation by using the h3 tags and only select the Tanks. At the category level, the values are found within an h3 tag, while each specific piece of equipment is found in a bullet list.

    Although ORYX provides the number by item, one of its limitations is that the reported numbers may differ from the sum of the individual pieces because of input errors or miscategorization. I therefore contrast this information with the sum of the individual pieces of equipment listed according to their state (more on this in the following section).
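    As a rough sketch of the parsing step described above (not the author's actual code; the heading text and list structure are assumptions based on the description):

```python
# Load a saved Wayback snapshot and pull the "Tanks" category from its h3
# heading and the bullet list that follows it.
from bs4 import BeautifulSoup

with open("oryx_snapshot_2023-01-01.html", encoding="utf-8") as f:  # assumed file name
    soup = BeautifulSoup(f, "html.parser")

# The category totals sit in the h3 text, e.g. "Tanks (3000, of which destroyed: ...)".
tanks_h3 = next(h3 for h3 in soup.find_all("h3")
                if h3.get_text().strip().startswith("Tanks"))
print("category line:", tanks_h3.get_text(strip=True))

# Individual models are assumed to be listed as <li> items after the heading.
for li in tanks_h3.find_next("ul").find_all("li"):
    print(li.get_text(strip=True))
```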

    Cleaning and treatment of the data

    There is an extensive cleaning process involved:
    - Cleaning of the names to remove typographical errors
    - Correcting the aggregate number for a piece of equipment if that value is implausible given the number of individual pieces listed
    - These checks ensure that the error between the aggregate and individual numbers is less than 5 in absolute terms or less than 5%, whichever condition is the most restrictive. The final number in the dataset is the minimum of the aggregate and the individual numbers (see the sketch after this list).
    - As there may be revisions after the information is first published, I take the minimum value of the remaining series. This ensures that the numbers I provide are the most conservative possible.
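    The sketch below is my reading of the reconciliation rule described in the list above, not the author's code; the function name and the fallback behaviour on large discrepancies are invented for illustration:

```python
def reconcile(aggregate: int, individual_sum: int) -> int:
    """Return the value to keep for one equipment category on one day."""
    diff = abs(aggregate - individual_sum)
    # Tolerance: 5 in absolute terms or 5%, whichever is the more restrictive.
    tolerance = min(5, 0.05 * max(aggregate, individual_sum))
    if diff > tolerance:
        # Aggregate looks implausible given the itemised list: prefer the itemised sum.
        return individual_sum
    # Otherwise keep the most conservative (minimum) of the two figures.
    return min(aggregate, individual_sum)

print(reconcile(101, 99))   # -> 99
```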

    Frequency of the dataset

    I plan on updating the dataset every week, with the dataset made available on Tuesday.

    Companion datasets

    All my datasets:
    - Russian losses (materiel and personnel) according to the Ukrainian Ministry of Defense: https://www.kaggle.com/datasets/ol4ubert/rus-modukr-equipmentpersonnel
    - Ukrainian losses (materiel and personnel) according to the Russian Ministry of Defense: https://www.kaggle.com/datasets/ol4ubert/ukr-modrus-equipmentpersonnel
    - Russian losses (materiel) according to ORYX: https://www.kaggle.com/datasets/ol4ubert/rus-oryx-equipment
    - Ukrainian losses (materiel) according to ORYX: https://www.kaggle.com/datasets/ol4ubert/ukr-oryx-equipment
    - Russian tank losses according to ORYX: https://www.kaggle.com/datasets/ol4ubert/rus-oryx-tanks
    - Ukrainian tank losses according to ORYX: https://www.kaggle.com/datasets/ol4ubert/ukr-oryx-tanks
    - Ukrainian personnel losses (UALosses): https://www.kaggle.com/datasets/ol4ubert/confirmed-ukrainian-military-personnel-losses
    - Russian personnel losses (KilledInUkraine): https://www.kaggle.com/datasets/ol4ubert/confirmed-russian-military-officers-losses
    - Ukrainian losses in Kursk (materiel and personnel) according to the Russian Ministry of Defense: https://www.kaggle.com/datasets/ol4ubert/ukrainian-military-losses-in-kursk-mod-russia

    Any comment is welcome. Please use the Discussion feature or send me an email directly.

  13. Student Performance and Attendance Dataset

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    Marvy Ayman Halim (2025). Student Performance and Attendance Dataset [Dataset]. https://www.kaggle.com/datasets/marvyaymanhalim/student-performance-and-attendance-dataset
    Explore at:
    zip (5849540 bytes)
    Dataset updated
    Mar 10, 2025
    Authors
    Marvy Ayman Halim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Description: This synthetic dataset is designed to help beginners and intermediate learners practice data cleaning and analysis in a realistic setting. It simulates a student tracking system, covering key areas like:

    Attendance tracking 📅

    Homework completion 📝

    Exam performance 🎯

    Parent-teacher communication 📢

    ✅ Why Use This Dataset? While many datasets are pre-cleaned, real-world data is often messy. This dataset includes intentional errors to help you develop essential data cleaning skills before diving into analysis. It’s perfect for building confidence in handling raw data!

    🛠️ Cleaning Challenges You’ll Tackle This dataset is packed with real-world issues, including:

    Messy data: Names in lowercase, typos in attendance status.

    Inconsistent date formats: Mix of MM/DD/YYYY and YYYY-MM-DD.

    Incorrect values: Homework completion rates in mixed formats (e.g., 80% and 90).

    Missing data: Guardian signatures, teacher comments, and emergency contacts.

    Outliers: Exam scores over 100 and negative homework completion rates.

    🚀 Your Task: Clean, structure, and analyze this dataset using Python or SQL to uncover meaningful insights!

    📌 5. Handle Outliers

    Remove exam scores above 100.

    Convert homework completion rates to consistent percentages.
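    A minimal pandas sketch of the cleaning and outlier-handling steps above; the column names (student_name, attendance_status, date, homework_completion, exam_score) and the file name are assumptions, not the dataset's actual schema:

```python
import pandas as pd

df = pd.read_csv("student_performance.csv")   # hypothetical file name

# Messy text: normalise capitalisation of names and attendance status.
df["student_name"] = df["student_name"].str.strip().str.title()
df["attendance_status"] = df["attendance_status"].str.strip().str.capitalize()

# Inconsistent date formats: parse both MM/DD/YYYY and YYYY-MM-DD
# (format="mixed" requires pandas >= 2.0).
df["date"] = pd.to_datetime(df["date"], format="mixed", errors="coerce")

# Homework completion: strip "%" and coerce to a 0-100 number.
df["homework_completion"] = (
    df["homework_completion"].astype(str).str.rstrip("%").astype(float)
)

# Outliers: drop impossible exam scores and negative completion rates.
df = df[(df["exam_score"] <= 100) & (df["homework_completion"] >= 0)]
```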

    📌 6. Generate Insights & Visualizations

    What’s the average attendance rate per grade?

    Which subjects have the highest performance?

    What are the most common topics in parent-teacher communication?

  14. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018 [Dataset]. http://doi.org/10.3886/E105403
    Explore at:
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 4 release notes:
    - Adds data for 2018.

    Version 3 release notes:
    - Adds data in the following formats: Excel.
    - Changes the project name to avoid confusing this data with the versions released by NACJD.

    Version 2 release notes:
    - Adds data for 2017.
    - Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. robbery is broken down into subcategories including highway robbery, bank robbery, and gas station robbery). The majority of the data relates to theft, which is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data entry error values (e.g. they are larger than the agency's population, or much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value." For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data; I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC adds FIPS (state, county, and place) codes and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
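    The author's cleaning was done in R (linked above); purely as an illustration, a pandas equivalent of the NA-replacement rule described above might look like the following, with the file name assumed:

```python
# Replace common data-entry error values with NA, but only in the "offenses*"
# and "auto*" columns; the "value*" columns are left untouched.
import numpy as np
import pandas as pd

error_values = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000,
                10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000,
                90000, 100000, 99942]

df = pd.read_csv("property_stolen_recovered.csv")   # hypothetical file name

target_cols = [c for c in df.columns if c.startswith(("offenses", "auto"))]
df[target_cols] = df[target_cols].replace(error_values, np.nan)
```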

  15. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • openicpsr.org
    Updated Aug 14, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2019 [Dataset]. http://doi.org/10.3886/E105403V5
    Explore at:
    Dataset updated
    Aug 14, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1960 - 2019
    Area covered
    United States
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, cite it.

    Version 5 release notes:
    - Adds data for 2019.
    - Note that the "number_of_months_reported" variable changes sharply starting in 2018. This is probably due to changes in UCR reporting of the "status" variable, which is used to generate the months-missing count (the code I used does not change), so pre-2018 and 2018+ years may not be comparable for this variable.

    Version 4 release notes:
    - Adds data for 2018.

    Version 3 release notes:
    - Adds data in the following formats: Excel.
    - Changes the project name to avoid confusing this data with the versions released by NACJD.

    Version 2 release notes:
    - Adds data for 2017.
    - Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. robbery is broken down into subcategories including highway robbery, bank robbery, and gas station robbery). The majority of the data relates to theft, which is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data entry error values (e.g. they are larger than the agency's population, or much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value." For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data; I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC adds FIPS (state, county, and place) codes and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.

  16. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  17. Google Capstone Project Portfolio

    • kaggle.com
    zip
    Updated Sep 8, 2021
    Cite
    Austin Roush (2021). Google Capstone Project Portfolio [Dataset]. https://www.kaggle.com/datasets/austinroush/google-capstone-project-cyclistic-case-study
    Explore at:
    zip (194362162 bytes)
    Dataset updated
    Sep 8, 2021
    Authors
    Austin Roush
    Description

    Ask

    Business Task:

    Analyze Cyclistic historical bike trip data to identify trends that explain how annual members and casual riders differ. Transform data into actionable insights and create compelling data visualizations that explain why casual riders should purchase an annual membership. Design a new marketing strategy to convert casual riders into annual members. Use digital media to create effective marketing targeted at casual riders, explaining why it would be beneficial to become an annual member.

    Key stakeholders to be considered are Cyclistic customers, Lily Moreno, the Cyclistic marketing analytics team, as well as the Cyclistic executive team. Cyclistic customers include casual riders and members, some with disabilities that use assistive options. Only 30% of riders use Cyclistic to commute to work, while most riders use the bike-share service for leisure. Lily Moreno is the director of marketing. The marketing analytics team helps guide the marketing strategy. The executive team decides whether to approve the recommended marketing program.

    Prepare

    A description of all data sources used:

    Cyclistic bike-share historical trip data is public. It is located on the Divvy website. The .CSV files are sorted by year and month, dating back to 2013. The data is not real-time, but it is current because it is published every month. Each file has comprehensive data on individual rider IDs, bike type, time and date of the trip, station location information, and whether each rider is a casual rider or a member.

    The Divvy website includes the following system data:

    Each trip is anonymized and includes:
    • Trip start day and time
    • Trip end day and time
    • Trip start station
    • Trip end station
    • Rider type (Member, Single Ride, and Day Pass)

    The data has been filtered to remove trips that are taken by staff as they service and inspect the system, and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).

    The Data License Agreement explains that Motivate International Inc. (“Motivate”) operates the City of Chicago’s (“City”) Divvy bike-share service. The City of Chicago is the owner of all Divvy data and makes it accessible to the public. Lyft is the operator of Divvy in Chicago. Lyft has a privacy policy that explains their commitment to respecting our personal information.

    The Divvy Data License Agreement explains the following:

    • License. Motivate hereby grants to you a non-exclusive, royalty-free, limited, perpetual license to access, reproduce, analyze, copy, modify, distribute in your product or service and use the Data for any lawful purpose (“License”).

    • No Warranty. THE DATA IS PROVIDED “AS IS,” AS AVAILABLE (AT MOTIVATE’S SOLE DISCRETION) AND AT YOUR SOLE RISK. TO THE MAXIMUM EXTENT PROVIDED BY LAW MOTIVATE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. MOTIVATE FURTHER DISCLAIMS ANY WARRANTY THAT THE DATA WILL MEET YOUR NEEDS OR WILL BE OR CONTINUE TO BE AVAILABLE, COMPLETE, ACCURATE, TIMELY, SECURE, OR ERROR FREE.

    Although the Divvy system data is generally reliable, the "No Warranty" terms and conditions mean there is no guarantee that the data will be "AVAILABLE, COMPLETE, ACCURATE, TIMELY, SECURE, OR ERROR FREE." The credibility of the data could be negatively affected if the provider is not held responsible for errors.

    Sampling bias could take place because Chicago ridership is significantly affected by weather, and there is an influx of tourists at certain times of the year. The effects of weather and tourism on the data can be accounted for because these influences recur predictably each year.

    Divvy bike-share consistently providing accurate data is necessary to create and follow through with an effective marketing strategy. All the data is original and owned by the City of Chicago making it a credible source. Lyft is also a credible source because they have the technology to accurately collect data. Although the Data License Agreement states that it has “No Warranty,” the source of the data and the way it is managed makes it credible. Divvy bike-share data is cited using the following:

    • Divvy (https://www.divvybikes.com)
    • Divvy Historical Data (https://divvy-tripdata.s3.amazonaws.com/index.html)
    • Divvy System Data (https://www.divvybikes.com/system-data)
    • Divvy Data License Agreement (https://www.divvybikes.com/data-license-agreement)
    • Lyft’s Privacy Policy (https://www.lyft.com/privacy)

    The sources of the data confirm data credibility. The data is detailed and thorough making it effective and efficient for marketing purposes.

    Process

    Documentation of any cleaning or manipulation of data:

    1. Format Cells --> Alignment --> Shrink to Fit top row

    2. Data --> Remove Duplicates

    3. Create and calculate new ...
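    The steps above are spreadsheet operations; purely as an illustration (not part of the original case study), a rough pandas equivalent under assumed column names (ride_id, started_at, ended_at) and a hypothetical derived field might look like:

```python
import pandas as pd

# Hypothetical monthly trip file; column names are assumptions based on the description.
trips = pd.read_csv("divvy_tripdata_2021_08.csv",
                    parse_dates=["started_at", "ended_at"])

# Step 2: remove duplicate trips.
trips = trips.drop_duplicates(subset="ride_id")

# Step 3 is truncated above; as one plausible calculated field, compute trip length
# and drop likely false starts (under 60 seconds), mirroring Divvy's own filtering.
trips["ride_length"] = trips["ended_at"] - trips["started_at"]
trips = trips[trips["ride_length"] >= pd.Timedelta(seconds=60)]
```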

  18. COVID-19 High Frequency Survey 2020-2022 - Georgia

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Feb 8, 2023
    + more versions
    Cite
    Caucasus Research Resource Centers (CRRC) (2023). COVID-19 High Frequency Survey 2020-2022 - Georgia [Dataset]. https://microdata.worldbank.org/index.php/catalog/3837
    Explore at:
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Caucasus Research Resource Centers
    Authors
    Caucasus Research Resource Centers (CRRC)
    Time period covered
    2020 - 2022
    Area covered
    Georgia
    Description

    Abstract

    Having reliable, timely data on poverty and inequality is critical to assess the distributional impact of, and recovery from, COVID-19 and high inflation on households, and to make near-real-time, evidence-based strategic decisions. Partnering with the Swedish International Development Cooperation Agency (Sida) and Caucasus Research Resource Centers (CRRC), the South Caucasus team in the Poverty and Equity Global Practice at the World Bank conducted a series of Georgia High Frequency Surveys to monitor the impact of these events on households in Georgia. This eighth round of the survey is augmented by including questions on the impact of high inflation, disruption in employment and schooling, concerns over environmental risks, and access to health services during the COVID-19 pandemic.

    Geographic coverage

    National coverage, representative at the national, rural/urban/Tbilisi-levels.

    Analysis unit

    Household, Individual (adult over age 18)

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The survey is based on phone interviews using Computer Assisted Telephone Interviewing (CATI) and random digit dialing (RDD). The sampling frame is representative of the national and rural/urban/Tbilisi population. Around 2,000 valid interviews were completed in each round, with response rates around 40%.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The COVID-19 Georgia High Frequency Survey (GHFS) 2020-22 Wave 1 comprises the following modules: 1- Household Identification, 2- Household Demographics, 3- Assets and Access to Internet, 4- Prevalence of COVID-19, 5- Distance Learning, 6- Employment Dynamics, 7- Income, 8- Food Security, 9- Shocks and Coping Strategies, 10- Vaccine, 11- Perception.

    In waves 2, 3, and 4, a module on remittances was added. In wave 4, the module on shocks and coping strategies was dropped, but a question on job disruption was added. In waves 5 and 6, modules on the impact of inflation and on time use were added, and questions on income and remittances were dropped. In wave 7, questions on income and remittances were brought back. In wave 8, questions on the perception of environmental risks, perception of the country's development, and health services accessibility during the COVID-19 pandemic were added.

    Cleaning operations

    Data cleaning was carried out to identify and, where possible, correct inconsistencies. In addition, open-ended questions with textual responses were recoded so that these answers matched numeric codes. With CATI, the cleaning process was straightforward: pre-programmed questionnaire forms prevented ambiguous codes from being entered in the dataset, and the form did not accept errors such as selecting more values than permitted by the questionnaire. Additional protocols for data cleaning are summarized in the CRRC Fieldwork Report.
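    As a small illustration of the recoding step described above (the codes, labels, and the fallback "other" code are invented for the example, not CRRC's actual codebook):

```python
import pandas as pd

# Hypothetical mapping from cleaned text answers to numeric codes.
answer_codes = {"yes": 1, "no": 2, "refuse to answer": -1, "don't know": -2}

responses = pd.Series(["Yes", "no ", "Don't know", "other: lost job"])
# Normalise the text, map to codes, and send unmapped answers to an assumed "other" code.
recoded = responses.str.strip().str.lower().map(answer_codes).fillna(96)
print(recoded.tolist())   # [1.0, 2.0, -2.0, 96.0]
```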

    Response rate

    Response rates were around 40%.

  19. Rules for resolving Mendelian inconsistencies in nuclear pedigrees typed for...

    • datasetcatalog.nlm.nih.gov
    Updated Mar 2, 2017
    Cite
    Manzoor, Sadaf; Alamgir; Khalil, Umair; Ali, Amjad; Khan, Dost Muhammad; Khan, Sajjad Ahmad (2017). Rules for resolving Mendelian inconsistencies in nuclear pedigrees typed for two-allele markers [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001792998
    Explore at:
    Dataset updated
    Mar 2, 2017
    Authors
    Manzoor, Sadaf; Alamgir; Khalil, Umair; Ali, Amjad; Khan, Dost Muhammad; Khan, Sajjad Ahmad
    Description

    Gene-mapping studies regularly rely on examining Mendelian transmission of marker alleles in a pedigree as a way of screening for genotyping errors and mutations. For analysis of family data sets, it is usually necessary to resolve or remove the genotyping errors prior to analysis. At the Center for Inherited Disease Research (CIDR), to deal with their large-scale data flow, they formalized their data cleaning approach in a set of rules based on PedCheck output. We scrutinize, via carefully designed simulations, how well CIDR's data cleaning rules work in practice. We found that genotype errors in siblings are detected more often than in parents for less polymorphic SNPs, and vice versa for more polymorphic SNPs. Through computer simulations, we conclude that some of CIDR's rules work poorly in some circumstances, and we suggest a set of modified data cleaning rules that may work better than CIDR's rules.
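    For readers unfamiliar with the underlying check, a minimal sketch of a Mendelian-consistency test for one two-allele marker in a parent-offspring trio (illustrative only; PedCheck and the CIDR rules handle full pedigrees and many more cases):

```python
# A child must be able to receive one allele from each parent; genotypes are
# unordered pairs of alleles, e.g. ("A", "B") for a heterozygote.
from itertools import product

def mendelian_consistent(father: tuple, mother: tuple, child: tuple) -> bool:
    possible = {tuple(sorted(p)) for p in product(father, mother)}
    return tuple(sorted(child)) in possible

print(mendelian_consistent(("A", "A"), ("A", "B"), ("A", "B")))  # True
print(mendelian_consistent(("A", "A"), ("A", "A"), ("A", "B")))  # False: flagged as an error
```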

  20. 101 flagellate phylogenomics data

    • figshare.com
    application/x-gzip
    Updated Dec 18, 2024
    Cite
    Guifré Torruella; Luis Javier Galindo; Purificación López-García; David Moreira (2024). 101 flagellate phylogenomics data [Dataset]. http://doi.org/10.6084/m9.figshare.22148027.v2
    Explore at:
    application/x-gzip
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Guifré Torruella; Luis Javier Galindo; Purificación López-García; David Moreira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Phylogenomics dataset and the generated transcriptomic data for the study of 7 ancyromonads, 14 apusomonads and Meteora sporadica CRO19MET.
    - Markers and supermatrices: phylogenomics_101_flagellates_97171aa.tar.gz
    - Raw transcripts and peptides used for phylogenomics: 22_transcriptomes_brut.tar.gz
    - Transcripts and peptides without cross-contamination due to batch extraction/sequencing: 22_transcriptomes_croco.tar.gz
    - Peptides without bacterial contamination and redundancy: 22_transcriptomes_eukpep.tar.gz
    - SRA in BioProject: PRJNA908224

    Detailed explanation, read carefully before using these datasets:

    The scope of this study was to generate enough conserved phylogenomic markers to resolve the species phylogeny of Apusomonadida and Ancyromonadida in the tree of eukaryotes (with the additional inclusion of the incertae sedis protist Meteora sporadica). For that, the original sets of de novo assembled transcripts from SPAdes (folder 01_transcripts_brut) were translated to proteins using TransDecoder and CD-HIT at 1% identity (folder 02_peptides_brut), and used to fill the phylogenomic dataset using BLASTp. As explained in the main text, all 22 filled the dataset well (Table S1) and had a high percentage of BUSCO completeness (Table S2), higher even than the reference apusomonad genome of Thecamonas trahens. We do not encourage the use of these raw ("brut") sets unless all further analyses can be carefully checked on a case-by-case basis. Hence, with the aim of providing good-quality data to the research community, we implemented the decontamination pipeline discussed below.

    From the original set of de novo assembled transcripts, CroCo detected most cross-contamination within the 1st sequencing batch (Table S3), which was also the one with more reads (> 10 million reads, compared to < 8 million reads in the 2nd and 3rd batches; Table S2). From the de-cross-contaminated transcripts (folder 03_transcripts_croco), the number of predicted peptides was much larger (from 26.19% to 68.81% more), except for Ancyromonas kenti, which had around ten times more transcripts than other species (Table S2). This is because TransDecoder produces multiple peptides per transcript, which might not be real. After removing cross-contamination, the percentage of BUSCO completeness did not decrease for any species. There were some observed differences between taxa, such as apusomonads having more transcripts and peptides than ancyromonads, although it might be irrelevant to scrutinize partial transcriptomic data without genomic data backing up the results. Similarly, the 1st batch provided more transcripts and peptides than the 2nd and 3rd ones, probably because it had more reads to begin with. From that point, we proceeded with only the peptides (folder 04_peptides_croco).

    Then, the supervised cleaning process with BAUVdb (Bacteria, Archaea, eUkaryotes and Viruses; Table S4) detected a low percentage of eukaryotic peptides: from 6.65% in Ancyromonas kenti up to 17.11% in Fabomonas mesopelagica (folder 05_eukaryotic_peptides). The percentage of BUSCO completeness decreased for the subset with only eukaryotic hits, from only 0.4% in Chelonemonas dolani up to 15.6% in Mylnikovia oxoniensis (the transcriptome with the most peptides). Apusomonas proboscidea, due to being co-sequenced with a stramenopile, had 27% lower BUSCO completeness.
On average, completeness decreased by 7.2% after cleaning the data of non-eukaryotic contaminants, which might represent a loss of truly eukaryotic peptides due to the limited taxon sampling of the BAUVdb (Tables S2 and S4). Regarding the eggNOG-mapper analysis, only about half of the peptides were annotated (55.63% on average), from 48.78% in Mylnikovia oxoniensis up to 62.07% in Chelonemonas dolani. Altogether, BUSCO completeness decreased by 4.2% in Chelonemonas geobuk and up to 19.4% in Ancyromonas mediterranea. Overall, we encourage anyone to use the subset of eukaryotic peptides for comparative genomics studies, in which the proteins under study can be easily checked.

Since de novo transcriptomes are prone to show artificially duplicated peptides in comparative genomics analyses, we tested the peptide redundancy using CD-HIT at 90% identity. This procedure removed few peptides for most species (6.48% on average), except for the highly duplicated Mylnikovia oxoniensis (~42.1%), as well as Multimonas media (20.51%), Apusomonas australiensis (15.5%) and Cavaliersmithia chaoae (9.57%). These four apusomonad species from the 1st sequencing batch are the ones with the most transcripts and predicted peptides, but, like the other species from that batch, they have a similar number of sequencing reads. As of now, it is not possible to discern whether methodological issues or a biological cause such as genome duplication or high alternative splicing explains these differences. Interestingly, the BUSCO completeness value was identical for all species. Although the 255 BUSCO markers are just a small subset of peptides, we suspect this redundancy-reduction step did not remove information, but rather errors introduced during the processing of the data. We suggest that users of this data use this set (folder 06_eukpep_cdhit90pid) for high-throughput comparative genomics analyses, but always taking into account the information given here.

Also, we did not observe any differences in the number of proteins, percentage of BUSCO completeness, or number of eggNOG-annotated peptides between the apusomonad and ancyromonad lineages, nor between marine and freshwater organisms, nor between large and small apusomonads. Interestingly, we found that the subset of only eukaryotic peptides ranged from ~30% BUSCO completeness (using the bacteria db10) in Chelonemonas geobuk up to 50% in Mylnikovia oxoniensis, a value similar to that found for the previously sequenced Thecamonas trahens RefSeq proteins (47.5%). In future studies, it would be interesting to compare these numbers with genomic data and see how well suited RNAseq is to perform further comparative genomics analyses.
