CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
https://www.datainsightsmarket.com/privacy-policy
The data cleaning tools market is experiencing robust growth, driven by the exponential increase in data volume and variety across industries. The rising need for high-quality data for accurate business intelligence, machine learning, and data-driven decision-making fuels demand for efficient and automated data cleaning solutions. While the precise market size in 2025 is unavailable, considering a conservative Compound Annual Growth Rate (CAGR) of 15% from a hypothetical 2019 market size of $5 billion (a reasonable starting point given the prevalence of data management needs), we can estimate the 2025 market size to be around $10 billion. This growth is further accelerated by trends like cloud adoption, the increasing sophistication of data cleaning algorithms (including AI and machine learning integration), and a growing awareness of data quality's impact on business outcomes. Leading players like Dundas BI, IBM, Sisense, and others are actively developing and enhancing their offerings to meet this demand. However, restraints such as the complexity of integrating data cleaning tools into existing systems and the need for skilled personnel to manage and utilize these tools continue to pose challenges.

Segmentation within the market is likely to follow deployment models (cloud, on-premise), data types handled (structured, unstructured), and industry verticals (finance, healthcare, retail). The forecast period (2025-2033) suggests continued market expansion, propelled by further technological advancements and broader adoption across various sectors. The long-term projection anticipates a sustained CAGR, although it may moderate slightly as the market matures, potentially settling around 12-13% in the later years of the forecast.

The competitive landscape is dynamic, with established players and emerging startups vying for market share. Companies are focusing on improving the usability and accessibility of their data cleaning tools, making them easier to integrate with other business intelligence platforms and enterprise systems. This integration will be vital for seamless data workflows and broader adoption. Strategic partnerships and acquisitions are likely to reshape the competitive dynamics in the years to come. Geographical variations in market maturity will influence regional growth rates, with regions like North America and Europe expected to maintain a strong presence, while Asia-Pacific and other emerging economies could see faster growth driven by increasing digitalization. Further research into specific regional data is needed to provide more precise figures and assess the localized market dynamics accurately.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
- Outliers: detected and handled based on domain logic and distribution analysis.
- Categorization: converted numeric ages into grouped age categories for comparative analysis.
- Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
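As an illustration, here is a minimal pandas sketch of the transformations listed above. The file name, column names, bin edges, and the IQR outlier rule are assumptions for illustration, not the dataset's actual pipeline.

```python
import pandas as pd

# File and column names are illustrative, not the dataset's actual schema.
df = pd.read_csv("employment_raw.csv")

# Inconsistent formatting: unify column naming, capitalization, and spacing.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# Duplicate records: drop exact row duplicates to prevent analytical skew.
df = df.drop_duplicates()

# Incorrect data types: salary stored as string/object -> float.
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing values: drop rows missing critical fields, impute the rest.
df = df.dropna(subset=["Employment Status"])
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(
    df["Monthly Salary (INR)"].median())

# Outliers: a simple IQR rule stands in for the domain-specific checks described.
q1, q3 = df["Monthly Salary (INR)"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["Monthly Salary (INR)"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Categorization: numeric ages -> grouped age categories.
df["Age Group"] = pd.cut(df["Age"], bins=[17, 25, 35, 50, 65],
                         labels=["18-25", "26-35", "36-50", "51-65"])
```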
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we work on repairing three datasets.

Clinical trials: each sample describes a clinical trial conducted in a country, identified by country_protocol_code. Several countries conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.

Allergens: each sample describes a product identified by code. Samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by '2' if present, '1' if there are traces of it, and '0' if absent in a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets.
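To make the 0/1/2 encoding concrete, here is a small pandas sketch that decodes the allergen indicators and flags products whose sources disagree. The file name and the identifier columns ('code', 'source', 'ingredients') are assumptions based on the description above.

```python
import pandas as pd

# Hypothetical file name; the real '.csv' files ship inside the '.zip' archives.
allergens = pd.read_csv("allergens.csv")

# Every column except the identifiers is an allergen indicator: 2 = present,
# 1 = traces, 0 = absent. Map both int and string codes, in case of mixed dtypes.
code_labels = {2: "present", 1: "traces", 0: "absent",
               "2": "present", "1": "traces", "0": "absent"}
allergen_cols = [c for c in allergens.columns
                 if c not in ("code", "source", "ingredients")]
decoded = allergens[allergen_cols].replace(code_labels)

# Samples sharing a 'code' are the same product from different sources;
# flag products whose sources report different allergen codes.
conflicts = allergens.groupby("code")[allergen_cols].nunique().gt(1).any(axis=1)
print("products with conflicting allergen labels:", int(conflicts.sum()))
```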
https://www.datainsightsmarket.com/privacy-policy
The Data Preparation Tools market is experiencing robust growth, projected to reach a significant market size by 2033. Driven by the exponential increase in data volume and variety across industries, coupled with the rising need for accurate, consistent data for effective business intelligence and machine learning initiatives, this sector is poised for continued expansion. The 18.5% Compound Annual Growth Rate (CAGR) signifies strong market momentum, fueled by increasing adoption across diverse sectors like IT and Telecom, Retail & E-commerce, BFSI (Banking, Financial Services, and Insurance), and Manufacturing. The preference for self-service data preparation tools empowers business users to directly access and prepare data, minimizing reliance on IT departments and accelerating analysis. Furthermore, the integration of data preparation tools with advanced analytics platforms and cloud-based solutions is streamlining workflows and improving overall efficiency. This trend is further augmented by the growing demand for robust data governance and compliance measures, necessitating sophisticated data preparation capabilities.

While the market shows significant potential, challenges remain. The complexity of integrating data from multiple sources and maintaining data consistency across disparate systems present hurdles for many organizations. The need for skilled data professionals to effectively utilize these tools also contributes to market constraints. However, ongoing advancements in automation and user-friendly interfaces are mitigating these challenges. The competitive landscape is marked by established players like Microsoft, Tableau, and IBM, alongside innovative startups offering specialized solutions. This competitive dynamic fosters innovation and drives down costs, benefiting end-users. The market segmentation by application and tool type highlights the varied needs and preferences across industries, and understanding these distinctions is crucial for effective market penetration and strategic planning. Geographical expansion, particularly within rapidly developing economies in Asia-Pacific, will play a significant role in shaping the future trajectory of this thriving market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The mean preservation of data (PD), sensitivity, specificity, and convergence rate across different rates and types of simulated errors and duplications, for uncleaned data, de-duplicated data, and data cleaned with five data cleaning approaches with and without our algorithm (A), for longitudinal growth measurements from CLOSER data.
https://www.archivemarketresearch.com/privacy-policy
The data cleansing software market is expanding rapidly, with a market size of XXX million in 2023 and a projected CAGR of XX% from 2023 to 2033. This growth is driven by the increasing need for accurate and reliable data in various industries, including healthcare, finance, and retail. Key market trends include the growing adoption of cloud-based solutions, the increasing use of artificial intelligence (AI) and machine learning (ML) to automate the data cleansing process, and the increasing demand for data governance and compliance. The market is segmented by deployment type (cloud-based vs. on-premise) and application (large enterprises vs. SMEs vs. government agencies). Major players in the market include IBM, SAS Institute Inc, SAP SE, Trifacta, OpenRefine, Data Ladder, Analytics Canvas (nModal Solutions Inc.), Mo-Data, Prospecta, WinPure Ltd, Symphonic Source Inc, MuleSoft, MapR Technologies, V12 Data, and Informatica. This report provides a comprehensive overview of the global data cleansing software market, with a focus on market concentration, product insights, regional insights, trends, driving forces, challenges and restraints, growth catalysts, leading players, and significant developments.
https://www.datainsightsmarket.com/privacy-policy
The data cleansing software market size is valued at XXX million in 2025 and is projected to reach USD XXX million by 2033, exhibiting a CAGR of XX% during the forecast period. The growing volume of data, increasing data complexity, and stringent data regulations are driving the adoption of data cleansing software. Moreover, the advancements in artificial intelligence (AI) and machine learning (ML) technologies are enhancing the capabilities of data cleansing tools, making them more efficient and accurate. The market is segmented by application, type, and region. Large enterprises hold a significant market share due to their extensive data processing needs. Cloud-based solutions are gaining popularity as they offer scalability, flexibility, and cost-effectiveness. North America and Europe are the prominent regions, owing to the presence of well-established IT infrastructure and stringent data protection laws.

Key players in the market include IBM, SAS Institute Inc, SAP SE, Trifacta, OpenRefine, Data Ladder, Analytics Canvas (nModal Solutions Inc.), Mo-Data, Prospecta, WinPure Ltd, Symphonic Source Inc, MuleSoft, MapR Technologies, and V12 Data. These companies are investing in research and development to offer innovative data cleansing solutions that meet the evolving needs of businesses.

Data cleansing involves identifying and correcting inaccuracies and inconsistencies in data. Amidst the rapid data proliferation, the demand for efficient data cleansing solutions has surged. The global data cleansing software market is estimated to reach $12.1 billion by 2028, growing at a CAGR of 10.5% from 2022 to 2028.
https://creativecommons.org/publicdomain/zero/1.0/
The original dataset found on Kaggle had fewer columns, some of which grouped two separate variables together. Furthermore, many numeric values were stored as strings instead of integers, since they were typed as numbers followed by words, for instance: Condition: 2 Accidents, 3 previous owners. This one column was split into two separate columns, Accidents and Owners: the string characters were removed and the numbers were then converted to integer type. Like this example, many other columns were modified, along with other cleaning and organizational techniques, using Python.
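A minimal pandas sketch of that splitting step, reconstructed from the example above (the toy frame and the regex are illustrative, not the author's original code):

```python
import pandas as pd

# Toy frame reproducing the example above; not the author's original code.
df = pd.DataFrame({"Condition": ["2 Accidents, 3 previous owners",
                                 "1 Accident, 1 previous owner"]})

# Pull each leading number out of the phrase with a regex, then cast to int.
extracted = df["Condition"].str.extract(
    r"(?P<Accidents>\d+)\s*Accidents?,\s*(?P<Owners>\d+)")
df[["Accidents", "Owners"]] = extracted.astype(int)
df = df.drop(columns=["Condition"])
print(df)
```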
https://www.wiseguyreports.com/pages/privacy-policy
BASE YEAR | 2024 |
HISTORICAL DATA | 2019 - 2024 |
REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
MARKET SIZE 2023 | 3.63 (USD Billion) |
MARKET SIZE 2024 | 4.02 (USD Billion) |
MARKET SIZE 2032 | 9.2 (USD Billion) |
SEGMENTS COVERED | Deployment, Organization Size, Application, Data Type, Industry Vertical, Regional |
COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
KEY MARKET DYNAMICS | Increasing Data Volumes, Stringent Data Privacy Regulations, Growing Need for Accurate Data, Advancements in Artificial Intelligence, Cloud-Based Deployment |
MARKET FORECAST UNITS | USD Billion |
KEY COMPANIES PROFILED | Melissa Data, Oracle, SAS Institute, TransUnion, Equifax, Dun & Bradstreet, Experian Data Quality, Talend, IBM, Informatica, Acxiom, Experian, SAP, LexisNexis Risk Solutions |
MARKET FORECAST PERIOD | 2024 - 2032 |
KEY MARKET OPPORTUNITIES | 1. Cloud-based data cleansing 2. AI-powered data cleansing 3. Data privacy and compliance 4. Big data analytics 5. Self-service data cleansing |
COMPOUND ANNUAL GROWTH RATE (CAGR) | 10.89% (2024 - 2032) |
https://www.wiseguyreports.com/pages/privacy-policy
BASE YEAR | 2024 |
HISTORICAL DATA | 2019 - 2024 |
REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
MARKET SIZE 2023 | 2.67 (USD Billion) |
MARKET SIZE 2024 | 2.95 (USD Billion) |
MARKET SIZE 2032 | 6.5 (USD Billion) |
SEGMENTS COVERED | Application, Deployment Type, End User, Features, Regional |
COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
KEY MARKET DYNAMICS | data quality improvement, regulatory compliance demand, cloud integration growth, advanced analytics adoption, increasing data volumes |
MARKET FORECAST UNITS | USD Billion |
KEY COMPANIES PROFILED | Trifacta, Melissa Data, Pitney Bowes, Microsoft, IBM, Dun and Bradstreet, Experian, Talend, Oracle, TIBCO Software, Informatica, Data Ladder, Precisely, SAP, SAS |
MARKET FORECAST PERIOD | 2025 - 2032 |
KEY MARKET OPPORTUNITIES | AI-driven automation integration, Rising demand for data quality, Increased regulatory compliance requirements, Expansion in e-commerce sectors, Growing adoption of cloud solutions |
COMPOUND ANNUAL GROWTH RATE (CAGR) | 10.38% (2025 - 2032) |
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level.

The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as the profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices, identifying the characteristics of the poor, and drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The 2008 Household Expenditure and Income Survey sample was designed using a two-stage stratified cluster sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn with probability proportionate to size, taking the number of households in each block as the block size. In the second stage, the household sample (8 households from each PSU) was drawn using the systematic sampling method. Four substitute households were also drawn from each PSU, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.
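To make the two-stage design concrete, here is a minimal Python sketch of the selection logic under stated assumptions: the block sizes are toy numbers, and numpy's weighted draw without replacement only approximates a formal PPS (probability proportionate to size) scheme.

```python
import numpy as np

rng = np.random.default_rng(2008)

# Stage 1: draw blocks (PSUs) with probability proportionate to size, the size
# being the number of households per block. Toy block sizes for illustration.
block_sizes = np.array([120, 85, 210, 150, 95, 180])
n_psus = 3
sampled_blocks = rng.choice(len(block_sizes), size=n_psus, replace=False,
                            p=block_sizes / block_sizes.sum())

# Stage 2: within each sampled block, draw 8 households systematically:
# a random start, then every k-th household (k = block size / 8).
def systematic_sample(n_households: int, n_draws: int = 8) -> np.ndarray:
    k = n_households / n_draws
    start = rng.uniform(0, k)
    return (start + k * np.arange(n_draws)).astype(int)

for b in sampled_blocks:
    print(f"block {b}: households {systematic_sample(block_sizes[b])}")
```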
To estimate the sample size, the coefficient of variation and the design effect in each sub-district were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, such that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level, to ensure good cluster representation in the administrative areas and enable delineating poverty pockets.
It is worth mentioning that the expected non-response in addition to areas where poor families are concentrated in the major cities were taken into consideration in designing the sample. Therefore, a larger sample size was taken from these areas compared to other ones, in order to help in reaching the poverty pockets and covering them.
Face-to-face [f2f]
List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form
Raw Data
The design and implementation of this survey comprised the following procedures:
1. Sample design and selection
2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparing instruction manuals
3. Design of the table templates to be used for the dissemination of the survey results
4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
5. Selection and training of survey staff to collect data and run required data checks
6. Preparation and implementation of the pretest phase for the survey, designed to test and develop forms/questionnaires, instructions and software programs required for data processing and production of survey results
7. Data collection
8. Data checking and coding
9. Data entry
10. Data cleaning using data validation programs
11. Data accuracy and consistency checks
12. Data tabulation and preliminary results
13. Preparation of the final report and dissemination of final results
Harmonized Data
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets
- The harmonization process started with cleaning all raw data files received from the Statistical Office
- Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
- A post-harmonization cleaning process was run on the data
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format
https://www.marketreportanalytics.com/privacy-policy
The MRO (Maintenance, Repair, and Operations) Data Cleansing and Enrichment Service market is experiencing robust growth, driven by the increasing need for accurate and reliable data across various industries. The digital transformation sweeping manufacturing, oil & gas, and transportation sectors is creating a surge in data volume, but much of this data is fragmented, incomplete, or inconsistent. This necessitates sophisticated data cleansing and enrichment solutions to improve operational efficiency, predictive maintenance capabilities, and informed decision-making. The market's expansion is fueled by the adoption of Industry 4.0 technologies, including IoT sensors and connected devices, generating massive datasets requiring rigorous cleaning and enrichment processes. Furthermore, regulatory compliance pressures and the need for improved supply chain visibility are contributing to strong market demand. We estimate the 2025 market size to be $2.5 billion, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is primarily driven by the Chemical, Oil & Gas, and Pharmaceutical industries' increasing reliance on data-driven insights for optimizing operations and reducing downtime. Significant regional variations exist, with North America and Europe currently holding the largest market shares, but rapid growth is anticipated in the Asia-Pacific region due to the increasing industrialization and digitalization initiatives underway.

The market segmentation by application reveals a diverse landscape. The Chemical and Oil & Gas industries are early adopters, followed closely by Pharmaceuticals, leveraging data cleansing and enrichment to improve safety, comply with regulations, and optimize asset management. The Mining and Transportation sectors are also rapidly adopting these services to enhance operational efficiency and predictive maintenance. Within the types of services offered, data cleansing represents a larger share currently, focusing on identifying and removing inconsistencies and inaccuracies. However, data enrichment, which involves augmenting existing data with external sources to improve its completeness and context, is experiencing accelerated growth due to its capacity to unlock deeper insights. While several established players operate in the market, such as Enventure, Sphera, and OptimizeMRO, the landscape is also characterized by numerous smaller, specialized service providers, indicative of a competitive and dynamic market structure. The presence of regional players further suggests opportunities for both consolidation and expansion in the coming years.
https://creativecommons.org/publicdomain/zero/1.0/
This notebook focuses on cleaning and exploring a raw sales dataset provided by a local fashion brand. I performed:
- Data cleaning (nulls, types, duplicates)
- EDA (distribution, correlation)
- Visualizations using Matplotlib, Seaborn, and Plotly
This dataset was provided by a fashion retail company and contains raw sales data used for cleaning, exploration, and visualization.
File Name: Train_csv.py.csv
Number of Rows: 10,000 (approx.)
Number of Columns: 12
File Format: CSV
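A minimal sketch of the cleaning-plus-EDA workflow described above. The file name comes from this card, but the 'Sales' column is a hypothetical placeholder, since the actual column names are not listed here.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# File name from the dataset card; the 'Sales' column is a hypothetical example.
sales = pd.read_csv("Train_csv.py.csv")

# Cleaning: exact duplicates out, rows with a missing target out.
sales = sales.drop_duplicates().dropna(subset=["Sales"])

# EDA: distribution of the target and correlations among numeric columns.
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
sns.histplot(sales["Sales"], kde=True, ax=axes[0])
sns.heatmap(sales.select_dtypes("number").corr(), annot=True, ax=axes[1])
plt.tight_layout()
plt.show()
```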
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.

Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
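As a companion to the strategies above, here is a minimal Python sketch on synthetic data contrasting mean imputation with regression imputation and its stochastic variant; all names and numbers are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: income depends loosely on age; names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(20, 70, 200).astype(float)})
df["income"] = 1_000 * df["age"] + rng.normal(0, 8_000, 200)
df.loc[rng.choice(200, 30, replace=False), "income"] = np.nan  # inject missingness

# Mean imputation: simple, but shrinks variance and weakens correlations.
df["income_mean_imp"] = df["income"].fillna(df["income"].mean())

# Regression imputation: predict missing incomes from the observed covariate.
obs = df["income"].notna()
model = LinearRegression().fit(df.loc[obs, ["age"]], df.loc[obs, "income"])
df["income_reg_imp"] = df["income"]
pred = model.predict(df.loc[~obs, ["age"]])

# Stochastic variant: add residual noise so imputed values keep realistic spread.
resid_sd = (df.loc[obs, "income"] - model.predict(df.loc[obs, ["age"]])).std()
df.loc[~obs, "income_reg_imp"] = pred + rng.normal(0, resid_sd, pred.shape[0])
```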
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
An unclean copy of my GoodReads dataset (as of 2024/02/11) in CSV format with 406 entries.
Data types included are integers, floats, strings, date/time and booleans (both in TRUE/FALSE and 0/1 formats).
This is a good dataset to practice cleaning and analysing as it contains missing values, inconsistent formats and outliers.
Disclaimer: Since GoodReads notifies you when there are duplicate entries, which meant I had no duplicate entries, I asked an AI to add 20 random duplicate entries to the data set for the purpose of this project.
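For instance, a first cleaning pass in pandas might normalize the mixed boolean formats and strip the injected duplicates. The file and column names below ('goodreads_export.csv', 'read', 'date_added') are assumptions for illustration.

```python
import pandas as pd

# Hypothetical file and column names ('read', 'date_added') for the export above.
books = pd.read_csv("goodreads_export.csv")

# Booleans appear both as TRUE/FALSE and as 0/1: normalize to one nullable dtype.
bool_map = {"TRUE": True, "FALSE": False, "1": True, "0": False, 1: True, 0: False}
books["read"] = books["read"].map(bool_map).astype("boolean")

# The injected duplicates are exact copies, so a plain drop_duplicates suffices.
before = len(books)
books = books.drop_duplicates()
print(f"removed {before - len(books)} duplicate rows")

# Dates may arrive in mixed formats; coerce unparseable entries to NaT for review.
books["date_added"] = pd.to_datetime(books["date_added"], errors="coerce")
```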
https://www.datainsightsmarket.com/privacy-policy
The data cleansing tools market is experiencing robust growth, driven by the escalating volume and complexity of data across various sectors. The increasing need for accurate and reliable data for decision-making, coupled with stringent data privacy regulations (like GDPR and CCPA), fuels demand for sophisticated data cleansing solutions. Businesses, regardless of size, are recognizing the critical role of data quality in enhancing operational efficiency, improving customer experiences, and gaining a competitive edge. The market is segmented by application (agencies, large enterprises, SMEs, personal use), deployment type (cloud, SaaS, web, installed, API integration), and geography, reflecting the diverse needs and technological preferences of users. While the cloud and SaaS models are witnessing rapid adoption due to scalability and cost-effectiveness, on-premise solutions remain relevant for organizations with stringent security requirements.

The historical period (2019-2024) showed substantial growth, and this trajectory is projected to continue throughout the forecast period (2025-2033). Specific growth rates will depend on technological advancements, economic conditions, and regulatory changes. Competition is fierce, with established players like IBM, SAS, and SAP alongside innovative startups continuously improving their offerings. The market's future depends on factors such as the evolution of AI and machine learning capabilities within data cleansing tools, the increasing demand for automated solutions, and the ongoing need to address emerging data privacy challenges.

The projected Compound Annual Growth Rate (CAGR) suggests a healthy expansion of the market. While precise figures are not provided, a realistic estimate based on industry trends places the market size at approximately $15 billion in 2025. This is based on a combination of existing market reports and understanding of the growth of related fields (such as data analytics and business intelligence). This substantial market value is further segmented across the specified geographic regions. North America and Europe currently dominate, but the Asia-Pacific region is expected to exhibit significant growth potential driven by increasing digitalization and adoption of data-driven strategies. The restraints on market growth largely involve challenges related to data integration complexity, cost of implementation for smaller businesses, and the skills gap in data management expertise. However, these are being countered by the emergence of user-friendly tools and increased investment in data literacy training.
Access to information and communication technology tools is one of the main inputs for achieving social development and economic change in Palestinian society, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, and within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics in providing official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey for information and communications technology for the year 2019. The main objective of this report is to present the trends of accessing and using information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.
Palestine, West Bank, Gaza strip
Household, Individual
All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.
Sample survey data [ssd]
Sampling Frame The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units, with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.
Sample size The estimated sample size is 8,040 households.
Sample Design The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages: Stage 1: selection of a stratified sample of 536 enumeration areas with the PPS method. Stage 2: selection of a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage 3: selection of one person from the (10 years and above) age group at random using Kish tables.
Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, refugee camps).
Computer Assisted Personal Interview [capi]
Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.
Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.
Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.
Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once by project management before the training course was conducted; notes and modifications were reflected in the program by the Data Processing Department, ensuring it was free of errors before going to the field.
Using PC-tablet devices reduced the data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.
In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and the same database as the PC-tablet devices.
Data Cleaning After the completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS, extracting and correcting errors and discrepancies to prepare clean, accurate data ready for tabulation and publishing.
Tabulation After data checking and cleaning were finalized, tables were extracted according to the prepared list of tables.
The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.
Sampling Errors The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators; results can be disseminated without problems at the national level and at the level of the West Bank and Gaza Strip.
Non-Sampling Errors Non-sampling errors are possible at all stages of the project, during data collection or processing. These are referred to as non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.
The implementation of the survey encountered non-response; households not present at home during the fieldwork visit accounted for the largest share of non-response cases. The total non-response rate reached 17.5%. The refusal rate reached 2.9%, which is relatively low compared to other household surveys conducted by PCBS, likely because the survey questionnaire is clear.
This study examines various dimensions of primary health care delivery in Uganda, using a baseline survey of public and private dispensaries, the most common lower level health facilities in the country.
The survey was designed and implemented by the World Bank in collaboration with the Makerere Institute for Social Research and the Ugandan Ministries of Health and of Finance, Planning and Economic Development. It was carried out in October - December 2000 and covered 155 local health facilities and seven district administrations in ten districts. In addition, 1617 patients exiting health facilities were interviewed. Three types of dispensaries (both with and without maternity units) were included: those run by the government, by private for-profit providers, and by private nonprofit providers, mainly religious.
This research is a Quantitative Service Delivery Survey (QSDS). It collected microlevel data on service provision and analyzed health service delivery from a public expenditure perspective with a view to informing expenditure and budget decision-making, as well as sector policy.
Objectives of the study included:
1) Measuring and explaining the variation in cost-efficiency across health units in Uganda, with a focus on the flow and use of resources at the facility level;
2) Diagnosing problems with facility performance, including the extent of drug leakage, as well as staff performance and availability;
3) Providing information on pricing and user fee policies and assessing the types of service actually provided;
4) Shedding light on the quality of service across the three categories of service provider - government, for-profit, and nonprofit;
5) Examining the patterns of remuneration, pay structure, and oversight and monitoring and their effects on health unit performance;
6) Assessing the private-public partnership, particularly the program of financial aid to nonprofits.
The study districts were Mpigi, Mukono, and Masaka in the central region; Mbale, Iganga, and Soroti in the east; Arua and Apac in the north; and Mbarara and Bushenyi in the west.
The survey covered government, for-profit and nonprofit private dispensaries with or without maternity units in ten Ugandan districts.
Sample survey data [ssd]
The survey covered government, for-profit and nonprofit private dispensaries with or without maternity units in ten Ugandan districts.
The sample design was governed by three principles. First, to ensure a degree of homogeneity across sampled facilities, attention was restricted to dispensaries, with and without maternity units (that is, to the health center III level). Second, subject to security constraints, the sample was intended to capture regional differences. Finally, the sample had to include facilities in the main ownership categories: government, private for-profit, and private nonprofit (religious organizations and NGOs). The sample of government and nonprofit facilities was based on the Ministry of Health facility register for 1999. Since no nationwide census of for-profit facilities was available, these facilities were chosen by asking sampled government facilities to identify the closest private dispensary.
Of the 155 health facilities surveyed, 81 were government facilities, 30 were private for-profit facilities, and 44 were nonprofit facilities. An exit poll of clients covered 1,617 individuals.
The final sample consisted of 155 primary health care facilities drawn from ten districts in the central, eastern, northern, and western regions of the country. It included government, private for-profit, and private nonprofit facilities. The nonprofit sector includes facilities owned and operated by religious organizations and NGOs. Approximately one third of the surveyed facilities were dispensaries without maternity units; the rest provided maternity care. The facilities varied considerably in size, from units run by a single individual to facilities with as many as 19 staff members.
Ministry of Health facility register for 1999 was used to design the sampling frame. Ten districts were randomly selected. From the selected districts, a sample of government and private nonprofit facilities and a reserve list of replacement facilities were randomly drawn. Because of the unreliability of the register for private for-profit facilities, it was decided that for-profit facilities would be identified on the basis of information from the government facilities sampled. The administrative records for facilities in the original sample were first reviewed at the district headquarters, where some facilities that did not meet selection criteria and data collection requirements were dropped from the sample. These were replaced by facilities from the reserve list. Overall, 30 facilities were replaced.
The sample was designed in such a way that the proportion of facilities drawn from different regions and ownership categories broadly mirrors that of the universe of facilities. Because no nationwide census of for-profit health facilities is available, it is difficult to assess the extent to which the sample is representative of this category. A census of health care facilities in selected districts, carried out in the context of the Delivery of Improved Services for Health (DISH) project supported by the U.S. Agency for International Development (USAID), suggests that about 63 percent of all facilities operate on a for-profit basis, while government and nonprofit providers run 26 and 11 percent of facilities, respectively. This would suggest an undersampling of private providers in the survey. It is not clear, however, whether the DISH districts are representative of other districts in Uganda in terms of the market for health care.
For the exit poll, 10 interviews per facility were carried out in approximately 85 percent of the facilities. In the remaining facilities the target of 10 interviews was not met, as a result of low activity levels.
In the first stage in the sampling process, eight districts (out of 45) had to be dropped from the sample frame due to security concerns. These districts were Bundibugyo, Gulu, Kabarole, Kasese, Kibaale, Kitgum, Kotido, and Moroto.
Face-to-face [f2f]
The following survey instruments are available; each is described below.
The survey collected data at three levels: district administration, health facility, and client. In this way it was possible to capture central elements of the relationships between the provider organization, the frontline facility, and the user. In addition, comparison of data from different levels (triangulation) permitted cross-validation of information.
At the district level, a District Health Team Questionnaire was administered to the district director of health services (DDHS), who was interviewed on the role of the DDHS office in health service delivery. Specifically, the questionnaire collected data on health infrastructure, staff training, support and supervision arrangements, and sources of financing.
The District Facility Data Sheet was used at the district level to collect more detailed information on the sampled health units for fiscal 1999-2000, including data on staffing and the related salary structures, vaccine supplies and immunization activity, and basic and supplementary supplies of drugs to the facilities. In addition, patient data, including monthly returns from facilities on total numbers of outpatients, inpatients, immunizations, and deliveries, were reviewed for the period April-June 2000.
At the facility level, the Uganda Health Facility Survey Questionnaire collected a broad range of information related to the facility and its activities. The questionnaire, which was administered to the in-charge, covered characteristics of the facility (location, type, level, ownership, catchment area, organization, and services); inputs (staff, drugs, vaccines, medical and nonmedical consumables, and capital inputs); outputs (facility utilization and referrals); financing (user charges, cost of services by category, expenditures, and financial and in-kind support); and institutional support (supervision, reporting, performance assessment, and procurement). Each health facility questionnaire was supplemented by a Facility Data Sheet (FDS). The FDS was designed to obtain data from the health unit records on staffing and the related salary structure; daily patient records for fiscal 1999-2000; the type of patients using the facility; vaccinations offered; and drug supply and use at the facility.
Finally, at the facility level, an exit poll was used to interview about 10 patients per facility on the cost of treatment, drugs received, perceived quality of services, and reasons for using that unit instead of alternative sources of health care.
Detailed information about data editing procedures is available in "Data Cleaning Guide for PETS/QSDS Surveys" in external resources.
STATA cleaning do-files and the data quality reports on the datasets can also be found in external resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started

This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the meaning of research texts, and is made available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: the list of authors of the paper
2. Title: the title of the paper
3. Abstract: the abstract of the paper
4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing

Step 1: Downloading of the Data Online
The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents were imported into R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use 'structured abstracts': abstracts divided into sections with distinct headings such as introduction, aim, objective, method, result, and conclusion. The tool used for extracting abstracts concatenates the section headings with the first word of the section, producing words such as ConclusionHigher and ConclusionsRT. Such words were detected by sampling medicine-related publications with human intervention, and each detected concatenated word was split into two words. For instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. (A minimal code sketch of this repair follows; the full list of section headings continues below.)
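A minimal regex sketch of the repair described in Step 4, covering only a subset of the headings and using a simplifying capital-letter heuristic; it is not the project's original code.

```python
import re

# A subset of the section headings listed below; longest variants first so that
# e.g. 'Conclusions' is tried before 'Conclusion'.
HEADINGS = ["Conclusions", "Conclusion", "Results", "Result", "Findings",
            "Finding", "Methods", "Method", "Objectives", "Objective",
            "Background", "Introduction", "Discussion", "Hypothesis"]

# Split a heading fused to the next word, e.g. 'ConclusionHigher' -> 'Conclusion Higher'.
# The capital-letter lookahead is a simplifying heuristic.
pattern = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")

def split_concatenated_headings(text: str) -> str:
    return pattern.sub(r"\1 ", text)

print(split_concatenated_headings("ConclusionHigher doses were effective."))
# -> Conclusion Higher doses were effective.
```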
The section headings in such abstracts are listed below: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy.

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as the Microsoft Word 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.

Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can include a footer below the text of the abstract containing a copyright notice, permission policy, journal name, licence, author's rights or conference name. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies, identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step left some abstracts with fewer words than our minimum length criterion (30 words). 474 such texts were removed.

Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.