CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dissertation_demo.zip contains the base code and demonstrations for the dissertation: A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning. Each chapter has a demo folder demonstrating provenance queries or tools. The Airbnb dataset used for demonstration and simulation is not included in this demo but can be accessed directly from the reference website. Updates to the demonstrations and examples can be found online at: https://github.com/nikolausn/dissertation_demo
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to gain valuable insights from their data more efficiently and effectively, leading to improved decision-making and operational efficiency. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. These technologies offer increased flexibility, scalability, and ease of deployment, making it simpler for businesses to implement and manage their data science initiatives. However, the market is not without challenges. Data privacy and security remain critical concerns, as the use of data science platforms involves handling large volumes of sensitive data.
Ensuring security measures and adhering to data protection regulations are essential for companies seeking to capitalize on the opportunities presented by this dynamic market. Companies must navigate these challenges while staying abreast of emerging trends and technologies to remain competitive and deliver value to their customers.
What will be the Size of the Data Science Platform Market during the forecast period?
The market encompasses a range of software applications that facilitate various stages of the data science workflow, from data acquisition and preprocessing to machine learning model development, training, and distribution. This market is driven by the increasing demand for data exploration and analysis across industries, fueled by the proliferation of machine data from IoT devices and the availability of big data from various sources, including multimedia, business, and consumer data. Data scientists require comprehensive tools to manage the complete life cycle of their projects, from data preparation and cleaning to visualization and modeling. Cloud-based solutions have gained significant traction due to their flexibility and scalability, enabling users to process and analyze large volumes of unstructured and structured data using relational databases and artificial intelligence (AI) and machine learning (ML) techniques.
The market is expected to grow substantially due to the rising adoption of ML models and the need for efficient model development, training, and deployment. Preprocessing, data cleaning, and model distribution are critical components of this market, ensuring the accuracy and reliability of ML models and their seamless integration into various applications. Overall, the market is a dynamic and evolving landscape, offering numerous opportunities for businesses to leverage AI and ML technologies for data-driven insights and decision-making.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment
On-premises
Cloud
Component
Platform
Services
End-user
BFSI
Retail and e-commerce
Manufacturing
Media and entertainment
Others
Sector
Large enterprises
SMEs
Application
Data Preparation
Data Visualization
Machine Learning
Predictive Analytics
Data Governance
Others
Geography
North America
US
Canada
Europe
France
Germany
UK
APAC
China
India
Japan
South America
Brazil
Middle East and Africa
UAE
Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In today's data-driven business landscape, organizations are continually seeking innovative solutions to manage and leverage their structured and unstructured data. While cloud-based solutions have gained popularity for their scalability and cost-effectiveness, on-premises deployment remains a preferred choice for enterprise types with stringent data security requirements. On-premises deployment offers several advantages, including quick adaptation to corporate needs, data security, and the elimination of third-party data maintenance and security concerns. With on-premises software, businesses can avoid data transfer over the internet, ensuring data privacy and confidentiality. Moreover, on-premises solutions enable easy and rapid data access, allowing employees to make data-driven decisions in real-time.
However, on-premises deployment comes with its challenges, such as a lack of workforce with the necessary data skills and technical expertise for model development, deployment, and integration. To address thes
What is Account-Based-Marketing? Account-based marketing, or ABM, is a business strategy that focuses your resources on a specific segment of customer accounts. It's all about understanding your customers on a personal level and delivering personalized campaigns that resonate with their needs and preferences.
Why should you use Thomson Data’s Data solution for Account Based Marketing (ABM)? Utilizing Account-based marketing data for your marketing campaign might seem like a long-drawn-out approach, but it is absolutely worth the hassle.
Here are some of the benefits you will definitely be interested in.
Boost Lead Generation: Our database is designed for effective account-based marketing that will boost lead generation. We enable you to target specific accounts, and our data insights will help you tailor the messages according to their needs and pain points.
Retain Email Subscribers: Retaining your subscribers is also a concerning challenge. Using our database for account-based marketing will help you to connect with your clients on a personal level. Enabling you to keep them engaged will encourage these clients to consider your products and services whenever they need one.
Increases profits: As Thomson Data’s records heighten the tone for personalization, you can connect with your prospective clientele on a personal level. When you do it in the right way, it is significantly reflected in your sales figures.
Gain Insights: Get 100+ insights from our data to make better decisions and apply them in your account-based marketing strategies.
Our ABM data can be used to improve your conversions by 3x.
Our account-based marketing data can be used by: 1. B2B companies 2. Sales teams 3. Marketing teams 4. C-suite executives 5. Agencies and service providers 6. Enterprise-level organizations and more.
Thomson Data is perfect for ABM and will certainly help you run campaigns that target customer acquisition as well as customer retention. We provide you an access to the complete data solution to help you connect and impress your target audience.
Send us a request to know more details about our Account based marketing data and we will be happy to assist you.
Access to information and communication technology (ICT) tools is one of the main inputs for achieving social development and economic change in Palestinian society, given the impact of the ICT revolution that has become a feature of this era. Therefore, as part of its efforts to provide official Palestinian statistics on various areas of life for the Palestinian community, the Palestinian Central Bureau of Statistics (PCBS) implemented the household survey on information and communications technology for the year 2023. The main objective of this report is to present the trends in access to and use of information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.
Palestine, West Bank, Gaza strip
Household, Individual
All Palestinian households and individuals (10 years and above) whose usual place of residence in 2023 was in the state of Palestine.
Sample survey data [ssd]
Sampling Frame The sampling frame consists of the master sample of enumeration areas, which were enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.
Sample Size The sample size is 8,040 households.
Sampling Design The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages: Stage (1): Selection of a stratified sample of 536 enumeration areas using the PPS method. Stage (2): Selection of a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage (3): Random selection of one person from the 10 years and above age group using Kish tables.
Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, camps).
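The three-stage design above can be sketched in Python. This is only an illustration under stated assumptions: the frame of enumeration areas is made up, stratification is omitted, and systematic PPS is used as one common way to implement PPS selection (the source does not say which PPS variant was applied). The function names are hypothetical, and stage 3 (Kish-table selection of one person) is not modeled.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n unit indices with probability proportional to size,
    using systematic selection along the cumulated sizes."""
    total = sum(sizes)
    step = total / n
    start = rng.random() * step  # random start in [0, step)
    chosen, cum, idx = [], 0.0, 0
    for k in range(n):
        target = start + k * step
        # Advance until the cumulated size interval contains the target.
        while idx < len(sizes) - 1 and cum + sizes[idx] <= target:
            cum += sizes[idx]
            idx += 1
        chosen.append(idx)
    return chosen


def draw_sample(ea_sizes, n_eas=536, hh_per_ea=15, seed=0):
    """Stage 1: PPS selection of enumeration areas (EAs).
    Stage 2: simple random selection of households within each EA."""
    rng = random.Random(seed)
    sample = []
    for ea in pps_systematic(ea_sizes, n_eas, rng):
        for hh in rng.sample(range(ea_sizes[ea]), hh_per_ea):
            sample.append((ea, hh))
    return sample


# Hypothetical frame: 2,000 EAs with 100 to 200 households each.
frame_rng = random.Random(42)
ea_sizes = [frame_rng.randint(100, 200) for _ in range(2000)]

sample = draw_sample(ea_sizes)
print(len(sample))  # 536 EAs x 15 households = 8040
```

The household count matches the stated sample size: 536 enumeration areas times 15 households gives 8,040.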
Computer Assisted Personal Interview [capi]
Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.
Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.
Section III: Data on Individuals (10 years and above) about computer use, access to the Internet, possession of a mobile phone, information threats, and E-commerce.
Field Editing and Supervising
• Data collection and coordination were carried out in the field according to the pre-prepared plan, with instructions, models and tools available for fieldwork. • Auditing on the PC-tablet was done by building all automated and office audit rules into the program, to cover all the required controls according to the specified criteria. • Because of the special circumstances of Jerusalem (J1), data there were collected using a paper questionnaire. The supervisor then verified the questionnaire in a formal and technical manner according to the pre-prepared audit rules. • Fieldwork visits were carried out by the project coordinator, supervisors and project management to check edited questionnaires and the performance of fieldworkers.
Data Processing
Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once by the project management before the training course was conducted, and the notes and modifications were incorporated into the program by the Data Processing Department, which ensured that it was free of errors before going to the field.
Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.
In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.
Data Cleaning After the completion of data entry and audit phase, data is cleaned by conducting internal tests for the outlier answers and comprehensive audit rules through using SPSS program to extract and modify errors and discrepancies to prepare clean and accurate data ready for tabulation and publishing.
The response rate reached 83.7%.
Sampling Errors The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators; results can be disseminated at the national level and at the level of the West Bank and Gaza Strip without issue.
Non-Sampling Errors Non-sampling errors are possible at all stages of the project, during data collection or processing. These are classified as non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.
The implementation of the survey encountered non-response, where the case "household was not present at home" during the fieldwork visit accounted for the highest share of non-response cases. The total non-response rate reached 16.3%.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This CNVVE Dataset contains clean audio samples encompassing six distinct classes of voice expressions, namely “Uh-huh” or “mm-hmm”, “Uh-uh” or “mm-mm”, “Hush” or “Shh”, “Psst”, “Ahem”, and continuous humming, e.g., “hmmm.” Audio samples of each class are found in the respective folders. These audio samples have undergone a thorough cleaning process. The raw samples are published at https://doi.org/10.18419/darus-3897. Initially, we applied the Google WebRTC voice activity detection (VAD) algorithm to the given audio files to remove noise or silence from the collected voice signals. The intensity, which can take a value between "1" and "3", was set to "2". However, because of variations in the data, some files required additional manual cleaning to address outliers characterized by sharp click sounds (such as those occurring at the end of recordings). The samples were recorded through a dedicated data-collection website that defines the purpose and type of voice data by providing participants with example recordings as well as the expressions’ written equivalent, e.g., “Uh-huh”. Audio recordings were automatically saved in the .wav format and kept anonymous, with a sampling rate of 48 kHz and a bit depth of 32 bits. For more information, please check the paper or contact the authors with any inquiries.
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
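The first-stage arithmetic described above can be sketched in Python. The sketch assumes stratum size is measured in households and uses largest-remainder rounding; neither detail is stated in the source, and the provided R script may differ. The stratum labels and counts are hypothetical.

```python
def allocate_proportional(stratum_sizes, total_units):
    """Allocate total_units across strata in proportion to stratum size,
    rounding with the largest-remainder method so the total is preserved."""
    total = sum(stratum_sizes.values())
    quotas = {s: total_units * n / total for s, n in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = total_units - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


HOUSEHOLDS = 8000   # target sample size (from the text)
PER_EA = 25         # households per enumeration area (from the text)
n_eas = HOUSEHOLDS // PER_EA  # 320 enumeration areas in total

# Hypothetical strata (geo_1 x urban/rural) with made-up household counts.
strata = {
    ("prov_1", "urban"): 1200,
    ("prov_1", "rural"): 800,
    ("prov_2", "urban"): 500,
    ("prov_2", "rural"): 1500,
}

alloc = allocate_proportional(strata, n_eas)
for stratum, count in alloc.items():
    print(stratum, count, "EAs ->", count * PER_EA, "households")
```

Whatever the stratum sizes, the allocation sums to 320 enumeration areas, which at 25 households each reproduces the 8,000-household sample.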
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
Company Datasets for valuable business insights!
Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.
These datasets are sourced from top industry providers, ensuring you have access to high-quality information:
We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:
You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.
Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.
With Oxylabs Datasets, you can count on:
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Instagram data-download example dataset
In this repository you can find a data-set consisting of 11 personal Instagram archives, or Data-Download Packages (DDPs).
How the data was generated
These Instagram accounts were all new and generated by a group of researchers who wanted to understand in detail the structure, and the variety in structure, of these Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week. The participants also communicated intensively with each other so that the data can be used as an example of a network.
The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses and URLs. In addition, the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties. However, only content of so-called `professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.
Obtaining your Instagram DDP
After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.
Instagram then delivered the data in a compressed zip folder with the format username_YYYYMMDD.zip (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.
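A small Python sketch of how the username_YYYYMMDD.zip naming scheme could be parsed. It assumes the date is always the last underscore-separated part, so that usernames containing underscores still parse; the function name and the example file name are hypothetical.

```python
import re
from datetime import date, datetime

# Greedy username, then an underscore and an 8-digit date before ".zip".
DDP_NAME = re.compile(r"^(?P<username>.+)_(?P<date>\d{8})\.zip$")


def parse_ddp_filename(name):
    """Split a DDP file name into (Instagram handle, download date)."""
    match = DDP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a DDP file name: {name!r}")
    downloaded = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group("username"), downloaded


print(parse_ddp_filename("example_user_20200131.zip"))
# ('example_user', datetime.date(2020, 1, 31))
```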
Data cleaning
To comply with the Instagram user agreement, participants shared their full name, phone number and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements such that the DDPs shared here do not contain any personal data of the participants.
How this data-set can be used
This data-set was generated with the intention to evaluate the performance of the de-identification software. We invite other researchers to use this data-set for example to investigate what type of data can be found in Instagram DDPs or to investigate the structure of Instagram DDPs. The packages can also be used for example data-analyses, although no substantive research questions can be answered using this data as the data does not reflect how research subjects behave `in the wild'.
Authors
The data collection is executed by Laura Boeschoten, Ruben van den Goorbergh and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.
Acknowledgments
The researchers would like to thank everyone who participated in this data-generation project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:
ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.
Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.
[...] .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
[...] .csv files.
[...] overlapping_sliding_window_loop.py.
plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line.
[...] .csv files in the results folder.
This part contains three main code blocks:
iii. One for the XGBoost code with correct hyperparameter tuning:
Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
[...] .csv file containing inferred labels.
The data is licensed under CC-BY, the code is licensed under MIT.
The 2003 Agriculture Sample Census was designed to meet the data needs of a wide range of users down to district level including policy makers at local, regional and national levels, rural development agencies, funding institutions, researchers, NGOs, farmer organisations, etc. As a result the dataset is both more numerous in its sample and detailed in its scope compared to previous censuses and surveys. To date this is the most detailed Agricultural Census carried out in Africa.
The census was carried out in order to: · Identify structural changes, if any, in the size of farm household holdings, crop and livestock production, and farm input and implement use. It also seeks to determine whether there are any improvements in rural infrastructure and in the level of agricultural household living conditions; · Provide benchmark data on productivity, production and agricultural practices in relation to policies and interventions promoted by the Ministry of Agriculture and Food Security and other stakeholders; · Establish baseline data for measuring the impact of the high-level objectives of the Agriculture Sector Development Programme (ASDP), the National Strategy for Growth and Reduction of Poverty (NSGRP) and other rural development programs and projects; · Obtain benchmark data that will be used to address specific issues such as food security, rural poverty, gender, agro-processing, marketing, service delivery, etc.
Tanzania Mainland and Zanzibar
Large scale, small scale and community farms.
Census/enumeration data [cen]
The Mainland sample consisted of 3,221 villages. These villages were drawn from the National Master Sample (NMS) developed by the National Bureau of Statistics (NBS) to serve as a national framework for the conduct of household based surveys in the country. The National Master Sample was developed from the 2002 Population and Housing Census. The total Mainland sample was 48,315 agricultural households. In Zanzibar a total of 317 enumeration areas (EAs) were selected and 4,755 agriculture households were covered. Nationwide, all regions and districts were sampled with the exception of three urban districts (two from Mainland and one from Zanzibar).
In both Mainland and Zanzibar, a stratified two stage sample was used. The number of villages/EAs selected for the first stage was based on a probability proportional to the number of villages in each district. In the second stage, 15 households were selected from a list of farming households in each selected Village/EA, using systematic random sampling, with the village chairpersons assisting to locate the selected households.
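The second-stage step can be illustrated with a minimal Python sketch of systematic random sampling. It assumes a fractional sampling interval with a random start, which is one standard variant; the source does not describe the exact procedure, and the household list here is made up.

```python
import random


def systematic_sample(items, n, rng):
    """Pick n items at a fixed interval along the list after a random start."""
    if n > len(items):
        raise ValueError("sample size exceeds list length")
    interval = len(items) / n
    start = rng.random() * interval  # random start in [0, interval)
    picks = []
    for k in range(n):
        position = min(int(start + k * interval), len(items) - 1)
        picks.append(items[position])
    return picks


# Hypothetical listing of farming households in one village.
households = [f"HH-{i:03d}" for i in range(1, 124)]
selected = systematic_sample(households, 15, random.Random(7))
print(selected)

# The reported totals are consistent with 15 households per village/EA.
print(3221 * 15)  # 48315 Mainland households
print(317 * 15)   # 4755 Zanzibar households
```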
Face-to-face [f2f]
The census covered agriculture in detail as well as many other aspects of rural development and was conducted using three different questionnaires: • Small scale questionnaire • Community level questionnaire • Large scale farm questionnaire
The small scale farm questionnaire was the main census instrument and it includes questions related to crop and livestock production and practices; population demographics; access to services, resources and infrastructure; and issues on poverty, gender and subsistence versus profit making production unit.
The community level questionnaire was designed to collect village level data such as access and use of common resources, community tree plantation and seasonal farm gate prices.
The large scale farm questionnaire was administered to large farms either privately or corporately managed.
Questionnaire Design The questionnaires were designed following user meetings to ensure that the questions asked were in line with users' data needs. Several features were incorporated into the design of the questionnaires to increase the accuracy of the data: • Where feasible, all variables were extensively coded to reduce post-enumeration coding error. • The definitions for each section were printed on the opposite page so that the enumerator could easily refer to the instructions whilst interviewing the farmer. • The responses to all questions were placed in boxes printed on the questionnaire, with one box per character. This feature made it possible to use scanning and Intelligent Character Recognition (ICR) technologies for data entry. • Skip patterns were used to reduce unnecessary and incorrect coding of sections which do not apply to the respondent. • Each section was clearly numbered, which facilitated the use of skip patterns and provided a reference for data type coding for the programming of CSPro, SPSS and the dissemination applications.
Data processing consisted of the following processes: · Data entry · Data structure formatting · Batch validation · Tabulation
Data Entry Scanning and ICR data capture technology for the small holder questionnaire were used on the Mainland. This not only increased the speed of data entry, it also increased the accuracy due to the reduction of keystroke errors. Interactive validation routines were incorporated into the ICR software to track errors during the verification process. The scanning operation was so successful that it is highly recommended for adoption in future censuses/surveys. In Zanzibar all data was entered manually using CSPro.
Prior to scanning, all questionnaires underwent a manual cleaning exercise. This involved checking that the questionnaire had a full set of pages, correct identification and good handwriting. A score was given to each questionnaire based on the legibility and the completeness of enumeration. This score will be used to assess the quality of enumeration and supervision in order to select the best field staff for future censuses/surveys.
CSPro was used for data entry of all Large Scale Farm and community based questionnaires due to the relatively small number of questionnaires. It was also used to enter data from the 2,880 small holder questionnaires that were rejected by the ICR extraction application.
Data Structure Formatting A program was developed in Visual Basic to automatically alter the structure of the output from the scanning/extraction process in order to harmonise it with the manually entered data. The program automatically checked and changed the number of digits for each variable, the record type code, the number of questionnaires in the village, and the consistency of the Village ID Code, and saved the data of each village in a file named after the village code.
Batch Validation A batch validation program was developed in order to identify inconsistencies within a questionnaire. This is in addition to the interactive validation during the ICR extraction process. The procedures varied from simple range checking within each variable to the more complex checking between variables. It took six months to screen, edit and validate the data from the smallholder questionnaires. After the long process of data cleaning, tabulations were prepared based on a pre-designed tabulation plan.
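The two kinds of checks mentioned above (range checks within a variable and consistency checks between variables) might look like the following Python sketch. The field names, limits and the cross-variable rule are hypothetical examples, not the census's actual validation rules.

```python
# Hypothetical range rules: field name -> (minimum, maximum).
RANGE_RULES = {
    "household_size": (1, 50),
    "area_planted_ha": (0.0, 500.0),
    "area_harvested_ha": (0.0, 500.0),
}


def validate(record):
    """Return a list of error messages for one questionnaire record."""
    errors = []
    # Simple range check within each variable.
    for field, (low, high) in RANGE_RULES.items():
        value = record.get(field)
        if value is None or not low <= value <= high:
            errors.append(f"{field}: {value!r} outside [{low}, {high}]")
    # Check between variables: harvested area cannot exceed planted area.
    planted = record.get("area_planted_ha")
    harvested = record.get("area_harvested_ha")
    if planted is not None and harvested is not None and harvested > planted:
        errors.append("area_harvested_ha exceeds area_planted_ha")
    return errors


batch = [
    {"household_size": 6, "area_planted_ha": 2.5, "area_harvested_ha": 2.0},
    {"household_size": 0, "area_planted_ha": 1.0, "area_harvested_ha": 1.5},
]
for number, record in enumerate(batch, start=1):
    print(number, validate(record) or "OK")
```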
Tabulations Statistical Package for Social Sciences (SPSS) was used to produce the Census tabulations and Microsoft Excel was used to organize the tables and compute additional indicators. Excel was also used to produce charts while ArcView and Freehand were used for the maps.
Analysis and Report Preparation The analysis in this report focuses on regional comparisons, time series and national production estimates. Microsoft Excel was used to produce charts; ArcView and Freehand were used for maps, whereas Microsoft Word was used to compile the report.
Data Quality A great deal of emphasis was placed on data quality throughout the whole exercise from planning, questionnaire design, training, supervision, data entry, validation and cleaning/editing. As a result of this, it is believed that the census is highly accurate and representative of what was experienced at field level during the Census year. With very few exceptions, the variables in the questionnaire are within the norms for Tanzania and they follow expected time series trends when compared to historical data. Standard Errors and Coefficients of Variation for the main variables are presented in the Technical Report (Volume I).
Sampling errors are reported on pages 21-22 of the Technical Report for the Agriculture Sample Census Survey 2002-2003.
The National Labor Force Survey (SAKERNAS) is designed to observe the general situation of the workforce and to determine whether the workforce structure has changed between enumeration periods. Since the survey was initiated in 1976, it has undergone a series of changes affecting its coverage, the frequency of enumeration, the number of households sampled and the type of information collected. It is the largest and most representative source of employment data in Indonesia. For each selected household, general information about each household member was collected, including name, relationship to the head of household, sex, and age. Household members aged 10 years and over were then asked about their marital status, education and employment.
SAKERNAS aims to gather information that meets three objectives: 1. Employment by education, working hours, industrial classification and employment status; 2. Unemployment and underemployment by different characteristics and efforts in looking for work; 3. Working-age population not in the labor force (e.g. attending school, housekeeping and others).
The data for the quarterly SAKERNAS gathered in 1989 covered all provinces in Indonesia, with 65,440 households in both rural and urban areas, and are representative down to the provincial level. The main household data are taken from the core questionnaire SAK89-AK.
National coverage* including urban and rural areas, representative down to the provincial level.
*) Although the survey covers all of Indonesia, in some circumstances not all provinces were covered. For example, in 2000 the Province of Maluku was excluded from SAKERNAS because of horizontal conflicts there. The separation of East Timor from Indonesia in 1999 also changed the scope of SAKERNAS in subsequent years. Later, as a consequence of the expansion of regional autonomy, the proportion of samples per province also changed, as in 2006, when the number of provinces had reached 33. However, these differences affect only the scope/level covered, not the pattern. On the other hand, changes in methodology (including sample size) over time are likely to affect the outcome. For example, in 2000 and 2001, when the sample size was only 32,384 and 34,176 respectively, the data were representative only down to the island level (the sample was too small to be representative at the provincial level).
Individual
The survey covered all de jure household members (usual residents) aged 10 years and over who were resident in the household. However, Diplomatic Corps households, households in specific enumeration areas and specific households in regular enumeration areas were not selected for the sample.
Sample survey data
The quarterly SAKERNAS 1989 was implemented across the whole territory of the Republic of Indonesia, with a total sample of about 65,440 households in both rural and urban areas, representative down to the provincial level. Diplomatic Corps households, households in specific enumeration areas and specific households in regular enumeration areas were not selected for the sample. The dataset combines the sample data from the four quarterly rounds of SAKERNAS in 1989, i.e. quarters I, II, III, and IV.
The implementation of SAKERNAS 1989 included samples from previous enumeration activities (rotation method). The sampling method* was the same for SAKERNAS 1986 to 1989: some of the households sampled in the previous quarter were re-enumerated, and others were drawn from households already selected in other earlier quarters, so no new households needed to be enrolled. The procedure for selecting households is described in more detail in the enumerators'/supervisors' manual.
*) The sampling method varied between years. For example, SAKERNAS 1986-1989 used the rotation method, in which most of the households selected in one period were re-selected in the following period. This was common in the quarterly SAKERNAS of that period. Other periods often used a multi-stage sampling method (two or three stages, depending on whether the sub-block census/segment group was included), or a combination of multi-stage sampling and rotation (e.g. SAKERNAS 2006-2010).
Face-to-face
In SAKERNAS, the questionnaire is designed to be simple and concise. This is expected to help respondents understand the aim of each survey question and to reduce memory lapses and loss of interest during data collection. Furthermore, the design of the SAKERNAS questionnaire is kept stable in order to maintain data comparability.
A household questionnaire was administered in each selected household. It collected general information on household members, including name, relationship to the head of the household, sex and age. Household members aged 10 years and over were then asked about their marital status, education and occupation.
Data processing in SAKERNAS goes through the following stages: - Batching - Editing - Coding - Data Entry - Validation - Tabulation
Sampling error results are presented at the end of the publications The State of Labor Force in Indonesia and The State of Workers in Indonesia.
This study examines various dimensions of primary health care delivery in Uganda, using a baseline survey of public and private dispensaries, the most common lower level health facilities in the country.
The survey was designed and implemented by the World Bank in collaboration with the Makerere Institute for Social Research and the Ugandan Ministries of Health and of Finance, Planning and Economic Development. It was carried out in October - December 2000 and covered 155 local health facilities and seven district administrations in ten districts. In addition, 1617 patients exiting health facilities were interviewed. Three types of dispensaries (both with and without maternity units) were included: those run by the government, by private for-profit providers, and by private nonprofit providers, mainly religious.
This research is a Quantitative Service Delivery Survey (QSDS). It collected microlevel data on service provision and analyzed health service delivery from a public expenditure perspective with a view to informing expenditure and budget decision-making, as well as sector policy.
Objectives of the study included:
1) Measuring and explaining the variation in cost-efficiency across health units in Uganda, with a focus on the flow and use of resources at the facility level;
2) Diagnosing problems with facility performance, including the extent of drug leakage, as well as staff performance and availability;
3) Providing information on pricing and user fee policies and assessing the types of service actually provided;
4) Shedding light on the quality of service across the three categories of service provider - government, for-profit, and nonprofit;
5) Examining the patterns of remuneration, pay structure, and oversight and monitoring and their effects on health unit performance;
6) Assessing the private-public partnership, particularly the program of financial aid to nonprofits.
The study districts were Mpigi, Mukono, and Masaka in the central region; Mbale, Iganga, and Soroti in the east; Arua and Apac in the north; and Mbarara and Bushenyi in the west.
The survey covered government, for-profit and nonprofit private dispensaries with or without maternity units in ten Ugandan districts.
Sample survey data [ssd]
The sample design was governed by three principles. First, to ensure a degree of homogeneity across sampled facilities, attention was restricted to dispensaries, with and without maternity units (that is, to the health center III level). Second, subject to security constraints, the sample was intended to capture regional differences. Finally, the sample had to include facilities in the main ownership categories: government, private for-profit, and private nonprofit (religious organizations and NGOs). The sample of government and nonprofit facilities was based on the Ministry of Health facility register for 1999. Since no nationwide census of for-profit facilities was available, these facilities were chosen by asking sampled government facilities to identify the closest private dispensary.
Of the 155 health facilities surveyed, 81 were government facilities, 30 were private for-profit facilities, and 44 were nonprofit facilities. An exit poll of clients covered 1,617 individuals.
The final sample consisted of 155 primary health care facilities drawn from ten districts in the central, eastern, northern, and western regions of the country. It included government, private for-profit, and private nonprofit facilities. The nonprofit sector includes facilities owned and operated by religious organizations and NGOs. Approximately one third of the surveyed facilities were dispensaries without maternity units; the rest provided maternity care. The facilities varied considerably in size, from units run by a single individual to facilities with as many as 19 staff members.
The Ministry of Health facility register for 1999 was used to design the sampling frame. Ten districts were randomly selected. From the selected districts, a sample of government and private nonprofit facilities and a reserve list of replacement facilities were randomly drawn. Because the register was unreliable for private for-profit facilities, it was decided that for-profit facilities would be identified on the basis of information from the government facilities sampled. The administrative records for facilities in the original sample were first reviewed at the district headquarters, where some facilities that did not meet selection criteria and data collection requirements were dropped from the sample. These were replaced by facilities from the reserve list. Overall, 30 facilities were replaced.
The sample was designed in such a way that the proportion of facilities drawn from different regions and ownership categories broadly mirrors that of the universe of facilities. Because no nationwide census of for-profit health facilities is available, it is difficult to assess the extent to which the sample is representative of this category. A census of health care facilities in selected districts, carried out in the context of the Delivery of Improved Services for Health (DISH) project supported by the U.S. Agency for International Development (USAID), suggests that about 63 percent of all facilities operate on a for-profit basis, while government and nonprofit providers run 26 and 11 percent of facilities, respectively. This would suggest an undersampling of private providers in the survey. It is not clear, however, whether the DISH districts are representative of other districts in Uganda in terms of the market for health care.
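The undersampling point can be checked with simple arithmetic. The sketch below compares the ownership shares of the 155 surveyed facilities (81 government, 30 for-profit, 44 nonprofit, as reported for this survey) with the DISH census percentages quoted above. The variable names are illustrative, and the comparison inherits the caveat that the DISH districts may not be representative.

```python
# Facility counts reported for the survey sample.
sample_counts = {"government": 81, "for-profit": 30, "nonprofit": 44}

# Ownership shares (percent) suggested by the DISH census of selected districts.
dish_share_pct = {"government": 26, "for-profit": 63, "nonprofit": 11}

total = sum(sample_counts.values())

# Share of each ownership category in the survey sample, in percent.
sample_share_pct = {
    owner: 100 * count / total for owner, count in sample_counts.items()
}

# Positive gap: oversampled relative to DISH; negative gap: undersampled.
gap_pct_points = {
    owner: sample_share_pct[owner] - dish_share_pct[owner]
    for owner in sample_counts
}

for owner in sample_counts:
    print(
        f"{owner:>11}: sample {sample_share_pct[owner]:5.1f}%, "
        f"DISH {dish_share_pct[owner]:3d}%, gap {gap_pct_points[owner]:+6.1f}"
    )
```

On these figures the shortfall is concentrated in the for-profit category, while government and nonprofit facilities make up a larger share of the sample than of the DISH census.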
For the exit poll, 10 interviews per facility were carried out in approximately 85 percent of the facilities. In the remaining facilities the target of 10 interviews was not met, as a result of low activity levels.
In the first stage in the sampling process, eight districts (out of 45) had to be dropped from the sample frame due to security concerns. These districts were Bundibugyo, Gulu, Kabarole, Kasese, Kibaale, Kitgum, Kotido, and Moroto.
Face-to-face [f2f]
The following survey instruments are available:
The survey collected data at three levels: district administration, health facility, and client. In this way it was possible to capture central elements of the relationships between the provider organization, the frontline facility, and the user. In addition, comparison of data from different levels (triangulation) permitted cross-validation of information.
At the district level, a District Health Team Questionnaire was administered to the district director of health services (DDHS), who was interviewed on the role of the DDHS office in health service delivery. Specifically, the questionnaire collected data on health infrastructure, staff training, support and supervision arrangements, and sources of financing.
The District Facility Data Sheet was used at the district level to collect more detailed information on the sampled health units for fiscal 1999-2000, including data on staffing and the related salary structures, vaccine supplies and immunization activity, and basic and supplementary supplies of drugs to the facilities. In addition, patient data, including monthly returns from facilities on total numbers of outpatients, inpatients, immunizations, and deliveries, were reviewed for the period April-June 2000.
At the facility level, the Uganda Health Facility Survey Questionnaire collected a broad range of information related to the facility and its activities. The questionnaire, which was administered to the in-charge, covered characteristics of the facility (location, type, level, ownership, catchment area, organization, and services); inputs (staff, drugs, vaccines, medical and nonmedical consumables, and capital inputs); outputs (facility utilization and referrals); financing (user charges, cost of services by category, expenditures, and financial and in-kind support); and institutional support (supervision, reporting, performance assessment, and procurement). Each health facility questionnaire was supplemented by a Facility Data Sheet (FDS). The FDS was designed to obtain data from the health unit records on staffing and the related salary structure; daily patient records for fiscal 1999-2000; the type of patients using the facility; vaccinations offered; and drug supply and use at the facility.
Finally, at the facility level, an exit poll was used to interview about 10 patients per facility on the cost of treatment, drugs received, perceived quality of services, and reasons for using that unit instead of alternative sources of health care.
Detailed information about data editing procedures is available in "Data Cleaning Guide for PETS/QSDS Surveys" in external resources.
Stata cleaning do-files and the data quality reports on the datasets can also be found in external resources.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing customer characteristics and giving early warning of customer churn with machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and can save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was proposed. These four models and four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)) were each used to analyze the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score and other indicators, and that the RF-Adaboost dual-ensemble model performs best. The recall rates of BPNN, RF, Adaboost and the RF-Adaboost dual-ensemble model on positive samples are 79%, 90%, 89% and 93%, respectively; the precision rates are 97%, 99%, 98% and 99%; and the F1 scores are 87%, 95%, 94% and 96%. For the RF-Adaboost dual-ensemble model, the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
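As a quick consistency check on the figures above, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes F1 from the rounded precision and recall values quoted in the abstract. It is not the authors' code, and the dictionary layout is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract, as fractions.
reported = {
    "BPNN": (0.97, 0.79, 0.87),
    "RF": (0.99, 0.90, 0.95),
    "Adaboost": (0.98, 0.89, 0.94),
    "RF-Adaboost": (0.99, 0.93, 0.96),
}

# Recompute F1 from the rounded precision and recall values.
recomputed = {
    name: f1_score(precision, recall)
    for name, (precision, recall, _) in reported.items()
}

for name, (_, _, reported_f1) in reported.items():
    print(f"{name:>12}: recomputed {recomputed[name]:.3f}, reported {reported_f1:.2f}")
```

Because the published precision and recall are rounded to whole percentages, the recomputed values can differ from the reported F1 scores by about one percentage point.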
http://www.gnu.org/licenses/lgpl-3.0.html
The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.
Below is a brief overview of what each CSV file contains:
- Addresses: Practical examples of string manipulation and address data formatting in CSV.
- Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
- Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
- Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
- Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
- De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
- Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
- Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
- Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
- Grades: Education performance data which can be correlated with demographics or study patterns.
- Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
- Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
- Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
- Height and Weight Measurements: Public health research dataset on anthropometric data.
- Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
- Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
- MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
- MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
- TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
- Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
- Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
- Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
- Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
- Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
- Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
- Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.
The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.
The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.
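A minimal sketch of importing such a file with Python's standard library is shown below. The sample text, column names and values are hypothetical stand-ins and not an actual file from this collection. `skipinitialspace=True` is an assumption that some files pad fields with spaces after the commas.

```python
import csv
import io

# Hypothetical sample with a header line; not an actual file from the dataset.
SAMPLE = '''"Name", "Age", "Height (in)"
"Alex", 41, 74
"Bert", 42, 68
'''


def load_rows(text):
    """Parse CSV text with a header line into a list of dicts."""
    # skipinitialspace drops the blank after each comma so quoted fields
    # and numbers are read without leading spaces.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


rows = load_rows(SAMPLE)

# Values arrive as strings, so numeric columns need an explicit conversion.
ages = [int(row["Age"]) for row in rows]
mean_age = sum(ages) / len(ages)
```

For a file on disk, the same reader can be given `open(path, newline="")` in place of the `io.StringIO` object.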
The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.
This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.
US Commercial and Residential Cleaning Services Market Size 2025-2029
The US commercial & residential cleaning services market size is forecast to increase by USD 37.8 billion at a CAGR of 5.9% between 2024 and 2029.
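For readers who want to relate the absolute increase to the growth rate, the sketch below backs out the base-year market size implied by the two quoted figures. It assumes annual compounding over five periods (2024 to 2029). That reading is an assumption and the report's own methodology may differ. The function name is illustrative.

```python
def implied_base(increase, cagr, years):
    """Base value B such that B * (1 + cagr) ** years - B == increase."""
    growth = (1 + cagr) ** years - 1
    return increase / growth


# Figures quoted in the forecast: USD 37.8 billion increase at a 5.9% CAGR.
# Assumption: five annual compounding periods between 2024 and 2029.
base_2024 = implied_base(increase=37.8, cagr=0.059, years=5)
size_2029 = base_2024 + 37.8

print(f"Implied 2024 size: USD {base_2024:.1f} billion")
print(f"Implied 2029 size: USD {size_2029:.1f} billion")
```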
The US Commercial and Residential Cleaning Services Market is experiencing significant growth, driven by the increasing demand for professional cleaning services in both sectors. One key trend is the rising popularity of multifamily dwellings, which present a substantial opportunity for market expansion. Additionally, strategic alliances between industry players are increasingly common, enabling companies to broaden their reach and enhance their offerings. However, the market is not without challenges, most notably the fluctuations in labor wages, which can impact profitability and operational efficiency.
To capitalize on market opportunities and navigate challenges effectively, companies must stay informed of industry trends and adapt to the evolving landscape. By focusing on innovation, strategic partnerships, and cost management, they can differentiate themselves and maintain a competitive edge.
What will be the size of the US Commercial And Residential Cleaning Services Market during the forecast period?
The cleaning services market encompasses both commercial and residential properties, catering to the essential duties of maintaining hygiene, health, and cleanliness. In the US, this market exhibits significant activity, driven by the varying cleaning needs of diverse facility types. Commercial properties, including offices, cleanrooms, and industrial spaces, prioritize general cleaning, deep cleaning, and specialized technology to meet stringent sanitary requirements. Residential properties require equally important cleaning services, focusing on customer experience and trained cleaners. Cleaning methods and techniques continue to evolve, with an emphasis on advanced sanitizing and disinfection processes. Cleaning companies invest in innovative cleaning equipment and supplies to meet the demands of their clients.
Industrial cleaning services ensure the highest cleaning standards in large-scale facilities, while specialized cleaning companies cater to unique needs, such as medical and healthcare facilities. The cleaning services market is a critical component of maintaining a clean and healthy environment, ensuring businesses and homes operate efficiently and effectively. The market's continued growth is a testament to the importance of cleanliness and the ongoing demand for professional cleaning services.
How is this market segmented?
The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Sector
  Commercial
  Residential
Service Type
  Janitorial services
  Carpet and upholstery cleaning services
  Outdoor areas
  Others
Technique
  Traditional techniques
  Eco-friendly techniques
End-User
  Households
  Offices
  Healthcare Facilities
  Retail
Service Mode
  One-Time
  Recurring (Daily, Weekly, Monthly)
  Seasonal
Geography
  North America
    US
By Sector Insights
The commercial segment is estimated to witness significant growth during the forecast period.
The commercial and residential cleaning services market in the US caters to various facility types, including offices, cleanrooms, medical facilities, schools, commercial kitchens, and residential properties. The commercial segment is driven by the need for maintaining hygiene and health in workplaces and healthcare establishments. The cleaning duties for commercial properties involve general cleaning, deep cleaning, sanitizing, and disinfection using industrial-grade equipment and cleaning supplies. The residential segment focuses on household cleaning tools for domestic dwellings, ensuring the quality of cleaning, dependability, and customer experience. The cleaning frequency and intensity vary based on the facility type and cleaning needs.
Trained cleaners employ specialized cleaning techniques and methods to preserve cleanliness and prevent property damage. The effectiveness of cleaning is crucial, and many services offer eco-friendly, or 'green,' cleaning solutions. Sanitizing and disinfection, including electrostatic spray disinfection, are essential for maintaining hygienic conditions in healthcare facilities and commercial kitchens. Bonded and insured cleaning services ensure a reliable and trustworthy cleaning experience for clients.
The Commercial segment was valued at USD 78.20 billion in 2019 and showed a gradual increase during the forecast period.
Market Dynamics
Our researchers analyzed the data with 2024 as the base year,
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CleanPatrick: A Benchmark for Data Cleaning
Welcome to CleanPatrick, the first large-scale benchmark designed for data cleaning in the image domain. Built on the Fitzpatrick17k dermatology dataset, CleanPatrick is a dataset for measuring the performance in detecting three major data quality issues: off-topic samples, near-duplicates, and label errors.
Overview
CleanPatrick consists of dermatological images annotated with over 500,000 binary labels across three data… See the full description on the dataset page: https://huggingface.co/datasets/Digital-Dermatology/CleanPatrick.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The folder "XPS data" contains original and evaluated XPS data (.vms) on a p-GaN sample which was treated at various temperatures and underwent Ar+ irradiation.
Furthermore, the folder "REM Images" contains SEM (REM) images (.tif) and EDX data (.xlsx) of the excessively treated sample.
All images that are published in the main manuscript are collected as .tif files in the folder "images".
This is USDA-ARS data from the publication: "Evaluation of Alternative-Design Cotton Gin Lint Cleaning Machines on Fiber Length Uniformity Index". The study was conducted in 2018 and 2019, with continued sample and data analysis through August 2023. Developing cotton ginning methods that improve the fiber length uniformity index to levels compatible with newer and more efficient spinning technologies would expand market share, increase the demand for cotton products and give U.S. cotton a competitive edge over synthetic fibers. Older studies on lint cleaning machines showed that the most widely used feed mechanism, which places fiber on the cleaning cylinder, damages fiber and reduces uniformity. The present study evaluates how conventional and experimental feed mechanisms affect uniformity. The lint cleaners were used with both saw and roller gin stands. Four diverse cotton cultivars from the Far West, Southwest, and Mid-South were used in the test. The data include ginning process variables, raw seed cotton characteristics, and raw lint High Volume Instrument (HVI), Advanced Fiber Information System (AFIS), and micro-dust and trash analyzer (MDTA3) measurements.
CodeParrot 🦜 Dataset Cleaned
What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.
Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
Deduplication
- Remove exact matches
Filtering
- Average line length < 100
- Maximum line length < 1000
- Alphanumeric characters fraction > 0.25
- Remove auto-generated files (keyword search)
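The cleaning steps listed above can be approximated with a short filter like the one below. This is a sketch under stated assumptions and not the CodeParrot preprocessing code: the keyword list, the choice to search only the first few lines, and the use of a SHA-256 hash for exact-match deduplication are all illustrative.

```python
import hashlib

# Hypothetical keywords for spotting auto-generated files.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def keep_file(text, head_lines=5):
    """Apply the four filtering rules to the contents of one file."""
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= 100:  # average line length < 100
        return False
    if max(lengths) >= 1000:  # maximum line length < 1000
        return False
    alnum_fraction = sum(ch.isalnum() for ch in text) / len(text)
    if alnum_fraction <= 0.25:  # alphanumeric fraction > 0.25
        return False
    # Keyword search, limited here to the first few lines (an assumption).
    head = "\n".join(lines[:head_lines]).lower()
    return not any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def clean(files):
    """Drop exact duplicates, then keep only files passing the filters."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact match of an earlier file
            continue
        seen.add(digest)
        if keep_file(text):
            kept.append(text)
    return kept


files = [
    "import os\nprint(os.getcwd())\n",
    "import os\nprint(os.getcwd())\n",        # exact duplicate
    "# auto-generated by tool\nx = 1\n",      # auto-generated marker
    "x = '" + "a" * 1200 + "'\n",             # overly long line
    "!!!! ???? ---- ++++\n",                  # almost no alphanumerics
]
kept = clean(files)
```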
For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.