Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.
This dataset is suitable for:
- Practicing data cleaning tasks, such as handling missing values and deducing missing information.
- Conducting exploratory data analysis (EDA) to study restaurant sales patterns.
- Feature engineering to create new variables for machine learning tasks.
Column Name | Description | Example Values |
---|---|---|
Order ID | A unique identifier for each order. | ORD_123456 |
Customer ID | A unique identifier for each customer. | CUST_001 |
Category | The category of the purchased item. | Main Dishes, Drinks |
Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None |
Price | The static price of the item. May contain missing values. | 15.0, None |
Quantity | The quantity of the purchased item. May contain missing values. | 1, None |
Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None |
Order Date | The date when the order was placed. Always present. | 2022-01-15 |
Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None |
Data Dirtiness:
- Missing values in key fields (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
- A missing Item can often be deduced from the menu when its Price is present.
- A missing Price can often be deduced from the menu when the Item is present.
- A missing value can often be deduced when both Quantity and Order Total are present.
- When Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).

Menu Categories and Items:
- Starters: e.g., Chicken Melt, French Fries.
- Main Dishes: e.g., Grilled Chicken, Steak.
- Desserts: e.g., Chocolate Cake, Ice Cream.
- Drinks: e.g., Coca Cola, Water.
- Side Dishes: e.g., Mashed Potatoes, Garlic Bread.

Time Range:
- Orders span from January 1, 2022, to December 31, 2023.
Handle Missing Values:
- Deduce a missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
- Deduce a missing Price from Order Total / Quantity if both are available.

Validate Data Consistency:
- Check that recorded order totals match the calculated values (Order Total = Price * Quantity).

Analyze Missing Patterns:
- Examine which fields are missing, how often, and whether the gaps can be recovered from the remaining fields.
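A minimal sketch of the deduction and validation tasks above, assuming the data are read into R with the column names shown in the table (the file name is hypothetical):

```r
library(dplyr)

# Read the transactions; treat "None" as missing and keep the original column names
sales <- read.csv("restaurant_sales_with_dirt.csv",   # hypothetical file name
                  check.names = FALSE,
                  na.strings = c("", "NA", "None"))

sales_clean <- sales %>%
  mutate(
    `Order Total` = coalesce(`Order Total`, Price * Quantity),  # fill total from Price * Quantity
    Quantity      = coalesce(Quantity, `Order Total` / Price),  # fill quantity from the (filled) total
    Price         = coalesce(Price, `Order Total` / Quantity)   # fill price from the (filled) total
  )

# Validate consistency: rows whose recorded total disagrees with Price * Quantity
inconsistent <- sales_clean %>%
  filter(!is.na(Price), !is.na(Quantity), !is.na(`Order Total`),
         abs(`Order Total` - Price * Quantity) > 1e-6)
```

The menu price table below can also serve as a lookup for filling a missing Price from the Item name.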
Category | Item | Price |
---|---|---|
Starters | Chicken Melt | 8.0 |
Starters | French Fries | 4.0 |
Starters | Cheese Fries | 5.0 |
Starters | Sweet Potato Fries | 5.0 |
Starters | Beef Chili | 7.0 |
Starters | Nachos Grande | 10.0 |
Main Dishes | Grilled Chicken | 15.0 |
Main Dishes | Steak | 20.0 |
Main Dishes | Pasta Alfredo | 12.0 |
Main Dishes | Salmon | 18.0 |
Main Dishes | Vegetarian Platter | 14.0 |
Desserts | Chocolate Cake | 6.0 |
Desserts | Ice Cream | 5.0 |
Desserts | Fruit Salad | 4.0 |
Desserts | Cheesecake | 7.0 |
Desserts | Brownie | 6.0 |
Drinks | Coca Cola | 2.5 |
Drinks | Orange Juice | 3.0 |
Drinks ... |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Initial data analysis checklist for data screening in longitudinal studies.
Quadrant provides Insightful, accurate, and reliable mobile location data.
Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.
These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.
We conduct stringent evaluations of data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points, allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors, as well as latency and other integrity variables, to provide more efficient data delivery. The deduplication algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. It scours our data and identifies rows that contain the same combination of these four attributes. Post-identification, it retains a single copy and eliminates the duplicates to ensure our customers only receive complete and unique datasets.
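As an illustration only (this is not Quadrant's proprietary algorithm, and the column names are assumptions), deduplication on those four attributes can be expressed as keeping one row per unique combination:

```r
library(dplyr)

# Toy event feed; column names are assumed for illustration
events <- tibble::tribble(
  ~device_id, ~latitude, ~longitude, ~timestamp,
  "A1",        40.7128,  -74.0060,   "2023-05-01 10:00:00",
  "A1",        40.7128,  -74.0060,   "2023-05-01 10:00:00",  # exact duplicate event
  "A1",        40.7130,  -74.0059,   "2023-05-01 10:05:00"
)

# Retain a single copy of every (device_id, latitude, longitude, timestamp) combination
events_deduped <- events %>%
  distinct(device_id, latitude, longitude, timestamp, .keep_all = TRUE)
```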
We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.
Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.
Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original EPIC-1 data source and documented intermediate data manipulation. These files are provided to ensure a complete audit trail and documentation. They include the original source data as well as files created in the process of cleaning and preparing the datasets found in section I of the dataverse (1. Pooled and Adjusted EPIC Data). These intermediary files document any adjustments in assumptions, currency conversions, and data cleaning processes. Ordinarily, analysis would be done using the datasets in section I; researchers would not need the files in this section unless tracing the origin of the variables back to the original source.

"Adjustments for the EPIC-2 data were conducted with advice and input from the data collection team (EPIC-1). The magnitude of these adjustments is documented in the table attached. These documented adjustments explained the lion's share of the discrepancies, leaving only minor unaccounted differences in the data (Δ range 0% - 1.1%)."

"In addition to using the sampling weights, any extrapolation to achieve nationwide cost estimates for Benin, Ghana, Zambia, and Honduras uses a scale-up factor to take into account facilities that are outside of the sampling frame. For example, after taking into account the sampling weights, the total facility-level delivery cost in the Benin sampling frame (343 facilities) is $2,094,031. To estimate the total facility-level delivery cost in the entire country of Benin (695 facilities), the sample-frame cost estimate is multiplied by 695/343."

"Additional adjustments for the EPIC-2 analysis include the series of decisions for weighting, methods, and data sources. For EPIC-2 analyses, average costs per dose and per DTP3 were calculated as total costs divided by total outputs, representing a funder's perspective. We also report results as a simple average of the site-level cost per output. All estimates were adjusted for survey weighting. In particular, the analyses in EPIC-2 relied exclusively on information from the sample, whereas in some instances EPIC-1 teams were able to strategically leverage other available data sources."
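Purely as an arithmetic illustration of the scale-up factor described in the quoted passage (using only the figures given there):

```r
# Scale the Benin sampling-frame cost estimate up to the national facility count
frame_cost       <- 2094031   # total facility-level delivery cost in the sampling frame (USD)
frame_facilities <- 343       # facilities in the sampling frame
total_facilities <- 695       # facilities in the entire country
frame_cost * total_facilities / frame_facilities   # roughly USD 4.24 million
```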
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset and its documentation contain detailed information about the iTEM Open Database, a harmonized transport dataset of historical values, 1970 to present. It aims to create transparency through two key features:
The iTEM Open Database is composed of individual datasets collected from public sources. Each dataset is downloaded, cleaned, and harmonised to the common region and technology definitions defined by the iTEM consortium (https://transportenergy.org). For each dataset, we describe its name, the web link to the original source, the web link to the cleaning script (in Python), and the variables, and we explain the data cleaning steps in plain English.
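A minimal sketch of what harmonising country-level data to common region definitions can look like (the mappings and values below are made up; the actual iTEM cleaning scripts are in Python and are linked from the database documentation):

```r
library(dplyr)

# Hypothetical raw dataset with country-level values
raw <- tibble::tribble(
  ~country,        ~year, ~value,
  "Germany",        2015,  100,
  "France",         2015,   80,
  "United States",  2015,  250
)

# Hypothetical lookup table mapping countries to common aggregate regions
region_map <- tibble::tribble(
  ~country,        ~region,
  "Germany",       "Western Europe",
  "France",        "Western Europe",
  "United States", "North America"
)

# Harmonise: attach the common region definition and aggregate to it
harmonised <- raw %>%
  left_join(region_map, by = "country") %>%
  group_by(region, year) %>%
  summarise(value = sum(value), .groups = "drop")
```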
Should you find any problems with the dataset, please report them here: https://github.com/transportenergy/database/issues.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary for descriptors on the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects; it also makes available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
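For orientation, here is a minimal sketch of the kind of survey-weighted regression covered in example_1. It is not the tutorial's code: the data are synthetic, and the design-variable names (SDMVPSU, SDMVSTRA, WTMEC2YR) follow common NHANES naming but must be matched to the actual columns in the curated files.

```r
library(survey)

# Synthetic stand-in for a merged, cleaned NHANES module
set.seed(1)
nhanes_df <- data.frame(
  SDMVPSU  = rep(1:2, each = 50),                 # primary sampling units
  SDMVSTRA = rep(1:5, times = 20),                # sampling strata
  WTMEC2YR = runif(100, 5000, 50000),             # survey weights
  RIDAGEYR = sample(20:80, 100, replace = TRUE),  # age
  exposure = rnorm(100),
  outcome  = rnorm(100)
)

# Declare the complex survey design, then fit a survey-weighted linear model
design <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
                    nest = TRUE, data = nhanes_df)
fit <- svyglm(outcome ~ exposure + RIDAGEYR, design = design)
summary(fit)
```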
The main objective of this project is to collect household data for the ongoing assessment and monitoring of the socio-economic impacts of COVID-19 on households and family businesses in Vietnam. The estimated fieldwork period and household sample size for each round are as follows:

Round 1 (June fieldwork): approximately 6,300 households (at least 1,300 minority households)
Round 2 (August fieldwork): approximately 4,000 households (at least 1,000 minority households)
Round 3 (September fieldwork): approximately 4,000 households (at least 1,000 minority households)
Round 4 (December fieldwork): approximately 4,000 households (at least 1,000 minority households)
Round 5: pending discussion
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected within that EA for interview. Of the 15 households, 3 have information collected on both income and expenditure (the large module) as well as many other aspects, while the remaining 12 have information collected on income but not on expenditure (the small module). Estimation based on the large module therefore includes 9,396 households and is representative at the regional and national levels, while the whole sample is representative at the provincial level.
We use the large module to select households for the official VHFPS interviews, with the small-module households held in reserve for replacement. The large module has a sample size of 9,396 households, of which 7,951 have a phone number (cell phone or landline).
After data processing, the final sample size is 6,213 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 1 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 4. Education & Child caring
Section 5A. Employment (main respondent)
Section 5B. Employment (other household member)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, and the supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are advised to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can decide which code best suits that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether it should be recoded into one of the available categories or kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values lying outside the 5th to 95th percentile range, by listening to the interview recordings.
• Perform a final check that the main dataset matches the section-level files; sections where information is collected at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
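For illustration only (these are not the survey team's actual scripts, and the variable names are invented), two of the steps above, duplicate removal and outlier screening, might look like this in R:

```r
library(dplyr)

# Toy household records; variable names are invented
hh <- tibble::tribble(
  ~household_id, ~submission_time,      ~monthly_income,
  "HH001",       "2020-06-10 09:00:00",  5200,
  "HH001",       "2020-06-10 09:00:00",  5200,    # the same form submitted twice
  "HH002",       "2020-06-11 14:30:00", 90000     # candidate outlier
)

# Remove household duplicates where the same form was submitted more than once
hh_dedup <- hh %>%
  distinct(household_id, submission_time, .keep_all = TRUE)

# Flag values outside the 5th-95th percentile range for manual review
bounds <- quantile(hh_dedup$monthly_income, probs = c(0.05, 0.95), na.rm = TRUE)
hh_flagged <- hh_dedup %>%
  mutate(income_outlier = monthly_income < bounds[1] | monthly_income > bounds[2])
```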
The target for Round 1 was to complete interviews with 6,300 households, of which 1,888 are located in urban areas and 4,475 in rural areas. In addition, at least 1,300 ethnic minority households were to be interviewed. A random selection of 6,300 households was made out of the 7,951 households with phone numbers for official interview, with the rest held for replacement. However, the refusal rate of the survey was about 27 percent, so randomly selected households from the small module in the same EA were contacted as replacements.
The do-file marital_spouselinks.do combines all data on people's marital statuses and reported spouses to create the following datasets:
1. all_marital_reports - a listing of all the times an individual has reported their current marital status, with the id numbers of the reported spouse(s); this listing is as reported, so it may include discrepancies (i.e. a 'Never married' status following a 'Married' one)
2. all_spouse_pairs_full - a listing of each time each spouse pair has been reported, plus summary information on co-residency for each pair
3. all_spouse_pairs_clean_summarised - this summarises the data from all_spouse_pairs_full to give start and end dates of unions
4. marital_status_episodes - this combines data from all the sources to create episodes of marital status; each has a start and end date and a marital status, and, if currently married, the spouse ids of the current spouse(s) if reported. There are several variables to indicate where each piece of information comes from.
The first 2 datasets are made available in case people need the 'raw' data for any reason (i.e. if they only want data from one study) or if they wish to summarise the data in a different way to what is done for the last 2 datasets.
The do-file is quite complicated, with many sources of data going through multiple processes to create the variables in the datasets, so it is not always straightforward to explain in the documentation where each variable comes from. The 4 datasets build on each other and the do-file is documented throughout, so anyone wanting to understand it in great detail may be better off examining the do-file itself. However, below is a brief description of how the datasets are created:
Marital status data are stored in the tables of the study they were collected in:

AHS Adult Health Study [ahs_ahs1]
CEN Census (initial CRS census) [cen_individ]
CENM In-migration (CRS migration form) [crs_cenm]
GP General form (filled in for various reasons) [gp_gpform]
SEI Socio-economic individual (annual survey from 2007 onwards) [css_sei]
TBH TB household (study of household contacts of TB patients) [tb_tbh]
TBO TB controls (matched controls for TB patients) [tb_tbo & tb_tboto2007]
TBX TB cases (TB patients) [tb_tbx & tb_tbxto2007]

In many of the above surveys, as well as reporting their current marital status, people were asked to report their current and past spouses along with (sometimes) some information about the marriage (start/end year etc.). These data are stored together in the table gen_spouse, with variables indicating which study the data came from. Further evidence of spousal relationships is taken from gen_identity (if a couple appear as co-parents of a CRS member) and from crs_residency_episodes_clean_poly, a combined dataset (if they are living in the same household at the same time). Note that co-parent couples who are not reported in gen_spouse are only retained in the datasets if they have co-resident episodes.
The marital status data are appended together and the spouse id data merged in. Minimal data editing/cleaning is carried out. As the spouse data are in long format, this dataset is reshaped wide to have one line per marital status report (polygamy in the area allows for men to have multiple spouses at one time): this dataset is saved as all_marital_reports.
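The actual work is done in the Stata do-file; purely as an illustration of that long-to-wide reshape, with invented variable names:

```r
library(dplyr)
library(tidyr)

# Toy long-format spouse reports: one row per reported spouse per marital-status report
spouse_long <- tibble::tribble(
  ~report_id, ~spouse_order, ~idspouse,
  "R1",       1,             "P100",
  "R2",       1,             "P200",
  "R2",       2,             "P300"   # polygamous union: two spouses in one report
)

# One line per marital-status report, with one idspouse column per reported spouse
spouse_wide <- spouse_long %>%
  pivot_wider(names_from = spouse_order, values_from = idspouse,
              names_prefix = "idspouse_")
```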
The list of reported spouses on gen_spouse is appended to a list of co-parents (from gen_identity), and this list is cleaned to try to identify and remove obvious id errors (incestuous links, same-sex links [these are not reported in this culture], and large age differences). Data reported by men and women are compared and variables created to show whether one or both of the couple report the union. Many records have information on the start and end year of the marriage, and all have the date the union was reported. This listing is compared to data from residency episodes to add the dates that couples were living together (not all records have start/end dates, so this helps supplement them); in addition, the dates that each member of the couple was last known to be alive or first known to be dead are added (also from the residency data). This dataset, with all the records available for each spouse pair, is saved as all_spouse_pairs_full.
The date data from all_spouse_pairs_full are then summarised to get one line per couple with earliest and latest known married date for all, and, if available, marriage and separation date. For each date there are also variables created to indicate the source of the data.
As the culture only allows women to have one spouse at a time, records for women with 'overlapping' husbands are cleaned. This dataset is then saved as all_spouse_pairs_clean_summarised.
Both the cleaned spouse pairs and the cleaned marital status datasets are converted into episodes: for the spouse listing, the marriage date or first known married date is used as the beginning, and the last known married date plus a year, or the separation date, as the end; the marital status records are collapsed into periods over which the same status is reported (after some cleaning to remove impossible reports), with the start date being the first of these reports and the end date the last of them plus a year. These episodes are appended together and a series of processes is run several times to remove overlapping episodes. To be able to assign specific spouse ids to each married episode, some episodes need to be 'split' into more than one (e.g., if a man is married to one woman from 2005 to 2017 and then marries another woman in 2008 and remains married to her until 2017, his initial married episode would run from 2005 to 2017, but this would need to be split into one episode from 2005 to 2008 with 1 idspouse attached and another from 2008 to 2017 with 2 idspouse attached). After this splitting process the spouse ids are merged in.
The final episode dataset is saved as marital_status_episodes.
Individual
Face-to-face [f2f]
Link Function: information
https://datos.madrid.es/egob/catalogo/aviso-legal
The City Council of Madrid aims to promote quality of life in the city, and urban cleanliness is one of its main aspects. This survey is carried out to incorporate citizens' opinions on how dirty the streets are. The 'Associated documentation' section includes the data structure file (record layout, values, and field structure of the results file), the data sheet, and the questionnaire.
Capstone case study from Google Data Analytics Professional Certificate program.
This dataset was collected by Motivate International Inc. I've included only the last 12 months, from November 2020 to October 2021.
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Moreno, the director of marketing and your manager, has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You will produce a report with the following deliverables:
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top three recommendations based on your analysis
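One possible starting point for that analysis in R is sketched below. The file location and column names (started_at, ended_at, member_casual) are assumptions based on typical bike-share trip exports, not a prescribed solution.

```r
library(dplyr)
library(lubridate)

# Combine the 12 monthly trip files (location and names assumed)
files <- list.files("trip_data", pattern = "\\.csv$", full.names = TRUE)
trips <- bind_rows(lapply(files, read.csv))

# Derive ride length and day of week, then compare casual riders with annual members
usage_by_day <- trips %>%
  mutate(
    started_at  = ymd_hms(started_at),
    ended_at    = ymd_hms(ended_at),
    ride_length = as.numeric(difftime(ended_at, started_at, units = "mins")),
    day_of_week = wday(started_at, label = TRUE)
  ) %>%
  filter(ride_length > 0) %>%
  group_by(member_casual, day_of_week) %>%
  summarise(rides = n(),
            mean_ride_length = mean(ride_length, na.rm = TRUE),
            .groups = "drop")
```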
https://creativecommons.org/publicdomain/zero/1.0/
Business task: Provide a high-level recommendation to help guide Bellabeat’s marketing strategy to unlock new growth opportunities.
Key stakeholders:
- Urška Sršen, cofounder and Chief Creative Officer of Bellabeat
- Sando Mur, mathematician and Bellabeat cofounder
Data sources used: FitBit Fitness Tracker Data (https://www.kaggle.com/arashnic/fitbit), CC0: Public Domain, dataset made available through Mobius.
Documentation of any cleaning or manipulation of data
RStudio Cloud is the best tool for this project due to the data size. Packages used:
install.packages("lubridate")
install.packages("ggplot2")
install.packages("dplyr")
library(lubridate)  # date parsing and manipulation
library(ggplot2)    # visualization
library(dplyr)      # data wrangling
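With these packages loaded, a first cleaning step might look like the sketch below; the file name and columns refer to the daily activity table in the FitBit dataset and should be treated as assumptions.

```r
# Load the daily activity table from the FitBit Fitness Tracker Data
daily_activity <- read.csv("dailyActivity_merged.csv")

daily_clean <- daily_activity %>%
  mutate(ActivityDate = mdy(ActivityDate)) %>%  # parse dates with lubridate
  distinct() %>%                                # drop exact duplicate rows
  filter(TotalSteps > 0)                        # drop days with no recorded activity

# Average daily steps by weekday
daily_clean %>%
  mutate(weekday = wday(ActivityDate, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(avg_steps = mean(TotalSteps), .groups = "drop")
```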
https://dataintelo.com/privacy-and-policy
According to our latest research, the global smart clean-in-place (CIP) skid market size was valued at USD 1.54 billion in 2024, with a robust growth trajectory anticipated over the coming years. The market is projected to reach USD 3.47 billion by 2033, expanding at a compelling CAGR of 9.4% from 2025 to 2033. This significant growth is primarily driven by the increasing demand for automation and efficiency in cleaning processes across industries such as food & beverage, pharmaceuticals, and chemicals. As per the latest research, the integration of advanced sensors and controllers, coupled with stringent hygiene regulations, continues to propel the adoption of smart CIP skids globally.
A primary growth factor for the smart clean-in-place skid market is the escalating emphasis on food safety and regulatory compliance. Industries such as food & beverage and pharmaceuticals are under constant scrutiny to maintain high standards of cleanliness and sanitation in their production environments. The implementation of smart CIP skids helps companies adhere to these regulations by delivering precise, repeatable, and validated cleaning cycles, thereby minimizing the risk of contamination and product recalls. Additionally, the ability of these systems to automate cleaning protocols reduces the need for manual intervention, further enhancing operational efficiency and ensuring that hygiene standards are consistently met. The growing consumer awareness regarding food safety and the increasing stringency of global health regulations are compelling manufacturers to invest in advanced CIP technologies, fueling market growth.
Another significant driver is the rising trend of process optimization and resource efficiency within industrial operations. Smart CIP skids are equipped with advanced components such as sensors, controllers, and automated valves, which enable real-time monitoring and control of cleaning parameters. This technological advancement leads to substantial savings in water, energy, and cleaning agents, aligning with the sustainability goals of modern enterprises. Moreover, the integration of data analytics and IoT connectivity allows for predictive maintenance and performance optimization, reducing downtime and operational costs. As industries continue to prioritize sustainable practices and cost reduction, the adoption of intelligent CIP solutions is expected to accelerate, further bolstering market expansion over the forecast period.
The rapid pace of digital transformation and Industry 4.0 initiatives is also playing a pivotal role in shaping the smart CIP skid market landscape. Manufacturers are increasingly leveraging automation and digitalization to enhance production flexibility, traceability, and quality assurance. Smart CIP skids, with their ability to seamlessly integrate into existing manufacturing execution systems (MES) and supervisory control and data acquisition (SCADA) platforms, offer unparalleled benefits in terms of process transparency and control. This integration not only improves cleaning validation and documentation but also supports remote monitoring and diagnostics, enabling swift response to process deviations. The ongoing adoption of smart factory concepts and the proliferation of connected devices are expected to create new avenues for market growth, especially in regions with advanced industrial infrastructure.
From a regional perspective, Asia Pacific is emerging as a key growth engine for the smart clean-in-place skid market, driven by rapid industrialization and expanding manufacturing sectors in countries such as China, India, and Southeast Asia. North America and Europe continue to lead in terms of technological innovation and regulatory compliance, with established players investing heavily in automation and digitalization. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in food processing and pharmaceutical manufacturing. The diverse regional dynamics and varying adoption rates underscore the global nature of the smart CIP skid market, with each region presenting unique opportunities and challenges for stakeholders.
Sensors play a foundational role in the smart clean-in-place skid market, serving as the primary means of collecting real-time data on critical process parameters such as temperature, pressure, flow rate, and chemical concentration. The i
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cleaning indicators are widely used to evaluate the efficacy of cleaning processes in automated washer-disinfectors (AWDs) in healthcare settings. In this study, we systematically analyzed the performance of commercial indicators across multiple simulated cleaning protocols to guide the correct selection of suitable cleaning indicators in Central Sterile Supply Departments (CSSD). Eleven commercially available cleaning indicators were tested in five cleaning simulations, P0 to P4, where P1 represented the standard cleaning process in CSSD, while P2-P4 incorporated induced-error cleaning processes to mimic real-world errors. All indicators were uniformly positioned at the top level of the cleaning rack to ensure comparable exposure. Key parameters, including indicator response dynamics (e.g., wash-off sequence) and final residue results, were documented throughout the cleaning cycles. The final wash-off results given by the indicators under P0, in which no detergent was injected, were much worse than those of the other four processes. Under different simulations, the final results of the indicators and their wash-off sequences changed substantially. In conclusion, an effective indicator must be selected experimentally. The last indicator to be washed off during the normal cleaning process that can simultaneously clearly show the presence of dirt residue under induced error conditions is the optimal indicator for monitoring cleaning processes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings is an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PA: physical activity. Here we show only the first interview data for variables used as time-fixed in the model (height, education and smoking—following the change suggested by IDA) and remove the observations missing by design.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential for understanding how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as the prevalence of disease. However, these data need to be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross-examination and considerable effort but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 134,310 participants and 4,740 variables. The variables convey 1) demographic information, 2) dietary consumption, 3) physical examination results, 4) occupation, 5) questionnaire items (e.g., physical activity, general health status, medical conditions), 6) medications, 7) mortality status linked from the National Death Index, 8) survey weights, 9) environmental exposure biomarker measurements, and 10) chemical comments that indicate which measurements are below or above the lower limit of detection. We also provide a data dictionary listing the variables and their descriptions to help researchers browse the data, as well as R Markdown files with example code for calculating summary statistics and running regression models, to help accelerate high-throughput analysis of the exposome and secular trends in cancer mortality.

csv Data Record: The curated NHANES datasets and the data dictionaries include 13 .csv files and 1 Excel file. The curated NHANES datasets involve 10 .csv files, one for each module, labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. The eleventh file is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 4,740 variables in NHANES ("dictionary_nhanes.csv"). The 12th csv file contains the harmonized categories for the categorical variables ("dictionary_harmonized_categories.csv"). The 13th file contains the dictionary for descriptors on the drug codes ("dictionary_drug_codes.csv"). The 14th file is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES datasets ("nhanes_inconsistencies_documentation.xlsx").

R Data Record: For researchers who want to conduct their analysis in the R programming language, the curated NHANES datasets and the data dictionaries can be downloaded as a .zip file which includes an .RData file and an .R file. The .RData file ("w - nhanes_1988_2018.RData") contains all the aforementioned datasets as R data objects; it also makes available all R scripts for the customized functions that were written to curate the data. The .R file ("m - nhanes_1988_2018.R") shows how we used the customized functions (i.e., our pipeline) to curate the data.