9 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    153 scholarly articles cite this dataset (view in Google Scholar)
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. MRO Data Cleansing and Enrichment Service Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 10, 2025
    + more versions
    Cite
    Market Report Analytics (2025). MRO Data Cleansing and Enrichment Service Report [Dataset]. https://www.marketreportanalytics.com/reports/mro-data-cleansing-and-enrichment-service-76185
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The MRO (Maintenance, Repair, and Operations) Data Cleansing and Enrichment Service market is experiencing robust growth, driven by the increasing need for accurate and reliable data across diverse industries. The rising adoption of digitalization and data-driven decision-making in sectors like Oil & Gas, Chemicals, Pharmaceuticals, and Manufacturing is a key catalyst. Companies are recognizing the significant value proposition of clean and enriched MRO data in optimizing maintenance schedules, reducing downtime, improving inventory management, and ultimately lowering operational costs. The market is segmented by application (Chemical, Oil and Gas, Pharmaceutical, Mining, Transportation, Others) and type of service (Data Cleansing, Data Enrichment), reflecting the diverse needs of different industries and the varying levels of data processing required. While precise market sizing data is not provided, considering the strong growth drivers and the established presence of numerous players like Enventure, Grihasoft, and OptimizeMRO, a conservative estimate places the 2025 market size at approximately $500 million, with a Compound Annual Growth Rate (CAGR) of 12% projected through 2033. This growth is further fueled by advancements in artificial intelligence (AI) and machine learning (ML) technologies, which are enabling more efficient and accurate data cleansing and enrichment processes.

    The competitive landscape is characterized by a mix of established players and emerging companies. Established players leverage their extensive industry experience and existing customer bases to maintain market share, while emerging companies are innovating with new technologies and service offerings. Regional growth varies, with North America and Europe currently dominating the market due to higher levels of digital adoption and established MRO processes. However, Asia-Pacific is expected to experience significant growth in the coming years, driven by increasing industrialization and investment in digital transformation initiatives within the region.

    Challenges to market growth include data security concerns, the integration of new technologies with legacy systems, and the need for skilled professionals capable of managing and interpreting large datasets. Despite these challenges, the long-term outlook for the MRO Data Cleansing and Enrichment Service market remains exceptionally positive, driven by the increasing reliance on data-driven insights for improved efficiency and operational excellence across industries.

  3. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.

    *Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on quantifying the meaning of research texts, and is made available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: the list of authors of the paper
    2. Title: the title of the paper
    3. Abstract: the abstract of the paper
    4. Categories: one or more categories from the list of categories [2]; the full list is presented in the file ‘List_of_Categories.txt’
    5. Research Areas: one or more research areas from the list of research areas [3]; the full list is presented in the file ‘List_of_Research_Areas.txt’
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains citation counts from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading the Data Online

    The dataset was collected manually by exporting documents online as tab-delimited files. All documents are available online.

    Step 2: Importing the Dataset to R

    The LSC was collected as TXT files. All documents are extracted to R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category

    As our research is based on the analysis of abstracts and categories, all documents with an empty abstract and all documents without categories are removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts

    Medicine-related publications in particular use ‘structured abstracts’, divided into sections with distinct headings such as introduction, aim, objective, method, result and conclusion. The tool used for extracting abstracts concatenates these section headings with the first word of the section, producing words such as ‘ConclusionHigher’ and ‘ConclusionsRT’. Such words were detected and identified by sampling medicine-related publications with human intervention, and each detected concatenated word was split into two words; for instance, ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings appearing in such abstracts are listed below:

    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

    After correction, the lengths of abstracts were calculated. ‘Length’ is the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we limited the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid length effects on the analysis.
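    The corpus documentation describes Steps 4 and 5 only in prose; below is a minimal Python sketch of that kind of logic, under simplifying assumptions: the heading list is a short illustrative subset of the full list above, the regex stands in for the human-assisted detection actually used, and the word count approximates the Microsoft Word rule by whitespace splitting.

    import re

    # Illustrative subset of the section headings listed above (Step 4).
    # Longer variants precede shorter prefixes so the regex prefers them.
    HEADINGS = ["Conclusions", "Conclusion", "Results", "Result",
                "Methods", "Method", "Background", "Objectives", "Objective"]
    FUSED = re.compile(r"\b(" + "|".join(HEADINGS) + r")([A-Z][a-z]+)")

    def split_fused_headings(abstract: str) -> str:
        """Split concatenated words such as 'ConclusionHigher' into 'Conclusion Higher'."""
        return FUSED.sub(r"\1 \2", abstract)

    def within_length_bounds(abstract: str, lo: int = 30, hi: int = 500) -> bool:
        """Step 5: keep abstracts of 30-500 words (whitespace word count)."""
        return lo <= len(abstract.split()) <= hi

    print(split_fused_headings("ConclusionHigher doses improved outcomes."))
    # -> 'Conclusion Higher doses improved outcomes.'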

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Journals and conferences can place a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors’ rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, copyright notices such as ‘Published by Elsevier Ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we cleaned such sentences and phrases from the abstracts of LSC Version 1, removing copyright notices, conference names, journal names, authors’ rights, licences and permission policies identified by sampling abstracts. (A minimal sketch of this footer stripping is given after the references below.)

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

    The cleaning procedure described in the previous step left some abstracts below our minimum length criterion (30 words); 474 such texts were removed.

    Step 8: Saving the Dataset into CSV Format

    Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, with the abstract, title, list of authors, list of categories, list of research areas, and times cited recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.

    References

    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
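    The footer-stripping sketch referenced in Step 6 above, again as an illustration rather than the documented procedure: it assumes footers can be recognised by characteristic phrases, and the marker list is seeded only by the ‘Published by Elsevier Ltd.’ example from the description.

    import re

    # Illustrative footer markers; the real list was built by sampling abstracts.
    FOOTER_MARKERS = [
        r"Published by Elsevier",
        r"All rights reserved",
        r"\(C\) \d{4}",        # e.g. '(C) 2014 ...' copyright lines
        r"Copyright \d{4}",
    ]
    FOOTER = re.compile("(" + "|".join(FOOTER_MARKERS) + ").*$", re.IGNORECASE)

    def strip_footer(abstract: str) -> str:
        """Drop everything from the first footer marker to the end of the text."""
        return FOOTER.sub("", abstract).rstrip()

    text = "We report improved outcomes. Published by Elsevier Ltd. All rights reserved."
    print(strip_footer(text))  # -> 'We report improved outcomes.'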

  4. Household Income and Expenditure 2010 - Tuvalu

    • catalog.ihsn.org
    • dev.ihsn.org
    Updated Mar 29, 2019
    Cite
    Central Statistics Division (2019). Household Income and Expenditure 2010 - Tuvalu [Dataset]. http://catalog.ihsn.org/catalog/3203
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Central Statistics Division
    Time period covered
    2010
    Area covered
    Tuvalu
    Description

    Abstract

    The main objectives of the survey were:
    • To obtain weights for the revision of the Consumer Price Index (CPI) for Funafuti;
    • To provide information on the nature and distribution of household income, expenditure and food consumption patterns;
    • To provide data on the household sector's contribution to the National Accounts;
    • To provide information on the economic activity of men and women to study gender issues;
    • To undertake some poverty analysis.

    Geographic coverage

    National, including Funafuti and Outer islands

    Analysis unit

    • Household
    • individual

    Universe

    All private households are included in the sampling frame. In each selected household, the current residents are surveyed, as well as usual residents who are currently away (for work, health or holiday reasons, or boarding students, for example). If the household has been residing in Tuvalu for less than one year:
    • if they intend to reside more than 12 months => the household is included;
    • if they do not intend to reside more than 12 months => out of scope.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    It was decided that a 33% (one-third) sample was sufficient to achieve suitable levels of accuracy for key estimates in the survey, so the sample selection was spread proportionally across all the islands except Niulakita, which was considered too small. For selection purposes, each island was treated as a separate stratum and independent samples were selected from each. The strategy used was to list each dwelling on the island by its geographical position and run a systematic skip through the list to achieve the 33% sample. This approach ensured that the sample was spread across each island as much as possible and was thus more representative.

    For details please refer to Table 1.1 of the Report.
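    The selection strategy is described above only in prose; here is a minimal sketch of a 1-in-3 systematic skip over a geographically ordered dwelling list. The dwelling list and random start are illustrative assumptions.

    import random

    def systematic_sample(dwellings: list, fraction: float = 1 / 3) -> list:
        """Select roughly `fraction` of dwellings with a systematic skip.

        `dwellings` is assumed to be ordered by geographical position, so an
        every-k-th selection spreads the sample across the island (one stratum).
        """
        k = round(1 / fraction)        # skip interval: every 3rd dwelling
        start = random.randrange(k)    # random start within the first interval
        return dwellings[start::k]

    # Illustrative use: one stratum (island) with 90 listed dwellings.
    island_dwellings = [f"dwelling_{i:03d}" for i in range(90)]
    print(len(systematic_sample(island_dwellings)))  # 30 dwellings, i.e. a 33% sample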

    Sampling deviation

    Only the island of Niulakita was not included in the sampling frame, as it was considered too small.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Three main survey forms were used to collect data for the survey. Each question is written in English and translated into Tuvaluan on the same version of the questionnaire. The questionnaires were designed based on the 2004 survey questionnaire.

    HOUSEHOLD FORM:
    • composition of the household and demographic profile of each member
    • dwelling information
    • dwelling expenditure
    • transport expenditure
    • education expenditure
    • health expenditure
    • land and property expenditure
    • household furnishings
    • home appliances
    • cultural and social payments
    • holidays/travel costs
    • loans and savings
    • clothing
    • other major expenditure items

    INDIVIDUAL FORM:
    • health and education
    • labor force (individuals aged 15 and above)
    • employment activity and income (individuals aged 15 and above): wages and salaries, working own business, agriculture and livestock, fishing, income from handicrafts, income from gambling, small-scale activities, jobs in the last 12 months, other income, children's income, tobacco and alcohol use, other activities, and seafarers

    DIARY (one diary per week over a 2-week period; 2 diaries per household were required):
    • all kinds of expenses
    • home production
    • food and drink (eaten by the household, given away, sold)
    • goods taken from own business (consumed, given away)
    • monetary gifts (given away, received, winnings from gambling)
    • non-monetary gifts (given away, received, winnings from gambling)

    Questionnaire Design Flaws

    Questionnaire design flaws are problems with the way questions were worded that result in incorrect answers from respondents. Despite every effort to minimize this problem during the design of the survey questionnaires and the diaries, problems were still identified during the analysis of the data. Some examples are provided below:

    Gifts, Remittances & Donations

    Collecting information on the receipt and provision of gifts, the receipt and provision of remittances, and the provision of donations to the church, other communities and family occasions is a very difficult task in a HIES. The extent of these activities in Tuvalu is very high, so every effort should be made to cover them as well as possible. A key problem lies in identifying the best form (questionnaire or diary) for covering such activities. A general rule of thumb for a HIES is that if an activity occurs on a regular basis and involves the exchange of small monetary amounts or in-kind gifts, the diary is more appropriate; if the activity is less frequent and involves larger sums of money, the questionnaire with a recall approach is preferred. It is not always easy to distinguish between the two for the different activities, and as such both the diary and the questionnaire were used to collect this information. Unfortunately it probably wasn't made clear enough what types of transactions were being collected from the different sources, so some transactions might have been missed and others counted twice. The effects of these problems are hopefully minimal overall.

    Defining Remittances

    Because people have different interpretations of what constitutes a remittance, the questionnaire needs to be very clear about how this concept is defined in the survey. Unfortunately this wasn't explained clearly enough, so it was difficult to distinguish between a remittance, which should be of a more regular nature, and a one-off monetary gift transferred between two households.

    Business Expenses Still Recorded

    The aim of the survey is to measure "household" expenditure, so any expenditure made by a household on an item or service used primarily for a business activity should be excluded. The questionnaire did not always make this clear, and as such some business expenses were included. Efforts were made during data cleaning to remove any such business expenses that would significantly affect survey results.

    Purchased goods given away as a gift

    When a household gives away an item it has purchased, this is recorded in section 5 of the diary. Unfortunately it was difficult to know how to treat these items, as it was not clear whether they had already been recorded in section 1 of the diary, which covers purchases. The decision was made to exclude all gifts given that were considered to be purchases, as these items were assumed to have already been recorded in section 1. Ideally these items should be treated as purchased gifts given away, which in turn are not household consumption expenditure, but this was not possible.

    Some key items missed in the Questionnaire

    Although not a big issue, some key expenditure items were omitted from the questionnaire when it would have been best to collect them via this schedule; a key example is electric fans, which many households in Tuvalu own.

    Cleaning operations

    Consistency of the data:
    • each questionnaire was checked by the supervisor during and after collection;
    • before data entry, all questionnaires were coded;
    • the CSPro data entry system included inconsistency checks, allowing NSO staff to spot some errors and correct them with imputation estimates from their own knowledge (there was no time for double entry; there were 4 data entry operators);
    • after data entry, outliers were identified in order to check their consistency (a minimal sketch of such a check follows below).
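    The report does not specify how outliers were flagged; as one common screening approach (an assumption, not the NSO's documented method), this sketch applies an interquartile-range rule to diary expenditure amounts. The values are invented.

    def iqr_outliers(values, k=1.5):
        """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for manual checking."""
        xs = sorted(values)
        n = len(xs)
        q1, q3 = xs[n // 4], xs[(3 * n) // 4]  # rough quartiles, fine for screening
        spread = q3 - q1
        lo, hi = q1 - k * spread, q3 + k * spread
        return [x for x in values if x < lo or x > hi]

    # Illustrative: weekly food expenditure entries from diaries.
    weekly_food = [35.0, 42.5, 38.0, 40.0, 41.0, 39.5, 37.0, 310.0]
    print(iqr_outliers(weekly_food))  # -> [310.0], sent back for verification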

    All data entry, including editing, edit checks and queries, was done using CSPro (Census and Survey Processing System), with additional data editing and cleaning taking place in Excel.

    The staff from the CSD was responsible for undertaking the coding and data entry, with assistance from an additional four temporary staff to help produce results in a more timely manner.

    Although enumeration was not completed until mid June, coding and data entry commenced as soon as forms were available from Funafuti, towards the end of March. Coding and data entry were then completed around the middle of July.

    A visit from an SPC consultant then took place to undertake initial cleaning of the data, primarily addressing missing data items and missing schedules. Once the initial data cleaning was undertaken in CSPro, data was transferred to Excel where it was closely scrutinized to check that all responses were sensible. In the cases where unusual values were identified, original forms were consulted for these households and modifications made to the data if required.

    Despite the best efforts being made to clean the data file in preparation for the analysis, no doubt errors will still exist in the data, due to its size and complexity. Having said this, they are not expected to have significant impacts on the survey results.

    Under-Reporting and Incorrect Reporting as a Result of Poor Fieldwork Procedures

    The most crucial stage of any survey activity, whether a population census or a survey such as a HIES, is the fieldwork. It is crucial for intense checking to take place in the field before survey forms are returned to the office for data processing. Unfortunately, it became evident during the cleaning of the data that fieldwork wasn't checked as thoroughly as required, and as such some unexpected values appeared in the questionnaires, as well as unusual results in the diaries. Efforts were made to identify the main issues that would have the greatest impact on final results, and this information was modified to a more reasonable answer, using local knowledge, when required.

    Data Entry Errors

    Data entry errors are always expected, but can be kept to a minimum with

  5. Supporting Clean-Up of Contaminated Sites with Decision Analysis: A Case Study on Prioritization of Remediation Alternatives in Superfund

    • catalog.data.gov
    • datasets.ai
    Updated Dec 6, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Supporting Clean-Up of Contaminated Sites with Decision Analysis: A Case Study on Prioritization of Remediation Alternatives in Superfund [Dataset]. https://catalog.data.gov/dataset/supporting-clean-up-of-contaminated-sites-with-decision-analysis-a-case-study-on-prioritiz
    Explore at:
    Dataset updated
    Dec 6, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The summary from the detailed analysis of the case study in EPA (1988b) is provided in Table 3 of the manuscript and was used as the data source for the two datasets used in this study. These comprise a flat and a hierarchical structure of the five balancing criteria, shown in Table 4 and Table 5, respectively. Table 4 provides a comprehensive score for each balancing criterion, similar to the summary tables presented in the FS of Superfund sites (e.g., EPA 2016b, AECOM 2019). Table 5 uses the same information as Table 3, but in this case each piece of information is used to define multiple sub-criteria for each balancing criterion, except the cost one. This leads to a much more elaborate information table, with the four remaining balancing criteria now characterized by 13 sub-criteria.

    It is important to note that the scoring provided in Table 4 and Table 5, with the exception of the cost (c_5), was derived from the authors' interpretation of the descriptive language of the detailed analysis for the hypothetical case study presented in Table A-7 in Appendix A of the guidance document of EPA (1988b). It should also be noted that the analysis of the three remedy alternatives presented in this hypothetical case study is governed by site-specific characteristics and may not represent the potential performance of these remediation alternatives at other sites. The intent of this exercise is to illustrate the flexibility and adaptability of the MCDA process to address both the main, overarching criteria and sub-criteria that may have specific importance in the decision process for a particular site. Ultimately, the sub-criteria can be adapted to address specific stakeholder perspectives or technical factors linked to properties unique to the contaminant or to the physical characteristics of the site.

    This dataset is associated with the following publication: Cinelli, M., M.A. Gonzalez, R. Ford, J. McKernan, S. Corrente, M. Kadziński, and R. Słowiński. Supporting contaminated sites management with Multiple Criteria Decision Analysis: Demonstration of a regulation-consistent approach. Journal of Cleaner Production, 316: 128347, (2021).
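    The dataset description names the criteria structure but not the aggregation model; purely as a generic illustration (not necessarily the method used in the publication), the sketch below ranks three remedy alternatives on five balancing criteria with a simple weighted sum. All scores and weights are invented placeholders.

    # Hypothetical flat-structure scoring (Table 4 style); illustrative only.
    # c1..c5 stand in for the five balancing criteria; c5 is cost.
    weights = {"c1": 0.25, "c2": 0.25, "c3": 0.20, "c4": 0.15, "c5": 0.15}

    # Placeholder scores on a 1-5 scale (higher is better; cost already inverted).
    alternatives = {
        "Alternative A": {"c1": 4, "c2": 3, "c3": 5, "c4": 2, "c5": 3},
        "Alternative B": {"c1": 3, "c2": 4, "c3": 3, "c4": 4, "c5": 4},
        "Alternative C": {"c1": 5, "c2": 2, "c3": 4, "c4": 3, "c5": 2},
    }

    def weighted_score(scores: dict, w: dict) -> float:
        """Aggregate per-criterion scores into a single value."""
        return sum(w[c] * scores[c] for c in w)

    ranked = sorted(alternatives, key=lambda a: -weighted_score(alternatives[a], weights))
    for name in ranked:
        print(f"{name}: {weighted_score(alternatives[name], weights):.2f}")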

  6. Alpaca Cleaned

    • kaggle.com
    • huggingface.co
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca Cleaned

    Improving Pretrained Language Model Understanding

    By Huggingface Hub [source]

    About this dataset

    Alpaca is a dataset for fine-tuning your language models to better understand and follow instructions, taking you beyond standard Natural Language Processing (NLP) abilities. This curated, cleaned dataset provides over 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). The instruction, output, and input fields are designed to enhance every aspect of model comprehension, and the data has been through rigorous cleaning to remove errors and biases, so you can trust it to improve the performance of any language model that uses it. Get ready to see what Alpaca can do for your NLP needs!



    How to use the dataset

    This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

    The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

    To make the most out of this dataset it is recommended to:

    • Familiarize yourself with the instructions in the instruction column, as these provide guidance on how to use the other two columns, input and output.

    • Once comfortable with the instruction column, move on to exploring the instruction-output-input triplets included in this cleaned version of Alpaca.

    • Read through many examples, paying attention to any areas where more clarification could be added or further improvement made; bear in mind that these examples have already been cleaned of the errors and biases found in the original dataset.

    • Get inspired! With more than 52k sets provided, you have plenty of flexibility for varying training strategies or unique approaches when creating your own language model.

    • Finally, while not essential, it may help to be familiar with OpenAI's text-davinci engine and to enjoy playing around with different parameters/options depending on the type of outcomes you wish to achieve.

    Research Ideas

    • Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
    • Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
    • Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | instruction | This column contains the instructions for the language model. (Text) |
    | output | This column contains the expected output from the language model. (Text) |
    | input | This column contains the input given to the language model. (Text) |
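    As a quick illustration of working with these columns, here is a minimal sketch that loads train.csv with pandas and formats one record into an instruction-following prompt. The prompt template is an assumption (a common Alpaca-style layout), not something specified by this dataset card.

    import pandas as pd

    df = pd.read_csv("train.csv")  # columns: instruction, output, input

    def to_prompt(row: pd.Series) -> str:
        """Format one record as an Alpaca-style prompt (template is illustrative)."""
        if isinstance(row["input"], str) and row["input"].strip():
            return (f"### Instruction:\n{row['instruction']}\n\n"
                    f"### Input:\n{row['input']}\n\n### Response:\n{row['output']}")
        return (f"### Instruction:\n{row['instruction']}\n\n"
                f"### Response:\n{row['output']}")

    print(to_prompt(df.iloc[0]))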


  7. Clean Air Tracking System (CATS) Permits

    • catalog.data.gov
    • bronx.lehman.cuny.edu
    • +4more
    Updated Jul 26, 2025
    Cite
    data.cityofnewyork.us (2025). Clean Air Tracking System (CATS) Permits [Dataset]. https://catalog.data.gov/dataset/cats-permits
    Explore at:
    Dataset updated
    Jul 26, 2025
    Dataset provided by
    data.cityofnewyork.us
    Description

    Clean Air Tracking System (CATS) is an online application with an end-to-end process through which NYC residents can submit new boiler registrations, boiler registration renewals, affidavits, amendments, boiler work permits, inspection requests, emergency engine and generator registrations, gas station registrations, and industrial work permits. For additional context, see the external source for this dataset: https://a826-web01.nyc.gov/DEP.BoilerInformationExt/

  8. Auto Parts Cleanliness Analysis System Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 17, 2025
    Cite
    Data Insights Market (2025). Auto Parts Cleanliness Analysis System Report [Dataset]. https://www.datainsightsmarket.com/reports/auto-parts-cleanliness-analysis-system-1504771
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    May 17, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Auto Parts Cleanliness Analysis System market is experiencing robust growth, driven by the increasing demand for high-quality and reliable automotive components. Stringent quality control standards within the automotive industry necessitate the adoption of advanced cleanliness analysis systems to ensure optimal performance and longevity of vehicle parts. The market is segmented by application (commercial and business vehicles) and by the type of cleaning process (washed and dry cleaning), with the demand for washed parts analysis systems currently dominating due to established manufacturing processes. Technological advancements, such as the incorporation of automated systems and improved imaging techniques, are further propelling market growth. The rising adoption of sophisticated manufacturing processes and stricter emission norms are key factors driving demand for precise cleanliness analysis. This trend is particularly evident in regions with established automotive manufacturing hubs like North America and Europe, although the Asia-Pacific region is experiencing rapid growth fueled by increasing vehicle production and a burgeoning automotive supply chain. The competitive landscape is characterized by a mix of established players like Leica and Olympus alongside emerging technology providers, signifying ongoing innovation and market expansion. While the initial investment in these systems can be significant, the long-term benefits in terms of improved quality control, reduced production costs associated with defects, and enhanced brand reputation outweigh the initial expense, driving sustained market expansion.

    The forecast period (2025-2033) anticipates continued growth, primarily fueled by the expansion of the electric vehicle (EV) market. The stringent cleanliness requirements for EV components, particularly batteries and power electronics, are generating significant demand for advanced analysis systems. Furthermore, the increasing adoption of Industry 4.0 principles and the integration of smart manufacturing technologies will continue to drive the demand for automated and data-driven cleanliness analysis solutions. Challenges remain, however, including the high cost of advanced systems and the need for skilled personnel to operate and interpret the results. Despite these challenges, the overall market outlook remains positive, driven by the overarching need for improved quality control and efficiency within the automotive manufacturing sector. We project a sustained CAGR, reflecting the ongoing technological advancements and increasing regulatory pressures within the automotive industry.

  9. Clean dirty containers in Montevideo

    • kaggle.com
    Updated Aug 21, 2021
    Cite
    Rodrigo Laguna (2021). Clean dirty containers in Montevideo [Dataset]. https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rodrigo Laguna
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Montevideo
    Description

    Context

    It all started during #StayAtHome in the 2020 pandemic: some neighbors were worried about trash around Montevideo's containers.

    The goal is to automatically distinguish clean containers from dirty ones in order to request maintenance.

    Want to know more about the entire process? Check out this thread on how it began, and this other one on the version 6 update process.

    Content

    Data is split into independent training and testing sets. However, each split contains several near-duplicate images (typically the same container from different perspectives or on different days). Image sizes differ a lot among them.

    There are four major sources:
    • Images taken from Google Street View: 600x600 pixels, automatically collected through its API.
    • Images contributed by individual people, most of which I took myself.
    • Images taken from social networks (Twitter & Facebook) and news.
    • Images contributed by pormibarrio.uy (17-11-2020).

    Images were taken of green containers, the most popular in Montevideo but also widely used in some other cities.

    The current version (clean-dirty-garbage-containers-V6) is also available here, or you can download it as follows, which is especially useful in Google Colab:

    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1mdfJoOrO6MeTc3eMEjIDkAKlwK9bUFg6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1mdfJoOrO6MeTc3eMEjIDkAKlwK9bUFg6" -O clean-dirty-garbage-containers-V6.zip && rm -rf /tmp/cookies.txt

    This repo contains the code used during the dataset's construction and documentation, including the baselines for the proposed tasks.

    Dataset on news

    Since this is a hot topic in Montevideo, especially nowadays with elections next week, the dataset caught some attention from the local press.

    Acknowledgements

    Thanks to every single person who gave me images of their containers. Special thanks to my friend Diego, whose idea of using Google Street View as a data source really helped grow the dataset. And finally to my wife, who supported me during this project and contributed a lot to this dataset.

    Citation

    If you use these data in a publication, presentation, or other research project or product, please use the following citation:

    Laguna, Rodrigo. 2021. Clean dirty containers in Montevideo - Version 6.1. url: https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo

    @dataset{RLaguna-clean-dirty:2021,
    author = {Rodrigo Laguna},
    title = {Clean dirty containers in Montevideo},
    year = {2021},
    url = {https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo},
    version = {6.1}
    }
    

    Contact

    I'm on Twitter (@ro_laguna_), or write to me at r.laguna.queirolo at outlook.com.

    Future steps:

    • Add images from Mapillary, an open-source project similar to Google Street View.
    • Keep adding manually taken images.
    • Add images from anyone who would like to contribute.
    • Develop & deploy a bot to automatically report container status.
    • Translate docs to Spanish.
    • Crop images so that each contains one and only one container, taking up most of the image.

    Changelog

    • 19-05-2020: V1 - Initial version
    • 20-05-2020: V2 - Include more training samples
    • 12-09-2020: V3 - Include more training (+676) & testing (+64) samples:

      • train/clean from 574 to 1005 (+431)
      • train/dirty from 365 to 610 (+245)
      • test/clean from 100 to 128 (+28)
      • test/dirty from 100 to 136 (+36)
    • 21-12-2020: V4 - Include more training (+367) & testing (+794) samples, including ~400...

