100+ datasets found

A Journey through Data Cleaning
kaggle.com
zip
Updated Mar 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
kenanyafi (2024). A Journey through Data Cleaning [Dataset]. https://www.kaggle.com/datasets/kenanyafi/a-journey-through-data-cleaning
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 22, 2024
Authors
kenanyafi
Description
Embark on a transformative journey with our Data Cleaning Project, where we meticulously refine and polish raw data into valuable insights. Our project focuses on streamlining data sets, removing inconsistencies, and ensuring accuracy to unlock its full potential.

Through advanced techniques and rigorous processes, we standardize formats, address missing values, and eliminate duplicates, creating a clean and reliable foundation for analysis. By enhancing data quality, we empower organizations to make informed decisions, drive innovation, and achieve strategic objectives with confidence.

Join us as we embark on this essential phase of data preparation, paving the way for more accurate and actionable insights that fuel success."
d
B2B Data Cleansing Services - Verified Records - Updated Every 30 Days
datarade.ai
Updated Jan 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomson Data (2024). B2B Data Cleansing Services - Verified Records - Updated Every 30 Days [Dataset]. https://datarade.ai/data-products/thomson-data-hr-data-reach-hr-professionals-across-the-world-thomson-data
Explore at:
.csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Jan 8, 2022
Dataset authored and provided by
Thomson Data
Area covered
Panama, Zimbabwe, Eritrea, Palau, Denmark, Czech Republic, Bulgaria, Andorra, Micronesia (Federated States of), Finland
Description
At Thomson Data, we help businesses clean up and manage messy B2B databases to ensure they are up-to-date, correct, and detailed. We believe your sales development representatives and marketing representatives should focus on building meaningful relationships with prospects, not scrubbing through bad data.

Here are the key steps involved in our B2B data cleansing process:

Data Auditing: We begin with a thorough audit of the database to identify errors, gaps, and inconsistencies, which majorly revolve around identifying outdated, incomplete, and duplicate information.

Data Standardization: Ensuring consistency in the data records is one of our prime services; it includes standardizing job titles, addresses, and company names. It ensures that they can be easily shared and used by different teams.

Data Deduplication: Another way we improve efficiency is by removing all duplicate records. Data deduplication is important in a large B2B dataset as multiple records from the same company may exist in the database.

Data Enrichment: After the first three steps, we enrich your data, fill in the missing details, and then enhance the database with up-to-date records. This is the step that ensures the database is valuable, providing insights that are actionable and complete.

What are the Key Benefits of Keeping the Data Clean with Thomson Data’s B2B Data Cleansing Service? Once you understand the benefits of our data cleansing service, it will entice you to optimize your data management practices, and it will additionally help you stay competitive in today’s data-driven market.

Here are some advantages of maintaining a clean database with Thomson Data:

Better ROI for your Sales and Marketing Campaigns: Our clean data will magnify your precise targeting, enabling you to strategize for effective campaigns, increased conversion rate, and ROI.

Compliant with Data Regulations:
The B2B data cleansing services we provide are compliant to global data norms.

Streamline Operations: Your efforts are directed in the right channel when your data is clean and accurate, as your team doesn’t have to spend their valuable time fixing errors.

To summarize, we would again bring your attention to how accurate data is essential for driving sales and marketing in a B2B environment. It enhances your business prowess in the avenues of decision-making and customer relationships. Therefore, it is better to have a proactive approach toward B2B data cleansing service and outsource our offerings to stay competitive by unlocking the full potential of your data.

Send us a request and we will be happy to assist you.
d
Enviro-Champs Formshare Data Cleaning Tool
search.dataone.org
Updated Sep 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Udhav Maharaj (2024). Enviro-Champs Formshare Data Cleaning Tool [Dataset]. http://doi.org/10.7910/DVN/EA5MOI
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/EA5MOI
Dataset updated
Sep 24, 2024
Dataset provided by
Harvard Dataverse
Authors
Udhav Maharaj
Time period covered
Jan 1, 2023 - Jan 1, 2024
Description
A data cleaning tool customised for cleaning and sorting the data generated during the Enviro-Champs pilot study as they are downloaded from Formshare, the platform capturing data sent from a customised ODK Collect form collection app. The dataset inclues the latest data from the pilot study as at 14 May 2024.
Data Cleansing Software Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Data Cleansing Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-cleansing-software-market
Explore at:
pdf, csv, pptxAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Cleansing Software Market Outlook

The global data cleansing software market size was valued at approximately USD 1.5 billion in 2023 and is projected to reach around USD 4.2 billion by 2032, exhibiting a compound annual growth rate (CAGR) of 12.5% during the forecast period. This substantial growth can be attributed to the increasing importance of maintaining clean and reliable data for business intelligence and analytics, which are driving the adoption of data cleansing solutions across various industries.

The proliferation of big data and the growing emphasis on data-driven decision-making are significant growth factors for the data cleansing software market. As organizations collect vast amounts of data from multiple sources, ensuring that this data is accurate, consistent, and complete becomes critical for deriving actionable insights. Data cleansing software helps organizations eliminate inaccuracies, inconsistencies, and redundancies, thereby enhancing the quality of their data and improving overall operational efficiency. Additionally, the rising adoption of advanced analytics and artificial intelligence (AI) technologies further fuels the demand for data cleansing software, as clean data is essential for the accuracy and reliability of these technologies.

Another key driver of market growth is the increasing regulatory pressure for data compliance and governance. Governments and regulatory bodies across the globe are implementing stringent data protection regulations, such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States. These regulations mandate organizations to ensure the accuracy and security of the personal data they handle. Data cleansing software assists organizations in complying with these regulations by identifying and rectifying inaccuracies in their data repositories, thus minimizing the risk of non-compliance and hefty penalties.

The growing trend of digital transformation across various industries also contributes to the expanding data cleansing software market. As businesses transition to digital platforms, they generate and accumulate enormous volumes of data. To derive meaningful insights and maintain a competitive edge, it is imperative for organizations to maintain high-quality data. Data cleansing software plays a pivotal role in this process by enabling organizations to streamline their data management practices and ensure the integrity of their data. Furthermore, the increasing adoption of cloud-based solutions provides additional impetus to the market, as cloud platforms facilitate seamless integration and scalability of data cleansing tools.

Regionally, North America holds a dominant position in the data cleansing software market, driven by the presence of numerous technology giants and the rapid adoption of advanced data management solutions. The region is expected to continue its dominance during the forecast period, supported by the strong emphasis on data quality and compliance. Europe is also a significant market, with countries like Germany, the UK, and France showing substantial demand for data cleansing solutions. The Asia Pacific region is poised for significant growth, fueled by the increasing digitalization of businesses and the rising awareness of data quality's importance. Emerging economies in Latin America and the Middle East & Africa are also expected to witness steady growth, driven by the growing adoption of data-driven technologies.

The role of Data Quality Tools cannot be overstated in the context of data cleansing software. These tools are integral in ensuring that the data being processed is not only clean but also of high quality, which is crucial for accurate analytics and decision-making. Data Quality Tools help in profiling, monitoring, and cleansing data, thereby ensuring that organizations can trust their data for strategic decisions. As organizations increasingly rely on data-driven insights, the demand for robust Data Quality Tools is expected to rise. These tools offer functionalities such as data validation, standardization, and enrichment, which are essential for maintaining the integrity of data across various platforms and applications. The integration of these tools with data cleansing software enhances the overall data management capabilities of organizations, enabling them to achieve greater operational efficiency and compliance with data regulations.

Component Analysis

The data cle
B
Data Cleaning Sample
borealisdata.ca
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
t
Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...
researchdata.tuwien.at
html, pdf, zip
Updated Mar 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
Explore at:
html, zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.48436/m2ha4-t1v92
Dataset updated
Mar 19, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How To Cite?

Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

Folder Structure

The folder named “submission” contains the following:

“pythonProject”: This folder contains all the Python files and subfolders needed for analysis.

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

This folder contains the data used for the results reported in the paper.

Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

This folder will be generated when you run the code and will store the output of each step.

The current folder contains results created during code debugging for the submission.

When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

Contains reusable functions used throughout the analysis.

Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

Generates scatter plots like those in Figure 3 of the paper.

Although the code has been run for all 309 trials, it can be used to check the sample data provided.

Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.

Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

Implements overlapping sliding window segmentation and generates plots like those in Figure 4.

Output:

Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.

Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.

A visualization of the segments, similar to Figure 4, will be automatically generated.

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

These files compute features as explained in Tables 1 and 2 of the paper, respectively.

They process the segmented recordings generated by the overlapping_sliding_window_loop.py.

Usage: Just to know how the features are calculated, you can run this code after the segmentation with the sliding window and run these files to calculate the features from the segmented data.

5. training_prediction.py

This file contains the main machine learning analysis of the paper. This file contains all the code for the training of the model, its evaluation, and its use for the inference of the “monitoring part”. It covers the following steps:

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.

A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.

b. Training/Validation/Test Split

Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).

Make sure that you follow the instructions in the comments to the code exactly.

Output: The split data is saved as .csv files in the results folder.

c. Machine and Deep Learning Experiments

This part contains three main code blocks:

iii. One for the XGboost code with correct hyperparameter tuning:
Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically test the confidence threshold of

MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.

XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.

XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

d. Inference (Monitoring Part)

Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.

Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.

This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY, the code is licensed under MIT.
f
Full Dataset prior to Cleaning
figshare.com
zip
Updated Mar 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Paige Chesshire (2023). Full Dataset prior to Cleaning [Dataset]. http://doi.org/10.6084/m9.figshare.22455616.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22455616.v1
Dataset updated
Mar 31, 2023
Dataset provided by
figshare
Authors
Paige Chesshire
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes all of the data downloaded from GBIF (DOIs provided in README.md as well as below, downloaded Feb 2021) as well as data downloaded from SCAN. This dataset has 2,808,432 records and can be used as a reference to the verbatim data before it underwent the cleaning process. The only modifications made to this datset after direct download from the data portals are the following:

1) for GBIF records, I renamed the countryCode column to be "country" so that the column title is consistent across both GBIF and SCAN 2) A source column was added where I specify if the record came from GBIF or SCAN 3) Duplicate records across SCAN and GBIF were removed by identifying identical instances "catalogNumber" and "institutionCode" 4) Only the Darwin core columns (DwC) that were shared across downloaded datasets were retained. GBIF contained ~249 DwC variables, and SCAN data contained fewer, so this combined dataset only includes the ~80 columns shared between the two datasets

For GBIF, we downloaded the data in three separate chunks, therefore there are three DOIs. See below:

GBIF.org (3 February 2021) GBIF Occurrence Downloadhttps://doi.org/10.15468/dl.6cxfsw GBIF.org (3 February 2021) GBIF Occurrence Downloadhttps://doi.org/10.15468/dl.b9rfa7 GBIF.org (3 February 2021) GBIF Occurrence Downloadhttps://doi.org/10.15468/dl.w2nndm
o
Data Cleaning with OpenRefine
explore.openaire.eu
Updated Nov 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hao Ye (2020). Data Cleaning with OpenRefine [Dataset]. http://doi.org/10.5281/zenodo.6863001
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6863001
Dataset updated
Nov 9, 2020
Authors
Hao Ye
Description
OpenRefine (formerly Google Refine) is a powerful free and open source tool for data cleaning, enabling you to correct errors in the data, and make sure that the values and formatting are consistent. In addition, OpenRefine records your processing steps, enabling you to apply the same cleaning procedure to other data, and enhancing the reproducibility of your analysis. This workshop will teach you to use OpenRefine to clean and format data and automatically track any changes that you make.
d
Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users |...
datarade.ai
.json, .csv, .xls
Updated Mar 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Quadrant (2025). Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month [Dataset]. https://datarade.ai/data-products/mobile-location-data-asia-300m-unique-devices-100m-da-quadrant
Explore at:
.json, .csv, .xlsAvailable download formats
Dataset updated
Mar 21, 2025
Dataset authored and provided by
Quadrant
Area covered
Iran (Islamic Republic of), Armenia, Georgia, Israel, Palestine, Korea (Democratic People's Republic of), Oman, Bahrain, Kyrgyzstan, Philippines, Asia
Description
Quadrant provides Insightful, accurate, and reliable mobile location data.

Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.

These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.

We conduct stringent evaluations on data providers to ensure authenticity and quality. Our proprietary algorithms detect, and cleanse corrupted and duplicated data points – allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of both qualitative factors, as well as latency and other integrity variables to provide more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. This algorithm scours our data and identifies rows that contain the same combination of these four attributes. Post-identification, it retains a single copy and eliminates duplicate values to ensure our customers only receive complete and unique datasets.

We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.

Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.

Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.
D
Data Center Cleaning Service Report
marketresearchforecast.com
doc, pdf, ppt
Updated Jan 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Research Forecast (2025). Data Center Cleaning Service Report [Dataset]. https://www.marketresearchforecast.com/reports/data-center-cleaning-service-14735
Explore at:
pdf, doc, pptAvailable download formats
Dataset updated
Jan 24, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policyhttps://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The market for data center cleaning services is expected to grow from USD XXX million in 2025 to USD XXX million by 2033, at a CAGR of XX% during the forecast period 2025-2033. The growth of the market is attributed to the increasing number of data centers and the need to maintain these facilities in a clean environment. Data centers are critical to the functioning of the modern economy, as they house the servers that store and process vast amounts of data. Maintaining these facilities in a clean environment is essential to prevent the accumulation of dust and other contaminants, which can lead to equipment failures and downtime. The market for data center cleaning services is segmented by type, application, and region. By type, the market is segmented into equipment cleaning, ceiling cleaning, floor cleaning, and others. Equipment cleaning is the largest segment of the market, accounting for over XX% of the total market revenue in 2025. By application, the market is segmented into the internet industry, finance and insurance, manufacturing industry, government departments, and others. The internet industry is the largest segment of the market, accounting for over XX% of the total market revenue in 2025. By region, the market is segmented into North America, South America, Europe, the Middle East & Africa, and Asia Pacific. North America is the largest segment of the market, accounting for over XX% of the total market revenue in 2025.
Data Wrangling Market Analysis North America, Europe, APAC, Middle East and...
technavio.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio, Data Wrangling Market Analysis North America, Europe, APAC, Middle East and Africa, South America - US, UK, Germany, China, Japan - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/data-wrangling-market-industry-analysis
Explore at:
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
United States, Global
Description
Snapshot img

Data Wrangling Market Size 2024-2028

The data wrangling market size is forecast to increase by USD 1.4 billion at a CAGR of 14.8% between 2023 and 2028. The market is experiencing significant growth due to the numerous benefits provided by data wrangling solutions, including data cleaning, transformation, and enrichment. One major trend driving market growth is the rising need for technology such as the competitive intelligence and artificial intelligence in the healthcare sector, where data wrangling is essential for managing and analyzing patient data to improve patient outcomes and reduce costs. However, a challenge facing the market is the lack of awareness of data wrangling tools among small and medium-sized enterprises (SMEs), which limits their ability to effectively manage and utilize their data. Despite this, the market is expected to continue growing as more organizations recognize the value of data wrangling in driving business insights and decision-making.

What will be the Size of the Market During the Forecast Period?

Request Free Sample

The market is experiencing significant growth due to the increasing demand for data management and analysis in various industries. The market is experiencing significant growth due to the increasing volume, variety, and velocity of data being generated from various sources such as IoT devices, financial services, and smart cities. Artificial intelligence and machine learning technologies are being increasingly used for data preparation, data cleaning, and data unification. Data wrangling, also known as data munging, is the process of cleaning, transforming, and enriching raw data to make it usable for analysis. This process is crucial for businesses aiming to gain valuable insights from their data and make informed decisions. Data analytics is a primary driver for the market, as organizations seek to extract meaningful insights from their data. Cloud solutions are increasingly popular for data wrangling due to their flexibility, scalability, and cost-effectiveness.

Furthermore, both on-premises and cloud-based solutions are being adopted by businesses to meet their specific data management requirements. Multi-cloud strategies are also gaining traction in the market, as organizations seek to leverage the benefits of multiple cloud providers. This approach allows businesses to distribute their data across multiple clouds, ensuring business continuity and disaster recovery capabilities. Data quality is another critical factor driving the market. Ensuring data accuracy, completeness, and consistency is essential for businesses to make reliable decisions. The market is expected to grow further as organizations continue to invest in big data initiatives and implement advanced technologies such as AI and ML to gain a competitive edge. Data cleaning and data unification are key processes in data wrangling that help improve data quality. The finance and insurance industries are major contributors to the market, as they generate vast amounts of data daily.

In addition, real-time analysis is becoming increasingly important in these industries, as businesses seek to gain insights from their data in near real-time to make informed decisions. The Internet of Things (IoT) is also driving the market, as businesses seek to collect and analyze data from IoT devices to gain insights into their operations and customer behavior. Edge computing is becoming increasingly popular for processing IoT data, as it allows for faster analysis and decision-making. Self-service data preparation is another trend in the market, as businesses seek to empower their business users to prepare their data for analysis without relying on IT departments.

Moreover, this approach allows businesses to be more agile and responsive to changing business requirements. Big data is another significant trend in the market, as businesses seek to manage and analyze large volumes of data to gain insights into their operations and customer behavior. Data wrangling is a critical process in managing big data, as it ensures that the data is clean, transformed, and enriched to make it usable for analysis. In conclusion, the market in North America is experiencing significant growth due to the increasing demand for data management and analysis in various industries. Cloud solutions, multi-cloud strategies, data quality, finance and insurance, IoT, real-time analysis, self-service data preparation, and big data are some of the key trends driving the market. Businesses that invest in data wrangling solutions can gain a competitive edge by gaining valuable insights from their data and making informed decisions.

Market Segmentation

The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

Sec
l
LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
Explore at:
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v2
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)

April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk) Supervised by Prof Alexander Gorban and Dr Evgeny MirkesThe data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of cleaning procedure are explained in Step 6.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.Getting StartedThis text provides the information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the meaning of research texts and make it available for use in Natural Language Processing projects.LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:1. Authors: The list of authors of the paper2. Title: The title of the paper 3. Abstract: The abstract of the paper 4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’. 5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’. 6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4] 7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.Data ProcessingStep 1: Downloading of the Data Online

The dataset is collected manually by exporting documents as Tab-delimitated files online. All documents are available online.Step 2: Importing the Dataset to R

The LSC was collected as TXT files. All documents are extracted to R.Step 3: Cleaning the Data from Documents with Empty Abstract or without CategoryAs our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.Step 4: Identification and Correction of Concatenate Words in AbstractsEspecially medicine-related publications use ‘structured abstracts’. Such type of abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. Used tool for extracting abstracts leads concatenate words of section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. The detection and identification of such words is done by sampling of medicine-related publications with human intervention. Detected concatenate words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.The section headings in such abstracts are listed below:

Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material (s) Rationale(s) Implications for health and nursing policyStep 5: Extracting (Sub-setting) the Data Based on Lengths of AbstractsAfter correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].According to APA style manual [6], an abstract should contain between 150 to 250 words. In LSC, we decided to limit length of abstracts from 30 to 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of the length to the analysis.

Step 6: [Version 2] Cleaning Copyright Notices, Permission polices, Journal Names and Conference Names from LSC Abstracts in Version 1Publications can include a footer of copyright notice, permission policy, journal name, licence, author’s right or conference name below the text of abstract by conferences and journals. Used tool for extracting and processing abstracts in WoS database leads to attached such footers to the text. For example, our casual observation yields that copyright notices such as ‘Published by Elsevier ltd.’ is placed in many texts. To avoid abnormal appearances of words in further analysis of words such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licenses and permission policies identiﬁed by sampling of abstracts.Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of AbstractsThe cleaning procedure described in previous step leaded to some abstracts having less than our minimum length criteria (30 words). 474 texts were removed.Step 8: Saving the Dataset into CSV FormatDocuments are saved into 34 CSV files. In CSV files, the information is organised with one record on each line and parts of abstract, title, list of authors, list of categories, list of research areas, and times cited is recorded in fields.To access the LSC for research purposes, please email to ns433@le.ac.uk.References[1]Web of Science. (15 July). Available: https://apps.webofknowledge.com/ [2]WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [3]Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html [4]Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US [5]Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3 [6]A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
h
Data to Impact on various cleaning procedures on p-GaN surfaces
rodare.hzdr.de
Updated Feb 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Schaber, Jana; Xiang, Rong; Arnold, André; Ryzhov, Anton; Teichert, Jochen; Murcek, Petr; Zwartek, Paul; Ma, Shuai; Michel, Peter (2023). Data to Impact on various cleaning procedures on p-GaN surfaces [Dataset]. http://doi.org/10.14278/rodare.2168
Explore at:
Unique identifier
https://doi.org/10.14278/rodare.2168
Dataset updated
Feb 23, 2023
Authors
Schaber, Jana; Xiang, Rong; Arnold, André; Ryzhov, Anton; Teichert, Jochen; Murcek, Petr; Zwartek, Paul; Ma, Shuai; Michel, Peter
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder "XPS data" contains original and evaluated XPS data (.vms) on a p-GaN sample which was treated at various temperatures and underwent Ar+ irradiation.

Furthermore, the folder "REM Images" contains REM images (.tif) and EDX data (.xlsx) on the used excessively treated sample.

All images that are published in the main manuscript are collected as .tif files in the folder "images".
Household Survey on Information and Communications Technology– 2019 - West...
pcbs.gov.ps
Updated Mar 16, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Palestinian Central Bureau of Statistics (2020). Household Survey on Information and Communications Technology– 2019 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/489
Explore at:
Dataset updated
Mar 16, 2020
Dataset authored and provided by
Palestinian Central Bureau of Statisticshttp://pcbs.gov.ps/
Time period covered
2019
Area covered
Gaza, Gaza Strip, West Bank
Description
Abstract

The Palestinian society's access to information and communication technology tools is one of the main inputs to achieve social development and economic change to the status of Palestinian society; on the basis of its impact on the revolution of information and communications technology that has become a feature of this era. Therefore, and within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics in providing official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey for information and communications technology for the year 2019. The main objective of this report is to present the trends of accessing and using information and communication technology by households and individuals in Palestine, and enriching the information and communications technology database with indicators that meet national needs and are in line with international recommendations.

Geographic coverage

Palestine, West Bank, Gaza strip

Analysis unit

Household, Individual

Universe

All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.

Kind of data

Sample survey data [ssd]

Sampling procedure

Sampling Frame The sampling frame consists of master sample which were enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of the sampling selection.

Sample size The estimated sample size is 8,040 households.

Sample Design The sample is three stages stratified cluster (pps) sample. The design comprised three stages: Stage (1): Selection a stratified sample of 536 enumeration areas with (pps) method. Stage (2): Selection a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage (3): Selection one person of the (10 years and above) age group in a random method by using KISH TABLES.

Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, refugee camps).

Mode of data collection

Computer Assisted Personal Interview [capi]

Research instrument

Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.

Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.

Cleaning operations

Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once before the conducting of the training course by the project management where the notes and modifications were reflected on the program by the Data Processing Department after ensuring that it was free of errors before going to the field.

Using PC-tablet devices reduced data processing stages, and fieldworkers collected data and sent it directly to server, and project management withdraw the data at any time.

In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.

Data Cleaning After the completion of data entry and audit phase, data is cleaned by conducting internal tests for the outlier answers and comprehensive audit rules through using SPSS program to extract and modify errors and discrepancies to prepare clean and accurate data ready for tabulation and publishing.

Tabulation After finalizing checking and cleaning data from any errors. Tables extracted according to prepared list of tables.

Response rate

The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.

Sampling error estimates

Sampling Errors Data of this survey affected by sampling errors due to use of the sample and not a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variance were calculated for the most important indicators, There is no problem to disseminate results at the national level and at the level of the West Bank and Gaza Strip.

Non-Sampling Errors Non-Sampling errors are possible at all stages of the project, during data collection or processing. These are referred to non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, as well as practical and theoretical training during the training course.

The implementation of the survey encountered non-response where the case (household was not present at home) during the fieldwork visit become the high percentage of the non response cases. The total non-response rate reached 17.5%. The refusal percentage reached 2.9% which is relatively low percentage compared to the household surveys conducted by PCBS, and the reason is the questionnaire survey is clear.
Household Survey on Information and Communications Technology 2014 - West...
datacatalog.ihsn.org
catalog.ihsn.org
Updated Oct 14, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Palestinian Central Bureau of Statistics (2021). Household Survey on Information and Communications Technology 2014 - West Bank and Gaza [Dataset]. https://datacatalog.ihsn.org/catalog/9840
Explore at:
Dataset updated
Oct 14, 2021
Dataset authored and provided by
Palestinian Central Bureau of Statisticshttp://pcbs.gov.ps/
Time period covered
2014
Area covered
Palestine, Gaza, Gaza Strip, West Bank
Description
Abstract

Within the frame of PCBS' efforts in providing official Palestinian statistics in the different life aspects of Palestinian society and because the wide spread of Computer, Internet and Mobile Phone among the Palestinian people, and the important role they may play in spreading knowledge and culture and contribution in formulating the public opinion, PCBS conducted the Household Survey on Information and Communications Technology, 2014.

The main objective of this survey is to provide statistical data on Information and Communication Technology in the Palestine in addition to providing data on the following: - Prevalence of computers and access to the Internet. - Study the penetration and purpose of Technology use.

Geographic coverage

Palestine (West Bank and Gaza Strip), type of locality (urban, rural, refugee camps) and governorate.

Analysis unit

Household.

Persons 10 years and over .

Universe

All Palestinian households and individuals whose usual place of residence in Palestine with focus on persons aged 10 years and over in year 2014.

Kind of data

Sample survey data [ssd]

Sampling procedure

Sampling Frame The sampling frame consists of a list of enumeration areas adopted in the Population, Housing and Establishments Census of 2007. Each enumeration area has an average size of about 124 households. These were used in the first phase as Preliminary Sampling Units in the process of selecting the survey sample.

Sample Size The total sample size of the survey was 7,268 households, of which 6,000 responded.

Sample Design The sample is a stratified clustered systematic random sample. The design comprised three phases:

Phase I: Random sample of 240 enumeration areas. Phase II: Selection of 25 households from each enumeration area selected in phase one using systematic random selection. Phase III: Selection of an individual (10 years or more) in the field from the selected households; KISH TABLES were used to ensure indiscriminate selection.

Sample Strata Distribution of the sample was stratified by: 1- Governorate (16 governorates, J1). 2- Type of locality (urban, rural and camps).

Mode of data collection

Face-to-face [f2f]

Research instrument

The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.

Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

Section III: Data on persons (aged 10 years and over) about computer use, access to the Internet and possession of a mobile phone.

Cleaning operations

Preparation of Data Entry Program: This stage included preparation of the data entry programs using an ACCESS package and defining data entry control rules to avoid errors, plus validation inquiries to examine the data after it had been captured electronically.

Data Entry: The data entry process started on the 8th of May 2014 and ended on the 23rd of June 2014. The data entry took place at the main PCBS office and in field offices using 28 data clerks.

Editing and Cleaning procedures: Several measures were taken to avoid non-sampling errors. These included editing of questionnaires before data entry to check field errors, using a data entry application that does not allow mistakes during the process of data entry, and then examining the data by using frequency and cross tables. This ensured that data were error free; cleaning and inspection of the anomalous values were conducted to ensure harmony between the different questions on the questionnaire.

Response rate

Response Rates: 79%

Sampling error estimates

There are many aspects of the concept of data quality; this includes the initial planning of the survey to the dissemination of the results and how well users understand and use the data. There are three components to the quality of statistics: accuracy, comparability, and quality control procedures.

Checks on data accuracy cover many aspects of the survey and include statistical errors due to the use of a sample, non-statistical errors resulting from field workers or survey tools, and response rates and their effect on estimations. This section includes:

Statistical Errors Data of this survey may be affected by statistical errors due to the use of a sample and not a complete enumeration. Therefore, certain differences can be expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators.

Variance calculations revealed that there is no problem in disseminating results nationally or regionally (the West Bank, Gaza Strip), but some indicators show high variance by governorate, as noted in the tables of the main report.

Non-Statistical Errors Non-statistical errors are possible at all stages of the project, during data collection or processing. These are referred to as non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and practical and theoretical training took place during the training course. Training manuals were provided for each section of the questionnaire, along with practical exercises in class and instructions on how to approach respondents to reduce refused cases. Data entry staff were trained on the data entry program, which was tested before starting the data entry process.

Several measures were taken to avoid non-sampling errors. These included editing of questionnaires before data entry to check field errors, using a data entry application that does not allow mistakes during the process of data entry, and then examining the data by using frequency and cross tables. This ensured that data were error free; cleaning and inspection of the anomalous values were conducted to ensure harmony between the different questions on the questionnaire.

The sources of non-statistical errors can be summarized as: 1. Some of the households were not at home and could not be interviewed, and some households refused to be interviewed. 2. In unique cases, errors occurred due to the way the questions were asked by interviewers and respondents misunderstood some of the questions.
Surveys of Data Professionals (Alex the Analyst)
kaggle.com
Updated Nov 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 27, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Stewie
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
[Dataset Name] - About This Dataset

Overview

This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

Context

The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

Content

The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

Acknowledgements

The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

Citation

If you use this dataset in your work, please cite it as follows:

How to Use

Download the dataset from this link.

Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

Feel free to reach out for any additional information or clarification. Happy analyzing!
MOESM4 of A new semi-automated workflow for chemical data retrieval and...
springernature.figshare.com
xlsx
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Domenico Gadaleta; Anna Lombardo; Cosimo Toma; Emilio Benfenati (2023). MOESM4 of A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications [Dataset]. http://doi.org/10.6084/m9.figshare.7446905.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7446905.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Domenico Gadaleta; Anna Lombardo; Cosimo Toma; Emilio Benfenati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 4. Â Output of the curation procedure for the EPISuiteâ„˘ solubility dataset.
m
Data from: Datasets for lot sizing and scheduling problems in the...
data.mendeley.com
narcis.nl
Updated Jan 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
Explore at:
Unique identifier
https://doi.org/10.17632/j2x3gbskfw.1
Dataset updated
Jan 19, 2021
Authors
Juan Piñeros
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D. , Morabito, R. , Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D. , Morabito, R. , Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. In this sense, some papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature in order to verify the accuracy and performance of proposed methods or to be a benchmark for other researchers considering this problem. The authors have been using small data sets that do not satisfactorily represent different scenarios of production. Since the demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the proposed methods in the scientific literature in solving real scenarios of the problem. The datasets presented here include data based on real data collected from five beverage companies. We presented four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These dataset is supplementary data for the submitted paper to Data in Brief [4]. [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. Doi: 10.1016/j.compchemeng.2020.107038. [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. Doi: 10.1007/s10696-017-9303-9. [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20. [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).
h
Seaskull
huggingface.co
Updated Sep 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaleido Singapore (2023). Seaskull [Dataset]. https://huggingface.co/datasets/KaleidoSG/Seaskull
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 26, 2023
Dataset authored and provided by
Kaleido Singapore
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Seaskull Dataset

Description

The Seaskull dataset follows the same format and purpose as the Helix dataset but contains distinct data. It has undergone a cleaning process to ensure data quality and usability.

Dataset Details

Source Dataset: Private Haribon dataset Data Cleaning: The Seaskull dataset has been cleaned to eliminate Null and NaN values, ensuring data reliability.

License

Please adhere to the licensing terms provided by the dataset… See the full description on the dataset page: https://huggingface.co/datasets/KaleidoSG/Seaskull.
m
Dataset Question Answering for Admission of Higher Education Institution
data.mendeley.com
Updated Sep 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emny Yossy (2023). Dataset Question Answering for Admission of Higher Education Institution [Dataset]. http://doi.org/10.17632/jc4df8srcb.2
Explore at:
Unique identifier
https://doi.org/10.17632/jc4df8srcb.2
Dataset updated
Sep 26, 2023
Authors
Emny Yossy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data collection process commenced with web scraping of a selected higher education institution's website, collecting any data that relates to the admission topic of higher education institutions, during the period from July to September 2023. This resulted in a raw dataset primarily cantered around admission-related content. Subsequently, meticulous data cleaning and organization procedures were implemented to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. Following this, a comprehensive annotation process was conducted to enrich the dataset with specific admission-related information, transforming it into secondary data. Both primary and secondary data predominantly remained in the Indonesian language. To enhance data quality, we added filters to remove or exclude: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This meticulous curation has culminated in the creation of a finalized dataset, meticulously prepared and now readily available for research and analysis in the domain of higher education admission.

Facebook

Twitter

Click to copy link

Link copied

Cite

kenanyafi (2024). A Journey through Data Cleaning [Dataset]. https://www.kaggle.com/datasets/kenanyafi/a-journey-through-data-cleaning

A Journey through Data Cleaning

Streamlining Data for Enhanced Analysis and Decision-Making

Explore at:

zip(0 bytes)Available download formats

Dataset updated

Mar 22, 2024

Authors

kenanyafi

Description

Embark on a transformative journey with our Data Cleaning Project, where we meticulously refine and polish raw data into valuable insights. Our project focuses on streamlining data sets, removing inconsistencies, and ensuring accuracy to unlock its full potential.

Through advanced techniques and rigorous processes, we standardize formats, address missing values, and eliminate duplicates, creating a clean and reliable foundation for analysis. By enhancing data quality, we empower organizations to make informed decisions, drive innovation, and achieve strategic objectives with confidence.

Join us as we embark on this essential phase of data preparation, paving the way for more accurate and actionable insights that fuel success."

Clear search

Close search

Google apps

Main menu

A Journey through Data Cleaning

B2B Data Cleansing Services - Verified Records - Updated Every 30 Days

Enviro-Champs Formshare Data Cleaning Tool

Data Cleansing Software Market Report | Global Forecast From 2025 To 2033

Data Cleansing Software Market Outlook

Component Analysis

Data Cleaning Sample

Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

How To Cite?

Folder Structure

Setting Up the Environment

Subfolders

1. Data_4_IJGIS

2. results_[DateTime] (e.g., results_20240906_15_00_13)

Python Files

1. helper_functions.py

2. create_sanity_plots.py

3. overlapping_sliding_window_loop.py

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

5. training_prediction.py

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

b. Training/Validation/Test Split

c. Machine and Deep Learning Experiments

d. Inference (Monitoring Part)

6. sequence_analysis.py

Licenses

Full Dataset prior to Cleaning

Data Cleaning with OpenRefine

Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users |...

Data Center Cleaning Service Report

Data Wrangling Market Analysis North America, Europe, APAC, Middle East and...

Snapshot img

LSC (Leicester Scientific Corpus)

Data to Impact on various cleaning procedures on p-GaN surfaces

Household Survey on Information and Communications Technology– 2019 - West...

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Sampling error estimates

Household Survey on Information and Communications Technology 2014 - West...

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Sampling error estimates

Surveys of Data Professionals (Alex the Analyst)

[Dataset Name] - About This Dataset

Overview

Context

Content

Acknowledgements

Citation

How to Use

MOESM4 of A new semi-automated workflow for chemical data retrieval and...

Data from: Datasets for lot sizing and scheduling problems in the...

Seaskull

Dataset Question Answering for Admission of Higher Education Institution

A Journey through Data Cleaning

Streamlining Data for Enhanced Analysis and Decision-Making