83 datasets found
  1. A Journey through Data Cleaning

    • kaggle.com
    zip
    Updated Mar 22, 2024
    Cite
    kenanyafi (2024). A Journey through Data Cleaning [Dataset]. https://www.kaggle.com/datasets/kenanyafi/a-journey-through-data-cleaning
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 22, 2024
    Authors
    kenanyafi
    Description

    Embark on a transformative journey with our Data Cleaning Project, where we meticulously refine and polish raw data into valuable insights. Our project focuses on streamlining data sets, removing inconsistencies, and ensuring accuracy to unlock its full potential.

    Through advanced techniques and rigorous processes, we standardize formats, address missing values, and eliminate duplicates, creating a clean and reliable foundation for analysis. By enhancing data quality, we empower organizations to make informed decisions, drive innovation, and achieve strategic objectives with confidence.

    Join us as we embark on this essential phase of data preparation, paving the way for more accurate and actionable insights that fuel success.

  2. B2B Data Cleansing Services - Verified Records - Updated Every 30 Days

    • datarade.ai
    Updated Jan 8, 2022
    Cite
    Thomson Data (2022). B2B Data Cleansing Services - Verified Records - Updated Every 30 Days [Dataset]. https://datarade.ai/data-products/thomson-data-hr-data-reach-hr-professionals-across-the-world-thomson-data
    Available download formats: .csv, .xls, .sql, .txt
    Dataset updated
    Jan 8, 2022
    Dataset authored and provided by
    Thomson Data
    Area covered
    Panama, Zimbabwe, Czech Republic, Palau, Denmark, Andorra, Micronesia (Federated States of), Bulgaria, Finland, Eritrea
    Description

    At Thomson Data, we help businesses clean up and manage messy B2B databases to ensure they are up-to-date, correct, and detailed. We believe your sales development representatives and marketing representatives should focus on building meaningful relationships with prospects, not scrubbing through bad data.

    Here are the key steps involved in our B2B data cleansing process:

    1. Data Auditing: We begin with a thorough audit of the database to identify errors, gaps, and inconsistencies, which mainly involves identifying outdated, incomplete, and duplicate information.

    2. Data Standardization: Ensuring consistency in the data records is one of our prime services; it includes standardizing job titles, addresses, and company names. It ensures that they can be easily shared and used by different teams.

    3. Data Deduplication: Another way we improve efficiency is by removing all duplicate records. Data deduplication is important in a large B2B dataset as multiple records from the same company may exist in the database.

    4. Data Enrichment: After the first three steps, we enrich your data, fill in the missing details, and then enhance the database with up-to-date records. This is the step that ensures the database is valuable, providing insights that are actionable and complete.

    What are the Key Benefits of Keeping the Data Clean with Thomson Data’s B2B Data Cleansing Service? Once you understand the benefits of our data cleansing service, it will entice you to optimize your data management practices, and it will additionally help you stay competitive in today’s data-driven market.

    Here are some advantages of maintaining a clean database with Thomson Data:

    1. Better ROI for your Sales and Marketing Campaigns: Our clean data will magnify your precise targeting, enabling you to strategize for effective campaigns, increased conversion rate, and ROI.

    2. Compliant with Data Regulations:
      The B2B data cleansing services we provide are compliant with global data norms.

    3. Streamline Operations: Your efforts are directed in the right channel when your data is clean and accurate, as your team doesn’t have to spend their valuable time fixing errors.

    To summarize, we would again bring your attention to how accurate data is essential for driving sales and marketing in a B2B environment. It enhances your business prowess in the avenues of decision-making and customer relationships. Therefore, it is better to have a proactive approach toward B2B data cleansing service and outsource our offerings to stay competitive by unlocking the full potential of your data.

    Send us a request and we will be happy to assist you.

  3. Data Cleansing Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Data Cleansing Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-cleansing-software-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Cleansing Software Market Outlook



    The global data cleansing software market size was valued at approximately USD 1.5 billion in 2023 and is projected to reach around USD 4.2 billion by 2032, exhibiting a compound annual growth rate (CAGR) of 12.5% during the forecast period. This substantial growth can be attributed to the increasing importance of maintaining clean and reliable data for business intelligence and analytics, which are driving the adoption of data cleansing solutions across various industries.



    The proliferation of big data and the growing emphasis on data-driven decision-making are significant growth factors for the data cleansing software market. As organizations collect vast amounts of data from multiple sources, ensuring that this data is accurate, consistent, and complete becomes critical for deriving actionable insights. Data cleansing software helps organizations eliminate inaccuracies, inconsistencies, and redundancies, thereby enhancing the quality of their data and improving overall operational efficiency. Additionally, the rising adoption of advanced analytics and artificial intelligence (AI) technologies further fuels the demand for data cleansing software, as clean data is essential for the accuracy and reliability of these technologies.



    Another key driver of market growth is the increasing regulatory pressure for data compliance and governance. Governments and regulatory bodies across the globe are implementing stringent data protection regulations, such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States. These regulations mandate organizations to ensure the accuracy and security of the personal data they handle. Data cleansing software assists organizations in complying with these regulations by identifying and rectifying inaccuracies in their data repositories, thus minimizing the risk of non-compliance and hefty penalties.



    The growing trend of digital transformation across various industries also contributes to the expanding data cleansing software market. As businesses transition to digital platforms, they generate and accumulate enormous volumes of data. To derive meaningful insights and maintain a competitive edge, it is imperative for organizations to maintain high-quality data. Data cleansing software plays a pivotal role in this process by enabling organizations to streamline their data management practices and ensure the integrity of their data. Furthermore, the increasing adoption of cloud-based solutions provides additional impetus to the market, as cloud platforms facilitate seamless integration and scalability of data cleansing tools.



    Regionally, North America holds a dominant position in the data cleansing software market, driven by the presence of numerous technology giants and the rapid adoption of advanced data management solutions. The region is expected to continue its dominance during the forecast period, supported by the strong emphasis on data quality and compliance. Europe is also a significant market, with countries like Germany, the UK, and France showing substantial demand for data cleansing solutions. The Asia Pacific region is poised for significant growth, fueled by the increasing digitalization of businesses and the rising awareness of data quality's importance. Emerging economies in Latin America and the Middle East & Africa are also expected to witness steady growth, driven by the growing adoption of data-driven technologies.



    The role of Data Quality Tools cannot be overstated in the context of data cleansing software. These tools are integral in ensuring that the data being processed is not only clean but also of high quality, which is crucial for accurate analytics and decision-making. Data Quality Tools help in profiling, monitoring, and cleansing data, thereby ensuring that organizations can trust their data for strategic decisions. As organizations increasingly rely on data-driven insights, the demand for robust Data Quality Tools is expected to rise. These tools offer functionalities such as data validation, standardization, and enrichment, which are essential for maintaining the integrity of data across various platforms and applications. The integration of these tools with data cleansing software enhances the overall data management capabilities of organizations, enabling them to achieve greater operational efficiency and compliance with data regulations.



    Component Analysis



    The data cle

  4. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  5. Enviro-Champs Formshare Data Cleaning Tool

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Udhav Maharaj (2024). Enviro-Champs Formshare Data Cleaning Tool [Dataset]. http://doi.org/10.7910/DVN/EA5MOI
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Udhav Maharaj
    Time period covered
    Jan 1, 2023 - Jan 1, 2024
    Description

    A data cleaning tool customised for cleaning and sorting the data generated during the Enviro-Champs pilot study as they are downloaded from Formshare, the platform capturing data sent from a customised ODK Collect form collection app. The dataset includes the latest data from the pilot study as at 14 May 2024.

  6. Restaurant Sales-Dirty Data for Cleaning Training

    • kaggle.com
    Updated Jan 25, 2025
    Cite
    Ahmed Mohamed (2025). Restaurant Sales-Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/restaurant-sales-dirty-data-for-cleaning-training
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Restaurant Sales Dataset with Dirt Documentation

    Overview

    The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.

    Dataset Use Cases

    This dataset is suitable for:
    • Practicing data cleaning tasks, such as handling missing values and deducing missing information.
    • Conducting exploratory data analysis (EDA) to study restaurant sales patterns.
    • Feature engineering to create new variables for machine learning tasks.

    Columns Description

    Column Name | Description | Example Values
    Order ID | A unique identifier for each order. | ORD_123456
    Customer ID | A unique identifier for each customer. | CUST_001
    Category | The category of the purchased item. | Main Dishes, Drinks
    Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None
    Price | The static price of the item. May contain missing values. | 15.0, None
    Quantity | The quantity of the purchased item. May contain missing values. | 1, None
    Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None
    Order Date | The date when the order was placed. Always present. | 2022-01-15
    Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None

    Key Characteristics

    1. Data Dirtiness:

      • Missing values in key columns (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
      • At least one of the following conditions is ensured for each record to identify an item:
        • Item is present.
        • Price is present.
        • Both Quantity and Order Total are present.
      • If Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).
    2. Menu Categories and Items:

      • Items are divided into five categories:
        • Starters: E.g., Chicken Melt, French Fries.
        • Main Dishes: E.g., Grilled Chicken, Steak.
        • Desserts: E.g., Chocolate Cake, Ice Cream.
        • Drinks: E.g., Coca Cola, Water.
        • Side Dishes: E.g., Mashed Potatoes, Garlic Bread.

    3. Time Range: Orders span from January 1, 2022, to December 31, 2023.

    Cleaning Suggestions

    1. Handle Missing Values:

      • Fill missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
      • Deduce missing Price from Order Total / Quantity if both are available (a pandas sketch follows this list).
    2. Validate Data Consistency:

      • Ensure that calculated values (Order Total = Price * Quantity) match.
    3. Analyze Missing Patterns:

      • Study the distribution of missing values across categories and payment methods.
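
    A minimal pandas sketch of the fills and checks suggested above (not the dataset author's code; the file name is an assumption, column names follow the table above):

    ```python
    import pandas as pd

    # Hypothetical load of the dataset; the file name is an assumption.
    df = pd.read_csv("restaurant_sales_dirty.csv")

    # Deduce missing Price from Order Total / Quantity when both are present.
    m = df["Price"].isna() & df["Order Total"].notna() & df["Quantity"].notna()
    df.loc[m, "Price"] = df.loc[m, "Order Total"] / df.loc[m, "Quantity"]

    # Fill missing Order Total using Order Total = Price * Quantity.
    m = df["Order Total"].isna() & df["Price"].notna() & df["Quantity"].notna()
    df.loc[m, "Order Total"] = df.loc[m, "Price"] * df.loc[m, "Quantity"]

    # Consistency check: recorded totals should match Price * Quantity.
    complete = df[["Price", "Quantity", "Order Total"]].notna().all(axis=1)
    bad = complete & ((df["Price"] * df["Quantity"] - df["Order Total"]).abs() > 1e-6)
    print(f"{bad.sum()} inconsistent rows")
    ```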

    Menu Map with Prices and Categories

    Category | Item | Price
    Starters | Chicken Melt | 8.0
    Starters | French Fries | 4.0
    Starters | Cheese Fries | 5.0
    Starters | Sweet Potato Fries | 5.0
    Starters | Beef Chili | 7.0
    Starters | Nachos Grande | 10.0
    Main Dishes | Grilled Chicken | 15.0
    Main Dishes | Steak | 20.0
    Main Dishes | Pasta Alfredo | 12.0
    Main Dishes | Salmon | 18.0
    Main Dishes | Vegetarian Platter | 14.0
    Desserts | Chocolate Cake | 6.0
    Desserts | Ice Cream | 5.0
    Desserts | Fruit Salad | 4.0
    Desserts | Cheesecake | 7.0
    Desserts | Brownie | 6.0
    Drinks | Coca Cola | 2.5
    Drinks | Orange Juice | 3.0
    Drinks | ...
  7. Data Cleansing Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 23, 2025
    Cite
    Archive Market Research (2025). Data Cleansing Software Report [Dataset]. https://www.archivemarketresearch.com/reports/data-cleansing-software-44630
    Available download formats: ppt, doc, pdf
    Dataset updated
    Feb 23, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data cleansing software market is expanding rapidly, with a market size of XXX million in 2023 and a projected CAGR of XX% from 2023 to 2033. This growth is driven by the increasing need for accurate and reliable data in various industries, including healthcare, finance, and retail. Key market trends include the growing adoption of cloud-based solutions, the increasing use of artificial intelligence (AI) and machine learning (ML) to automate the data cleansing process, and the increasing demand for data governance and compliance. The market is segmented by deployment type (cloud-based vs. on-premise) and application (large enterprises vs. SMEs vs. government agencies). Major players in the market include IBM, SAS Institute Inc, SAP SE, Trifacta, OpenRefine, Data Ladder, Analytics Canvas (nModal Solutions Inc.), Mo-Data, Prospecta, WinPure Ltd, Symphonic Source Inc, MuleSoft, MapR Technologies, V12 Data, and Informatica. This report provides a comprehensive overview of the global data cleansing software market, with a focus on market concentration, product insights, regional insights, trends, driving forces, challenges and restraints, growth catalysts, leading players, and significant developments.

  8. Data Cleansing Tools Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Data Cleansing Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-cleansing-tools-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Cleansing Tools Market Outlook



    The global data cleansing tools market size was valued at approximately USD 1.5 billion in 2023 and is projected to reach USD 4.2 billion by 2032, growing at a CAGR of 12.1% from 2024 to 2032. One of the primary growth factors driving the market is the increasing need for high-quality data in various business operations and decision-making processes.
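
    For reference, the CAGR implied by two endpoint values over n years can be checked as follows (a generic arithmetic sketch, not the report's methodology):

    ```python
    def cagr(start_value, end_value, years):
        """Compound annual growth rate between two endpoint values."""
        return (end_value / start_value) ** (1 / years) - 1

    # Endpoints quoted above: USD 1.5 billion in 2023 to USD 4.2 billion in 2032 (9 years).
    print(f"{cagr(1.5, 4.2, 2032 - 2023):.1%}")  # ~12.1%
    ```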



    The surge in big data and the subsequent increased reliance on data analytics are significant factors propelling the growth of the data cleansing tools market. Organizations increasingly recognize the value of high-quality data in driving strategic initiatives, customer relationship management, and operational efficiency. The proliferation of data generated across different sectors such as healthcare, finance, retail, and telecommunications necessitates the adoption of tools that can clean, standardize, and enrich data to ensure its reliability and accuracy.



    Furthermore, the rising adoption of Machine Learning (ML) and Artificial Intelligence (AI) technologies has underscored the importance of clean data. These technologies rely heavily on large datasets to provide accurate and reliable insights. Any errors or inconsistencies in data can lead to erroneous outcomes, making data cleansing tools indispensable. Additionally, regulatory and compliance requirements across various industries necessitate the maintenance of clean and accurate data, further driving the market for data cleansing tools.



    The growing trend of digital transformation across industries is another critical growth factor. As businesses increasingly transition from traditional methods to digital platforms, the volume of data generated has skyrocketed. However, this data often comes from disparate sources and in various formats, leading to inconsistencies and errors. Data cleansing tools are essential in such scenarios to integrate data from multiple sources and ensure its quality, thus enabling organizations to derive actionable insights and maintain a competitive edge.



    In the context of ensuring data reliability and accuracy, Data Quality Software and Solutions play a pivotal role. These solutions are designed to address the challenges associated with managing large volumes of data from diverse sources. By implementing robust data quality frameworks, organizations can enhance their data governance strategies, ensuring that data is not only clean but also consistent and compliant with industry standards. This is particularly crucial in sectors where data-driven decision-making is integral to business success, such as finance and healthcare. The integration of advanced data quality solutions helps businesses mitigate risks associated with poor data quality, thereby enhancing operational efficiency and strategic planning.



    Regionally, North America is expected to hold the largest market share due to the early adoption of advanced technologies, robust IT infrastructure, and the presence of key market players. Europe is also anticipated to witness substantial growth due to stringent data protection regulations and the increasing adoption of data-driven decision-making processes. Meanwhile, the Asia Pacific region is projected to experience the highest growth rate, driven by the rapid digitalization of emerging economies, the expansion of the IT and telecommunications sector, and increasing investments in data management solutions.



    Component Analysis



    The data cleansing tools market is segmented into software and services based on components. The software segment is anticipated to dominate the market due to its extensive use in automating the data cleansing process. The software solutions are designed to identify, rectify, and remove errors in data sets, ensuring data accuracy and consistency. They offer various functionalities such as data profiling, validation, enrichment, and standardization, which are critical in maintaining high data quality. The high demand for these functionalities across various industries is driving the growth of the software segment.



    On the other hand, the services segment, which includes professional services and managed services, is also expected to witness significant growth. Professional services such as consulting, implementation, and training are crucial for organizations to effectively deploy and utilize data cleansing tools. As businesses increasingly realize the importance of clean data, the demand for expert

  9. The mean preservation of data (PD), sensitivity, specificity and convergence...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). The mean preservation of data (PD), sensitivity, specificity and convergence rate across different rates and types of simulated errors and duplications of uncleaned, de-duplicated and data cleaned with five data cleaning approaches with and without our algorithm (A) for longitudinal growth measurements from CLOSER data. [Dataset]. http://doi.org/10.1371/journal.pone.0228154.t006
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mean preservation of data (PD), sensitivity, specificity and convergence rate across different rates and types of simulated errors and duplications of uncleaned, de-duplicated and data cleaned with five data cleaning approaches with and without our algorithm (A) for longitudinal growth measurements from CLOSER data.

  10. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Available download formats: html, zip, pdf
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis explained in this paper already begins with the synchronization and cleaning of the recorded raw data; the published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted from them are given in this directory. If you want to perform the segmentation and feature extraction yourself, run the respective Python files; if not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4 (a rough sketch of the segmentation idea follows this file list).
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data.

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: all the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:


    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.
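
    As a rough, hypothetical illustration of the overlapping sliding-window segmentation performed by overlapping_sliding_window_loop.py (item 3 above) — not the authors' code; the file name and "timestamp" column are assumptions:

    ```python
    import pandas as pd

    def sliding_windows(df, window_s, step_s=1.0):
        """Yield overlapping segments of window_s seconds, advancing by step_s seconds."""
        start, t_end = df["timestamp"].min(), df["timestamp"].max()
        while start + window_s <= t_end:
            yield df[(df["timestamp"] >= start) & (df["timestamp"] < start + window_s)]
            start += step_s

    recording = pd.read_csv("gaze_sample.csv")  # assumed recording with a "timestamp" column in seconds
    for window_s in range(2, 11):               # window sizes of 2-10 s, as in the default configuration
        for i, segment in enumerate(sliding_windows(recording, window_s)):
            segment.to_csv(f"segment_w{window_s}s_{i}.csv", index=False)
    ```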

    Licenses

    The data is licensed under CC-BY, the code is licensed under MIT.

  11. Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users |...

    • datarade.ai
    .json, .csv, .xls
    Updated Mar 21, 2025
    Cite
    Quadrant (2025). Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month [Dataset]. https://datarade.ai/data-products/mobile-location-data-asia-300m-unique-devices-100m-da-quadrant
    Available download formats: .json, .csv, .xls
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Quadrant
    Area covered
    Korea (Democratic People's Republic of), Kyrgyzstan, Philippines, Armenia, Israel, Palestine, Bahrain, Iran (Islamic Republic of), Oman, Georgia, Asia
    Description

    Quadrant provides Insightful, accurate, and reliable mobile location data.

    Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.

    These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.

    We conduct stringent evaluations of data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points, allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors, as well as latency and other integrity variables, to provide more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. This algorithm scours our data and identifies rows that contain the same combination of these four attributes. Post-identification, it retains a single copy and eliminates duplicate values to ensure our customers only receive complete and unique datasets.
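
    A minimal sketch of the deduplication idea described above, assuming the feed is loaded as a pandas DataFrame (column names and file name are assumptions, not Quadrant's implementation):

    ```python
    import pandas as pd

    events = pd.read_csv("location_events.csv")  # hypothetical event feed

    # Keep one copy of rows sharing Device ID, Latitude, Longitude and Timestamp.
    deduped = events.drop_duplicates(
        subset=["device_id", "latitude", "longitude", "timestamp"], keep="first"
    )
    print(f"Removed {len(events) - len(deduped)} duplicate events")
    ```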

    We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.

    Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.

    Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.

  12. The mean, standard deviation, preservation of data (PD), sensitivity and...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). The mean, standard deviation, preservation of data (PD), sensitivity and specificity of five data cleaning approaches with and without an algorithm (A) compared to uncleaned longitudinal growth measurements in CLOSER data with and without simulated duplications and 1% errors. [Dataset]. http://doi.org/10.1371/journal.pone.0228154.t004
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mean, standard deviation, preservation of data (PD), sensitivity and specificity of five data cleaning approaches with and without an algorithm (A) compared to uncleaned longitudinal growth measurements in CLOSER data with and without simulated duplications and 1% errors.

  13. Household Expenditure and Income Survey 2010, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    The Hashemite Kingdom of Jordan Department of Statistics (DOS) (2019). Household Expenditure and Income Survey 2010, Economic Research Forum (ERF) Harmonization Data - Jordan [Dataset]. https://catalog.ihsn.org/index.php/catalog/7662
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    The Hashemite Kingdom of Jordan Department of Statistics (DOS)
    Time period covered
    2010 - 2011
    Area covered
    Jordan
    Description

    Abstract

    The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.

    Data collected through the survey helped in achieving the following objectives:
    1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
    2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
    3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
    4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
    5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
    6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps
    7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty

    Geographic coverage

    National

    Analysis unit

    • Households
    • Individuals

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The Household Expenditure and Income survey sample for 2010, was designed to serve the basic objectives of the survey through providing a relatively large sample in each sub-district to enable drawing a poverty map in Jordan. The General Census of Population and Housing in 2004 provided a detailed framework for housing and households for different administrative levels in the country. Jordan is administratively divided into 12 governorates, each governorate is composed of a number of districts, each district (Liwa) includes one or more sub-district (Qada). In each sub-district, there are a number of communities (cities and villages). Each community was divided into a number of blocks. Where in each block, the number of houses ranged between 60 and 100 houses. Nomads, persons living in collective dwellings such as hotels, hospitals and prison were excluded from the survey framework.

    A two stage stratified cluster sampling technique was used. In the first stage, a cluster sample proportional to the size was uniformly selected, where the number of households in each cluster was considered the weight of the cluster. At the second stage, a sample of 8 households was selected from each cluster, in addition to another 4 households selected as a backup for the basic sample, using a systematic sampling technique. Those 4 households were sampled to be used during the first visit to the block in case the visit to the original household selected is not possible for any reason. For the purposes of this survey, each sub-district was considered a separate stratum to ensure the possibility of producing results on the sub-district level. In this respect, the survey framework adopted that provided by the General Census of Population and Housing Census in dividing the sample strata. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable provided in the Household Expenditure and Income Survey for the year 2008 was calculated for each sub-district. These results were used to estimate the sample size on the sub-district level so that the coefficient of variation for the expenditure variable in each sub-district is less than 10%, at a minimum, of the number of clusters in the same sub-district (6 clusters). This is to ensure adequate presentation of clusters in different administrative areas to enable drawing an indicative poverty map.

    It should be noted that in addition to the standard non response rate assumed, higher rates were expected in areas where poor households are concentrated in major cities. Therefore, those were taken into consideration during the sampling design phase, and a higher number of households were selected from those areas, aiming at well covering all regions where poverty spreads.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    • General form
    • Expenditure on food commodities form
    • Expenditure on non-food commodities form

    Cleaning operations

    Raw Data:
    • Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to different rounds throughout the year. A registry was prepared to indicate the different stages of the process of data checking, coding and entry until forms were returned to the archive system.
    • Data office checking: This phase was carried out concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
    • Data coding: A team was trained to work on the data coding phase, which in this survey is limited to education specialization, profession and economic activity. International classifications were used for these, while for the rest of the questions coding was predefined during the design phase.
    • Data entry/validation: A team consisting of system analysts, programmers and data entry personnel worked on the data at this stage. System analysts and programmers started by identifying the survey framework and questionnaire fields to help build computerized data entry forms. A set of validation rules was added to the entry forms to ensure the accuracy of data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms were correctly extracted and put back into the archive system. A data validation process was run on the data to ensure the data entered was free of errors.
    • Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey final results. Those results were further checked using similar outputs from SPSS to ensure that the tabulations produced were correct. A check was also run on each table to guarantee consistency of the figures presented, together with required editing of table titles and report formatting.

    Harmonized Data:
    • The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
    • The harmonization process started with cleaning all raw data files received from the Statistical Office.
    • Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
    • A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
    • A post-harmonization cleaning process was run on the data.
    • Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.

  14. Best IPL Data Set

    • kaggle.com
    Updated Sep 14, 2020
    Cite
    Subhodeep Das (2020). Best IPL Data Set [Dataset]. https://www.kaggle.com/datasets/theuniversesd/ipl-data
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 14, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Subhodeep Das
    Description

    Dataset

    This dataset was created by Subhodeep Das

    Released under Other (specified in description)


  15. Data Cleaning with OpenRefine

    • explore.openaire.eu
    Updated Nov 9, 2020
    Cite
    Hao Ye (2020). Data Cleaning with OpenRefine [Dataset]. http://doi.org/10.5281/zenodo.6863001
    Dataset updated
    Nov 9, 2020
    Authors
    Hao Ye
    Description

    OpenRefine (formerly Google Refine) is a powerful free and open source tool for data cleaning, enabling you to correct errors in the data, and make sure that the values and formatting are consistent. In addition, OpenRefine records your processing steps, enabling you to apply the same cleaning procedure to other data, and enhancing the reproducibility of your analysis. This workshop will teach you to use OpenRefine to clean and format data and automatically track any changes that you make.

  16. The percentage of gold standard corrections of errors induced into CLOSER...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). The percentage of gold standard corrections of errors induced into CLOSER data with simulated duplications and 1% errors using the algorithmic data cleaning methods. [Dataset]. http://doi.org/10.1371/journal.pone.0228154.t005
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The percentage of gold standard corrections of errors induced into CLOSER data with simulated duplications and 1% errors using the algorithmic data cleaning methods.

  17. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the meaning of research texts and is made available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
    5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
    6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online

    The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R

    The LSC was collected as TXT files. All documents were imported into R.

    Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories were removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts

    Medicine-related publications in particular use ‘structured abstracts’: abstracts divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates these section headings with the first word of the section, producing words such as ‘ConclusionHigher’ and ‘ConclusionsRT’. Such words were detected and identified by sampling medicine-related publications with human intervention, and each detected concatenated word was split into two words; for instance, ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings found in such abstracts are listed below:

    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

    After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as Microsoft Word’s ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Publications can include, below the text of the abstract, a footer with a copyright notice, permission policy, journal name, licence, authors’ rights or conference name added by conferences and journals. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text; for example, casual observation shows that copyright notices such as ‘Published by Elsevier ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis (such as bias in frequency calculations), we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licences and permission policies identified by sampling of abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

    The cleaning procedure described in the previous step led to some abstracts having fewer words than our minimum length criterion (30 words). 474 texts were removed.

    Step 8: Saving the Dataset into CSV Format

    Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
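
    The corpus was processed in R; purely as a hypothetical illustration of the concatenated-word correction in Step 4, a Python sketch might look like this (the heading list and regex are assumptions, not the author's procedure):

    ```python
    import re

    # A few of the section headings listed in Step 4 that can be fused with the next word.
    HEADINGS = ["Background", "Methods", "Results", "Objective", "Objectives",
                "Introduction", "Discussion", "Conclusions", "Conclusion", "Findings"]

    # Split e.g. "ConclusionHigher" into "Conclusion Higher".
    pattern = re.compile(r"\b(" + "|".join(HEADINGS) + r")([A-Z][a-z]+)")

    def split_fused_headings(text: str) -> str:
        return pattern.sub(r"\1 \2", text)

    print(split_fused_headings("ConclusionHigher values were observed."))
    # -> "Conclusion Higher values were observed."
    ```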

  18. Cleaned Retail Customer Dataset (SQL-based ETL)

    • kaggle.com
    Updated May 3, 2025
    Cite
    Rizwan Bin Akbar (2025). Cleaned Retail Customer Dataset (SQL-based ETL) [Dataset]. https://www.kaggle.com/datasets/rizwanbinakbar/cleaned-retail-customer-dataset-sql-based-etl/versions/2
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rizwan Bin Akbar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:

    1. Customer Information (s_crm_cust_info)

    This table contains information about customers, including their unique identifiers and demographic details.

    Columns:
    
      cst_id: Customer ID (Primary Key)
    
      cst_gndr: Gender
    
      cst_marital_status: Marital status
    
      cst_create_date: Customer account creation date
    
    Cleaning Steps:
    
      Removed duplicates and handled missing or null cst_id values.
    
      Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
    
      Standardized gender values and identified inconsistencies in marital status.
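    A small pandas sketch of these steps is shown below. It is an illustrative re-creation only, since the original ETL was SQL-based; the file name and the gender mapping values are assumptions:

      import pandas as pd

      cust = pd.read_csv("s_crm_cust_info.csv")       # assumed file name

      cust = cust.dropna(subset=["cst_id"])            # drop rows missing the key
      cust = cust.drop_duplicates(subset=["cst_id"])   # one record per customer

      for col in ["cst_gndr", "cst_marital_status"]:
          cust[col] = cust[col].str.strip()            # trim stray whitespace

      # Standardize gender values (mapping shown is an assumed example).
      cust["cst_gndr"] = cust["cst_gndr"].replace({"M": "Male", "F": "Female"})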
    
    2. Product Information (s_crm_prd_info / b_crm_prd_info)

    This table contains information about products, including product identifiers, names, costs, and lifecycle dates.

    Columns:
    
      prd_id: Product ID
    
      prd_key: Product key
    
      prd_nm: Product name
    
      prd_cost: Product cost
    
      prd_start_dt: Product start date
    
      prd_end_dt: Product end date
    
    Cleaning Steps:
    
      Checked for duplicates and null values in the prd_key column.
    
      Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
    
      Corrected product costs to remove invalid entries (e.g., negative values).
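    An illustrative pandas version of the date and cost validations (the original checks were SQL queries; the file name is assumed):

      import pandas as pd

      prd = pd.read_csv("s_crm_prd_info.csv",
                        parse_dates=["prd_start_dt", "prd_end_dt"])

      # Flag duplicate or missing product keys for separate review.
      suspect_keys = prd[prd.duplicated(subset=["prd_key"]) | prd["prd_key"].isna()]

      # Keep only rows whose start date precedes the end date.
      prd = prd[prd["prd_start_dt"] < prd["prd_end_dt"]]

      # Treat negative costs as invalid entries (replaced with NA here).
      prd.loc[prd["prd_cost"] < 0, "prd_cost"] = pd.NA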
    
    3. Sales Details (s_crm_sales_details / b_crm_sales_details)

    This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.

    Columns:
    
      sls_order_dt: Sales order date
    
      sls_due_dt: Sales due date
    
      sls_sales: Total sales amount
    
      sls_quantity: Number of products sold
    
      sls_price: Product unit price
    
    Cleaning Steps:
    
      Validated sales order dates and corrected invalid entries.
    
      Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
    
      Removed null and negative values from sls_sales, sls_quantity, and sls_price.
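    The integrity rule described above (sales should equal price times quantity) can be expressed as a short check; the pandas sketch below is illustrative, with the original work done in SQL and the file name assumed:

      import pandas as pd

      sales = pd.read_csv("s_crm_sales_details.csv")
      cols = ["sls_sales", "sls_quantity", "sls_price"]

      # Drop rows with null or non-positive sales, quantity, or price.
      sales = sales.dropna(subset=cols)
      sales = sales[(sales[cols] > 0).all(axis=1)]

      # Where the stored amount disagrees with price * quantity, recompute it.
      expected = sales["sls_price"] * sales["sls_quantity"]
      mismatch = sales["sls_sales"] != expected
      sales.loc[mismatch, "sls_sales"] = expected[mismatch]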
    
    4. ERP Customer Data (b_erp_cust_az12, s_erp_cust_az12)

    This table contains additional customer demographic data, including gender and birthdate.

    Columns:
    
      cid: Customer ID
    
      gen: Gender
    
      bdate: Birthdate
    
    Cleaning Steps:
    
      Checked for missing or null gender values and standardized inconsistent entries.
    
      Removed leading/trailing spaces from gen and bdate.
    
      Validated birthdates to ensure they were within a realistic range.
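    A brief pandas sketch of these demographic checks; the source does not state the exact "realistic range" for birthdates, so the 100-year window below is an assumption, as are the file name and gender mapping:

      import pandas as pd

      erp = pd.read_csv("s_erp_cust_az12.csv", parse_dates=["bdate"])

      # Standardize gender entries (mapping values are assumed examples).
      erp["gen"] = erp["gen"].str.strip().replace({"M": "Male", "F": "Female"})

      # Keep birthdates within a plausible window (bounds are illustrative).
      today = pd.Timestamp.today()
      erp = erp[(erp["bdate"] >= today - pd.DateOffset(years=100))
                & (erp["bdate"] <= today)]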
    
    5. Location Information (b_erp_loc_a101)

    This table contains country information related to the customers' locations.

    Columns:
    
      cntry: Country
    
    Cleaning Steps:
    
      Standardized country names (e.g., "US" and "USA" were mapped to "United States").
    
      Removed special characters (e.g., carriage returns) and trimmed whitespace.
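    The country standardization can be captured in one chained transformation; the mapping below covers only the "US"/"USA" example mentioned above, with other variants handled analogously, and the file name is assumed:

      import pandas as pd

      loc = pd.read_csv("b_erp_loc_a101.csv")

      # Strip carriage returns and whitespace, then map common variants.
      loc["cntry"] = (loc["cntry"]
                      .str.replace("\r", "", regex=False)
                      .str.strip()
                      .replace({"US": "United States", "USA": "United States"}))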
    
    6. Product Category (b_erp_px_cat_g1v2)

    This table contains product category information.

    Columns:
    
      Product category data (no significant cleaning required).
    

    Key Features:

    Customer demographics, including gender and marital status
    
    Product details such as cost, start date, and end date
    
    Sales data with order dates, quantities, and sales amounts
    
    ERP-specific customer and location data
    

    Data Cleaning Process:

    This dataset underwent extensive cleaning and validation, including:

    Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
    
    Date Validations: Ensuring correct date ranges and chronological consistency.
    
    Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
    
    Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
    

    This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.

  19. Household Survey on Information and Communications Technology– 2019 - West...

    • pcbs.gov.ps
    Updated Mar 16, 2020
    + more versions
    Cite
    Palestinian Central Bureau of Statistics (2020). Household Survey on Information and Communications Technology– 2019 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/489
    Explore at:
    Dataset updated
    Mar 16, 2020
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (http://pcbs.gov.ps/)
    Time period covered
    2019
    Area covered
    West Bank, Gaza, Gaza Strip
    Description

    Abstract

    Palestinian society's access to information and communication technology tools is one of the main inputs for achieving social development and economic change, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics to provide official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey on information and communications technology for the year 2019. The main objective of this report is to present trends in access to and use of information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.

    Geographic coverage

    Palestine, West Bank, Gaza strip

    Analysis unit

    Household, Individual

    Universe

    All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling Frame: The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units, with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.

    Sample Size: The estimated sample size is 8,040 households.

    Sample Design: The sample is a three-stage stratified cluster (PPS) sample. Stage 1: selection of a stratified sample of 536 enumeration areas using the PPS method. Stage 2: selection of a stratified random sample of 15 households from each enumeration area selected in the first stage (536 areas x 15 households gives the 8,040-household sample). Stage 3: selection of one person aged 10 years and above at random using Kish tables.

    Sample Strata: The population was divided by: (1) governorate (16 governorates, with Jerusalem considered as two statistical areas); (2) type of locality (urban, rural, refugee camps).

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Questionnaire: The survey questionnaire consists of identification data, quality controls, and three main sections. Section I: Data on household members, including identification fields and the demographic and social characteristics of household members, such as the relationship of each individual to the head of household, sex, date of birth, and age.

    Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section also includes information on topics related to the use of computers and the Internet, households' supervision of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household at home.

    Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.

    Cleaning operations

    Programming Consistency Check: The data collection program was designed in accordance with the questionnaire's design and skip patterns. The program was examined more than once by project management before the training course was conducted; notes and modifications were incorporated into the program by the Data Processing Department, which ensured it was free of errors before fieldwork began.

    Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.

    In order to work in parallel in Jerusalem (J1), a data entry program was developed using the same technology and the same database as the PC-tablet devices.

    Data Cleaning: After completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS to extract and correct errors and discrepancies, preparing clean, accurate data ready for tabulation and publishing.

    Tabulation: After data checking and cleaning were finalized, tables were extracted according to the prepared list of tables.

    Response rate

    The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.

    Sampling error estimates

    Sampling Errors: The data of this survey are affected by sampling errors because a sample was used rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values that would be obtained through a census. Variances were calculated for the most important indicators; results can be disseminated at the national level and at the level of the West Bank and Gaza Strip without difficulty.

    Non-Sampling Errors: Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors, and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively; they were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.

    The survey encountered non-response, with the most common case being households that were not at home during the fieldwork visit. The total non-response rate reached 17.5%. The refusal rate was 2.9%, which is relatively low compared with other household surveys conducted by PCBS, reflecting the clarity of the survey questionnaire.

  20. Olist Cleaned files for MYSQL Data Base

    • kaggle.com
    Updated Aug 4, 2024
    Cite
    Bhanu prasad Chouki (2024). Olist Cleaned files for MYSQL Data Base [Dataset]. https://www.kaggle.com/datasets/bhanuprasadchouki/olist-cleaned-files
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhanu prasad Chouki
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description: Clean and Ready for Relational Database Import
    This dataset is a comprehensive collection of well-structured, meticulously cleaned data prepared for seamless integration into a relational database. It has undergone thorough data cleansing to remove inconsistencies, missing values, and duplicate records, which guarantees a smooth and efficient data analysis experience without the need for additional preprocessing steps.
