At Thomson Data, we help businesses clean up and manage messy B2B databases so they stay accurate, up to date, and complete. We believe your sales development representatives and marketers should focus on building meaningful relationships with prospects, not scrubbing through bad data.
Here are the key steps involved in our B2B data cleansing process:
Data Auditing: We begin with a thorough audit of the database to identify errors, gaps, and inconsistencies, focusing primarily on outdated, incomplete, and duplicate information.
Data Standardization: Ensuring consistency across data records is one of our core services; it includes standardizing job titles, addresses, and company names so that records can be easily shared and used by different teams.
Data Deduplication: Another way we improve efficiency is by removing duplicate records. Deduplication matters in large B2B datasets because multiple records for the same company often exist in the database.
Data Enrichment: After the first three steps, we enrich your data by filling in missing details and updating records, ensuring the database delivers complete, actionable insights.
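To make the standardization and deduplication steps concrete, here is a minimal, hypothetical pandas sketch. The column names, suffix list, and matching rule are illustrative assumptions, not Thomson Data's actual pipeline.

```python
import pandas as pd

# Hypothetical B2B contact records; column names are illustrative assumptions.
records = pd.DataFrame({
    "company":   ["Acme Corp.", "ACME Corporation", "Globex Inc"],
    "job_title": ["VP Sales", "V.P. of Sales", "CTO"],
    "email":     ["jane@acme.com", "jane@acme.com", "ed@globex.com"],
})

# Standardization: normalize case, punctuation, and common legal suffixes.
def standardize_company(name: str) -> str:
    name = name.lower().strip().rstrip(".")
    for suffix in (" corporation", " corp", " inc", " ltd"):
        name = name.removesuffix(suffix)
    return name.title()

records["company_std"] = records["company"].map(standardize_company)

# Deduplication: treat records sharing the same email as one contact, keep the first.
deduped = records.drop_duplicates(subset=["email"], keep="first")
print(deduped)
```

A real cleansing workflow would add fuzzy matching and enrichment from reference sources; this sketch only shows the basic normalize-then-deduplicate pattern.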
What are the Key Benefits of Keeping Your Data Clean with Thomson Data’s B2B Data Cleansing Service?
Understanding the benefits of our data cleansing service will encourage you to optimize your data management practices and help you stay competitive in today’s data-driven market.
Here are some advantages of maintaining a clean database with Thomson Data:
Better ROI for your Sales and Marketing Campaigns: Clean data sharpens your targeting, enabling more effective campaigns, higher conversion rates, and better ROI.
Compliant with Data Regulations: The B2B data cleansing services we provide comply with global data regulations.
Streamline Operations: Your efforts are directed to the right channels when your data is clean and accurate, as your team doesn’t have to spend valuable time fixing errors.
To summarize, accurate data is essential for driving sales and marketing in a B2B environment; it strengthens decision-making and customer relationships. A proactive approach to B2B data cleansing, supported by our services, helps you stay competitive by unlocking the full potential of your data.
Send us a request and we will be happy to assist you.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Sample data for exercises in Further Adventures in Data Cleaning.
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection.
Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments:
Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sentiment.
OpenRefine (formerly Google Refine) is a powerful free and open source tool for data cleaning, enabling you to correct errors in the data, and make sure that the values and formatting are consistent. In addition, OpenRefine records your processing steps, enabling you to apply the same cleaning procedure to other data, and enhancing the reproducibility of your analysis. This workshop will teach you to use OpenRefine to clean and format data and automatically track any changes that you make.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:

- ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment, and ensure you activate the environment before running the code.
- pythonProject: This folder contains several .py files and subfolders, each with specific functionality as described below.
  - A .png file is generated for each column of the raw gaze and IMU recordings, color-coded with logged events, along with .csv files.
  - overlapping_sliding_window_loop.py: The function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out; if you want to see visually what has been changed compared to the original data, you can uncomment this line. Results are saved as .csv files in the results folder.
  - This part contains three main code blocks, the third (iii) being the XGBoost code with correct hyperparameter tuning.
  - Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
  - A .csv file containing the inferred labels is produced as output.

The data is licensed under CC-BY; the code is licensed under MIT.
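For readers unfamiliar with the third block mentioned above, the sketch below shows, in generic form, what XGBoost hyperparameter tuning followed by a probability ("score") threshold on unseen data can look like. This is not the authors' code: the parameter grid, data, and thresholding rule are illustrative assumptions only; the paper derives its confidence threshold empirically in lines 361 to 380 of the actual block.

```python
# Generic illustration of XGBoost tuning + score thresholding; not the authors' code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_unseen, y_train, y_unseen = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameter tuning via cross-validated grid search (grid values are placeholders).
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    cv=3,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Classification "scores" for unseen data; keep only confident predictions.
scores = grid.best_estimator_.predict_proba(X_unseen)[:, 1]
confidence_threshold = 0.8          # placeholder value, not the paper's empirical threshold
confident = np.abs(scores - 0.5) * 2 >= confidence_threshold
print(f"{confident.mean():.0%} of unseen samples exceed the confidence threshold")
```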
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing to the LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on quantifying the meaning of research texts and to make it available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing
Step 1: Downloading of the Data Online
The dataset is collected manually by exporting documents as tab-delimited files online. All documents are available online.

Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents are extracted to R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts or without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words is done by sampling medicine-related publications with human intervention. Detected concatenated words are split into two words; for instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings in such abstracts are listed below:
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
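Before moving on to the Version 2 steps, here is a minimal sketch of the Step 4 correction, assuming a simple rule that splits a known heading from the word fused to it. The corpus authors' actual procedure relied on sampling with human intervention, so this is only illustrative.

```python
import re

# A few known structured-abstract headings (subset of the list above).
HEADINGS = ["Conclusions", "Conclusion", "Results", "Methods", "Background"]

def split_concatenated_headings(text: str) -> str:
    """Insert a space between a section heading and the word fused to it,
    e.g. 'ConclusionHigher' -> 'Conclusion Higher'."""
    for heading in HEADINGS:
        # heading immediately followed by another capitalised word
        text = re.sub(rf"\b({heading})(?=[A-Z][a-z])", r"\1 ", text)
    return text

print(split_concatenated_headings("ConclusionHigher rates were observed."))
# -> "Conclusion Higher rates were observed."
```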
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can include a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors’ rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as ‘Published by Elsevier Ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licences and permission policies identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words). 474 texts were removed.

Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official VHFPS interviews, and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, and the supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps (an illustrative sketch follows the list):
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are advised to choose the most appropriate one and write down the respondent’s answer in detail, so that the survey management team can decide which code is most suitable for that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values lying below the 5th percentile or above the 95th percentile, by listening to interview recordings.
• Final check on matching the main dataset with the different sections; sections where information is collected at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
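As an illustration of the duplicate-removal and outlier-screening steps above, here is a small pandas sketch. The variable names and toy values are hypothetical; the actual VHFPS cleaning was carried out on the survey team's own systems.

```python
import pandas as pd

# Hypothetical submitted forms; column names and values are illustrative only.
forms = pd.DataFrame({
    "household_id": [101, 101, 102, 103, 104],
    "submission_time": pd.to_datetime(
        ["2020-08-01", "2020-08-02", "2020-08-01", "2020-08-03", "2020-08-03"]),
    "monthly_income": [5.0, 5.0, 7.2, 150.0, 6.1],
})

# Remove household duplicates where the same form was submitted more than once,
# keeping the latest submission.
forms = (forms.sort_values("submission_time")
              .drop_duplicates(subset="household_id", keep="last"))

# Flag outliers outside the 5th-95th percentile range for manual review
# (e.g. by listening to the interview recording).
low, high = forms["monthly_income"].quantile([0.05, 0.95])
forms["income_outlier"] = ~forms["monthly_income"].between(low, high)
print(forms)
```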
Palestinian society's access to information and communication technology tools is one of the main inputs for achieving social development and economic change, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics (PCBS) to provide official Palestinian statistics on various areas of life, PCBS implemented the household survey on information and communications technology for the year 2019. The main objective of this report is to present trends in access to and use of information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.
Palestine, West Bank, Gaza strip
Household, Individual
All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.
Sample survey data [ssd]
Sampling Frame The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of the sample selection.
Sample size The estimated sample size is 8,040 households.
Sample Design The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages:
Stage 1: Selection of a stratified sample of 536 enumeration areas using the PPS method.
Stage 2: Selection of a stratified random sample of 15 households from each enumeration area selected in the first stage.
Stage 3: Selection of one person from the age group (10 years and above) at random using Kish tables.
Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, refugee camps).
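The three-stage selection logic can be sketched as follows. The frame, sizes, and column names below are invented for illustration and do not reproduce PCBS's actual frame or selection software (in particular, the Kish-table step is reduced to a plain random draw).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)

# Hypothetical frame of enumeration areas (PSUs) with household counts.
frame = pd.DataFrame({
    "ea_id": range(1, 4001),
    "households": rng.integers(120, 180, size=4000),
})

# Stage 1: select 536 EAs with probability proportional to size (PPS).
p = frame["households"] / frame["households"].sum()
selected_eas = frame.sample(n=536, weights=p, random_state=1)

# Stage 2: select 15 households at random within each selected EA.
def sample_households(row, n=15):
    return rng.choice(row["households"], size=n, replace=False)  # household indices

selected_eas["sampled_households"] = selected_eas.apply(sample_households, axis=1)

# Stage 3: within each sampled household, one person aged 10+ is chosen at random
# (PCBS used Kish tables; a plain random draw stands in for that here).
def pick_respondent(n_eligible):
    return rng.integers(1, n_eligible + 1)

print(selected_eas.head())
print("Example respondent line number in a household with 4 eligible members:",
      pick_respondent(4))
```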
Computer Assisted Personal Interview [capi]
Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.
Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.
Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.
Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once by project management before the training course was conducted; notes and modifications were incorporated into the program by the Data Processing Department, and it was confirmed to be free of errors before going to the field.
Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.
In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.
Data Cleaning After the completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS to identify and correct errors and discrepancies, producing clean and accurate data ready for tabulation and publishing.
Tabulation After the data were checked and cleaned of any errors, tables were extracted according to the prepared list of tables.
The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.
Sampling Errors The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration; therefore, certain differences from the true values obtained through censuses are expected. Variances were calculated for the most important indicators, and the results can be disseminated at the national level and at the level of the West Bank and Gaza Strip.
Non-Sampling Errors Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.
The implementation of the survey encountered non-response; the most common case was households not being present at home during the fieldwork visit. The total non-response rate reached 17.5%. The refusal rate reached 2.9%, which is relatively low compared to other household surveys conducted by PCBS, largely because the survey questionnaire is clear.
The National Labor Force Survey (SAKERNAS) is a survey designed to observe the general situation of the workforce and to detect changes in workforce structure between enumeration periods. Since the survey was initiated in 1976, it has undergone a series of changes affecting its coverage, the frequency of enumeration, the number of households sampled and the type of information collected. It is the largest and most representative source of employment data in Indonesia. For each selected household, general information about each household member was collected, including name, relationship to the head of household, sex, and age. Household members aged 10 years and over were then asked about their marital status, education and employment.
SAKERNAS aims to gather information that meets three objectives: 1. Employment by education, working hours, industrial classification and employment status; 2. Unemployment and underemployment by different characteristics and efforts in looking for work; 3. The working-age population not in the labor force (e.g. attending school, doing housekeeping and others).
The data for the quarterly SAKERNAS gathered in 1989 covered all provinces in Indonesia, with 65,440 households in both rural and urban areas, and are representative down to the provincial level. The main household data are taken from the core questionnaire SAK89-AK.
National coverage*, including urban and rural areas, representative down to the provincial level.
*) Although SAKERNAS covers all of Indonesia, in some years not all provinces were covered. For example, in 2000 the Province of Maluku was excluded from SAKERNAS because of the conflict occurring there. The separation of East Timor from Indonesia in 1999 also changed the scope of SAKERNAS in subsequent years. Later, as a consequence of the expansion of regional autonomy, the proportion of samples per province also changed, as in 2006 when there were already 33 provinces. However, these differences affect only the scope and level of coverage, not the overall pattern. On the other hand, changes in methodology (including sample size) over time are likely to affect the outcomes; for example, in 2000 and 2001, when the sample sizes were only 32,384 and 34,176 households, the data were representative only at the island level (the sample size was insufficient to be representative at the provincial level).
Individual
The survey covered all de jure household members (usual residents) aged 10 years and over residing in the household. However, diplomatic corps households, households in specific enumeration areas, and specific households in regular enumeration areas are not chosen as part of the sample.
Sample survey data
Quarterly SAKERNAS 1989 was implemented in the whole territory of the Republic of Indonesia, with a total sample of about 65,440 households in both rural and urban areas, representative down to the provincial level. Diplomatic corps households, households in specific enumeration areas, and specific households in regular enumeration areas are not chosen as part of the sample. The dataset combines the results of the four quarterly SAKERNAS rounds in 1989, i.e. quarter I, quarter II, quarter III, and quarter IV.
The implementation of SAKERNAS 1989 includes samples from previous enumeration activities (rotation method). The sampling method* is the same as that used for SAKERNAS from 1986 to 1989: part of the households selected in the previous quarter are re-enumerated, and part are drawn from households selected in other earlier quarters, so there is no need to enrol entirely new households. The procedure for selecting sample households is described in more detail in the enumerators'/supervisors' manual document.
*) The sampling method varies across years. For example, in the 1986-1989 period SAKERNAS used the rotation method, in which most of the households selected in one period were selected again in the following period; this was typical of the quarterly SAKERNAS of that period. Other periods used a multi-stage sampling method (two or three stages, depending on whether census sub-blocks / segment groups were included) or a combination of multi-stage sampling and rotation (e.g. SAKERNAS 2006-2010).
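A toy sketch of the rotation idea follows, assuming purely for illustration that half of the sampled households carry over from one quarter to the next; the real SAKERNAS rotation fractions and frame are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1989)
frame = np.arange(100_000)              # hypothetical household identifiers
quarter_size = 16_360                   # illustrative per-quarter sample size

# Quarter I: draw a fresh sample.
q1 = rng.choice(frame, size=quarter_size, replace=False)

# Quarter II: keep half of Quarter I (the rotating panel) and replace the rest.
carried_over = rng.choice(q1, size=quarter_size // 2, replace=False)
fresh = rng.choice(np.setdiff1d(frame, q1),
                   size=quarter_size - len(carried_over), replace=False)
q2 = np.concatenate([carried_over, fresh])

overlap = np.intersect1d(q1, q2).size / quarter_size
print(f"Share of Quarter I households re-enumerated in Quarter II: {overlap:.0%}")  # ~50%
```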
Face-to-face
In SAKERNAS, the questionnaire has been designed in a simple and concise way, so that respondents understand the aim of each question and so that memory lapses and respondent disengagement during data collection are minimized. Furthermore, the design of the SAKERNAS questionnaire is kept stable in order to maintain data comparability.
A household questionnaire was administered in each selected household, which collected general information of household members that includes name, relationship with head of the household, sex and age. Household members aged 10 years and over were then asked about their marital status, education and occupation.
Data processing in SAKERNAS goes through the following stages:
- Batching
- Editing
- Coding
- Data Entry
- Validation
- Tabulation
Sampling error results are presented at the end of the publication The State of Labor Force in Indonesia and in the publication The State of Workers in Indonesia.
MIT License: https://opensource.org/licenses/MIT
Overview
This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.
Content
The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.
Description
This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.
The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:
Studying the similarities between different types of wine.
Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.
Key Features
Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.
Usage
This dataset is useful for:
Data cleaning and preprocessing exercises.
Duplicate detection and handling techniques.
Exploring the impact of duplicates on data analysis and machine learning models.
Educational purposes for understanding the importance of data quality.
Studying similarities between different types of wine and their characteristics.
File Structure
1dd.json: red wine duplicate records.
1ddw.json: white wine duplicate records.
ddrw.json: A file containing information about 100% identical characteristics of red and white wines.
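If you want to reproduce a basic duplicate check yourself, a short pandas sketch along these lines works on the original Wine Quality features. The file name, the "type" column, and the decision to ignore the quality rating are assumptions for illustration, not a description of how the published JSON files were built.

```python
import pandas as pd

# Hypothetical path to a combined red/white Wine Quality csv.
wines = pd.read_csv("winequality.csv")

# Flag rows whose chemical properties are 100% identical, ignoring quality and wine type.
feature_cols = [c for c in wines.columns if c not in ("quality", "type")]
wines["is_duplicate"] = wines.duplicated(subset=feature_cols, keep=False)

print(wines["is_duplicate"].value_counts())
print(wines[wines["is_duplicate"]].sort_values(feature_cols).head())
```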
Acknowledgements
This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Overview and Contents
This replication package was assembled in January of 2025. The code in this repository generates the 13 figures and content of the 3 tables for the paper “All Forecasters Are Not the Same: Systematic Patterns in Predictive Performance”. It also generates the 2 figures and content of the 5 tables in the appendix to this paper. The main contents of the repository are the following:
- Code/: folder of scripts to prepare and clean data as well as generate tables and figures.
- Functions/: folder of subroutines for use with MATLAB scripts.
- Data/: data folder.
  - Raw/: ECB SPF forecast data, realizations of target variables, and start and end bins for density forecasts.
  - Intermediate/: data used at intermediate steps in the cleaning process. These datasets are generated with x01_Raw_Data_Shell.do, x02a_Individual_Uncertainty_GDP.do, x02b_Individual_Uncertainty_HICP.do, x02c_Individual_Uncertainty_Urate.do, x03_Pull_Data.do, x04_Data_Clean_And_Merge, and x05_Drop_Low_Counts.do in the Code/ folder.
  - Ready/: data used to conduct regressions, statistical tests, and generate figures.
- Output/: folder of results.
  - Figures/: .jpg files for each figure used in the paper and its appendix.
  - HL Results/: results from applying the Hounyo and Lahiri (2023) testing procedure for equal predictive performance to ECB SPF forecast data. This folder contains the material for Tables 1A-4A.
  - Regressions/: regression results, as well as material for Tables 3 and 5A.
  - Simulations/: results from the simulation exercise as well as the datasets used to create Figures 9-12.
  - Statistical Tests/: results displayed in Tables 1 and 2.
The repository also contains the manuscript, the appendix, and this read-me file.

Disclaimer
This replication package was produced by the authors and is not an official product of the Federal Reserve Bank of Cleveland. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by the Federal Reserve Bank of Cleveland or the Federal Reserve System.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE CENTRAL AGENCY FOR PUBLIC MOBILIZATION AND STATISTICS (CAPMAS)
In any society, the human element represents the basis of the workforce, which carries out all service and production activities. It is therefore essential to produce labor force statistics and studies related to the growth and distribution of manpower and the distribution of the labor force by its different types and characteristics.
In this context, the Central Agency for Public Mobilization and Statistics conducts "Quarterly Labor Force Survey" which includes data on the size of manpower and labor force (employed and unemployed) and their geographical distribution by their characteristics.
By the end of each year, CAPMAS issues the annual aggregated labor force bulletin publication that includes the results of the quarterly survey rounds that represent the manpower and labor force characteristics during the year.
----> Historical Review of the Labor Force Survey:
1- The first Labor Force Survey was undertaken in 1957, with the first round conducted in November of that year. The survey has continued to be conducted in successive rounds (quarterly, bi-annually, or annually) ever since.
2- Starting with the October 2006 round, the fieldwork of the labor force survey was developed to focus on the following two points: a. The importance of using a panel sample, as part of the survey sample, to monitor the dynamic changes of the labor market. b. Improving the questionnaire to include more questions that help in better defining the relationship to the labor force of each household member (employed, unemployed, out of the labor force, etc.), in addition to re-ordering some existing questions in a more logical way.
3- Starting with the January 2008 round, the methodology was developed to collect a more representative sample throughout the survey year. This is done by distributing the sample of each governorate into five groups; the questionnaires are collected from each group separately every 15 days over 3 months (in the middle and at the end of each month).
----> The survey aims at covering the following topics:
1- Measuring the size of the Egyptian labor force among civilians (for all governorates of the republic) by their different characteristics.
2- Measuring the employment rate at national level and different geographical areas.
3- Measuring the distribution of employed people by the following characteristics: gender, age, educational status, occupation, economic activity, and sector.
4- Measuring unemployment rate at different geographic areas.
5- Measuring the distribution of unemployed people by the following characteristics: gender, age, educational status, unemployment type "ever employed/never employed", occupation, economic activity, and sector for people who have ever worked.
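As a simple illustration of objectives 2-5, such rates and distributions can be computed from harmonized microdata roughly as follows. The file name and column names here are hypothetical and do not correspond to the ERF harmonized variable names.

```python
import pandas as pd

# Hypothetical harmonized microdata; column names are illustrative only.
lfs = pd.read_csv("egypt_lfs_harmonized.csv")

employed = lfs["labor_status"].eq("employed")
unemployed = lfs["labor_status"].eq("unemployed")
in_labor_force = employed | unemployed

# Weighted unemployment rate at the national level.
w = lfs["sample_weight"]
unemployment_rate = (w[unemployed].sum() / w[in_labor_force].sum()) * 100
print(f"Unemployment rate: {unemployment_rate:.1f}%")

# Distribution of the employed by gender and educational status (weighted shares).
dist = (lfs[employed]
        .groupby(["gender", "education"])["sample_weight"].sum()
        .div(w[employed].sum()) * 100)
print(dist.round(1))
```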
The raw survey data provided by the statistical agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data of existing labor force surveys in several Arab countries.
Covering a sample of urban and rural areas in all the governorates.
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE CENTRAL AGENCY FOR PUBLIC MOBILIZATION AND STATISTICS (CAPMAS)
----> Sample Design and Selection
The sample of the LFS 2006 survey is a simple systematic random sample.
----> Sample Size
The sample size varied by quarter (Q1 = 19,429; Q2 = 19,419; Q3 = 19,119; Q4 = 18,835 households), for a total of 76,802 households annually. These households are distributed at the governorate level (urban/rural).
A more detailed description of the different sampling stages and allocation of sample across governorates is provided in the Methodology document available among external resources in Arabic.
Face-to-face [f2f]
The questionnaire design follows the latest International Labor Organization (ILO) concepts and definitions of labor force, employment, and unemployment.
The questionnaire comprises 3 tables in addition to the identification and geographic data of household on the cover page.
----> Table 1- Demographic and employment characteristics and basic data for all household individuals
Including: gender, age, educational status, marital status, residence mobility and current work status
----> Table 2- Employment characteristics table
This table is filled in for individuals who were employed at the time of the survey or engaged in work during the reference week, and provides information on: - Relationship to employer: employer, self-employed, waged worker, and unpaid family worker - Economic activity - Sector - Occupation - Effective working hours - Work place - Average monthly wage
----> Table 3- Unemployment characteristics table
This table is filled in for all unemployed individuals who satisfy the unemployment criteria, and provides information on: - Type of unemployment (unemployed, unemployed ever worked) - Economic activity and occupation in the last job held before becoming unemployed - Duration of the last unemployment spell in months - Main reason for unemployment
----> Raw Data
Office editing is one of the main stages of the survey. It starts once the questionnaires are received from the field and is carried out by selected work groups. It includes: (a) editing for coverage and completeness; (b) editing for consistency.
----> Harmonized Data
This activity provides students with a basic introduction to data science. Students will work their way through the process of downloading online data, data cleaning, analysis, and presentation. Estimated total time is four hours, but the activity can easily be broken into several blocks. Opportunities for formative and summative assessment are provided at the end of each major step. This activity would fit well in courses for ecology, environmental science/policy, and water science. It uses citizen science data from the Chesapeake Water Watch program.
Cyclistic: Google Data Analytics Capstone Project
Cyclistic - Google Data Analytics Certification Capstone Project
Moirangthem Arup Singh

How Does a Bike-Share Navigate Speedy Success?

Background: This project is for the Google Data Analytics Certification capstone project. I am wearing the hat of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.

This project will be completed by using the 6 Data Analytics stages:
Ask: Identify the business task and determine the key stakeholders.
Prepare: Collect the data, identify how it’s organized, determine the credibility of the data.
Process: Select the tool for data cleaning, check for errors and document the cleaning process.
Analyze: Organize and format the data, aggregate the data so that it’s useful, perform calculations and identify trends and relationships.
Share: Use design thinking principles and a data-driven storytelling approach, present the findings with effective visualization. Ensure the analysis has answered the business task.
Act: Share the final conclusion and the recommendations.

Ask: Business Task: Recommend marketing strategies aimed at converting casual riders into annual members by better understanding how annual members and casual riders use Cyclistic bikes differently.
Stakeholders: Lily Moreno: The director of marketing and my manager. Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program. Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.

Prepare: For this project, I will use the public data of Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under the license. I downloaded the ZIP files containing the csv files from the above link, but while uploading the files in Kaggle (as I am using a Kaggle notebook), it gave me a warning that the dataset is already available in Kaggle. So I will be using the cyclistic-bike-share dataset from Kaggle. The dataset has 13 csv files from April 2020 to April 2021. For the purpose of my analysis I will use the csv files from April 2020 to March 2021. The source csv files are in Kaggle so I can rely on their integrity. I am using Microsoft Excel to get a glimpse of the data.
There is one csv file for each month, each containing information about the bike rides: ride id, rideable type, start and end time, start and end station, and latitude and longitude of the start and end stations.

Process: I will use R in Kaggle to import the dataset, check how it’s organized, verify whether all the columns have appropriate data types, find outliers, and check whether any of these data have sampling bias. I will be using the R libraries below.
library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotrix)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4 ✔ tibble 3.1.4 ✔ dplyr 1.0.7 ✔ tidyr 1.1.3 ✔ stringr 1.4.0 ✔ readr 2.0.1 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
setwd("/kaggle/input/cyclistic-bike-share")
r_202004 <- read.csv("202004-divvy-tripdata.csv") r_202005 <- read.csv("20...
The 2016 Integrated Household Panel Survey (IHPS) was launched in April 2016 as part of the Malawi Fourth Integrated Household Survey fieldwork operation. The IHPS 2016 targeted 1,989 households that were interviewed in the IHPS 2013 and that could be traced back to half of the 204 enumeration areas that were originally sampled as part of the Third Integrated Household Survey (IHS3) 2010/11. The 2019 IHPS was launched in April 2019 as part of the Malawi Fifth Integrated Household Survey fieldwork operations targeting the 2,508 households that were interviewed in 2016. The panel sample expanded each wave through the tracking of split-off individuals and the new households that they formed. Available as part of this project is the IHPS 2019 data, the IHPS 2016 data as well as the rereleased IHPS 2010 & 2013 data including only the subsample of 102 EAs with updated panel weights. Additionally, the IHPS 2016 was the first survey that received complementary financial and technical support from the Living Standards Measurement Study – Plus (LSMS+) initiative, which has been established with grants from the Umbrella Facility for Gender Equality Trust Fund, the World Bank Trust Fund for Statistical Capacity Building, and the International Fund for Agricultural Development, and is implemented by the World Bank Living Standards Measurement Study (LSMS) team, in collaboration with the World Bank Gender Group and partner national statistical offices. The LSMS+ aims to improve the availability and quality of individual-disaggregated household survey data, and is, at start, a direct response to the World Bank IDA18 commitment to support 6 IDA countries in collecting intra-household, sex-disaggregated household survey data on 1) ownership of and rights to selected physical and financial assets, 2) work and employment, and 3) entrepreneurship – following international best practices in questionnaire design and minimizing the use of proxy respondents while collecting personal information. This dataset is included here.
National coverage
The IHPS 2016 and 2019 attempted to track all IHPS 2013 households stemming from 102 of the original 204 baseline panel enumeration areas, as well as individuals who moved away from the 2013 dwellings between 2013 and 2016, as long as they were neither servants nor guests at the time of the IHPS 2013, were projected to be at least 12 years of age, and were known to be residing in mainland Malawi, excluding Likoma Island and institutions such as prisons, police compounds, and army barracks.
Sample survey data [ssd]
A sub-sample of IHS3 2010 sample enumeration areas (EAs) (i.e. 204 EAs out of 768 EAs) was selected prior to the start of the IHS3 fieldwork with the intention to (i) track and resurvey these households in 2013 in accordance with the IHS3 fieldwork timeline, as part of the Integrated Household Panel Survey (IHPS 2013), and (ii) visit a total of 3,246 households in these EAs twice to reduce the recall bias associated with different aspects of agricultural data collection. At baseline, the IHPS sample was selected to be representative at the national, regional and urban/rural levels and for each of the following 6 strata: (i) Northern Region - Rural, (ii) Northern Region - Urban, (iii) Central Region - Rural, (iv) Central Region - Urban, (v) Southern Region - Rural, and (vi) Southern Region - Urban. The IHPS 2013 main fieldwork took place during the period of April-October 2013, with residual tracking operations in November-December 2013.
Given budget and resource constraints, for the IHPS 2016 the number of sample EAs in the panel was reduced to 102 out of the 204 EAs. As a result, the domains of analysis are limited to the national, urban and rural areas. Although the results of the IHPS 2016 cannot be tabulated by region, the stratification of the IHPS by region, urban and rural strata was maintained. The IHPS 2019 tracked all individuals 12 years or older from the 2016 households.
Computer Assisted Personal Interview [capi]
Data Entry Platform To ensure data quality and the timely availability of data, the IHPS 2019 was implemented using the World Bank’s Survey Solutions CAPI software. To carry out the IHPS 2019, 1 laptop computer and a wireless internet router were assigned to each team supervisor, and each enumerator had an 8-inch GPS-enabled Lenovo tablet computer provided by the NSO. The use of Survey Solutions allowed for the real-time availability of data: as interviews were completed, they were approved by the supervisor and synced to the headquarters server as frequently as possible. While administering the first module of the questionnaire, the enumerators also used their tablets to record the GPS coordinates of the dwelling units. Geo-referenced household locations from the tablets complemented the GPS measurements taken by the Garmin eTrex 30 handheld devices, and these were linked with publicly available geospatial databases to enable the inclusion of a number of geospatial variables - extensive measures of distance (i.e. distance to the nearest market), climatology, soil and terrain, and other environmental factors - in the analysis.
Data Management The IHPS 2019 Survey Solutions CAPI data entry application was designed to streamline the data collection process in the field. IHPS 2019 interviews were mainly collected in “sample” mode (assignments generated from headquarters) and a few in “census” mode (new interviews created by interviewers from a template), to give the NSO more control over the sample. This hybrid approach was necessary to aid the tracking operations, whereby an enumerator could quickly create a tracking assignment, considering that teams were mostly working in areas with poor network connections and hence could not quickly receive tracking cases from headquarters.
The range and consistency checks built into the application were informed by the LSMS-ISA experience with the IHS3 2010/11, IHPS 2013 and IHPS 2016. Prior programming of the data entry application allowed a wide variety of range and consistency checks to be conducted and reported, and potential issues to be investigated and corrected before closing the assigned enumeration area. Headquarters (the NSO management) assigned work to the supervisors based on their regions of coverage. The supervisors then made assignments to the enumerators linked to their supervisor account. The work assignments and syncing of completed interviews took place through a Wi-Fi connection to the IHPS 2019 server. Because the data were available in real time, they were monitored closely throughout the entire data collection period, and upon receipt at headquarters, data were exported to Stata for additional consistency checks, data cleaning, and analysis.
Data Cleaning The data cleaning process was done in several stages over the course of fieldwork and through preliminary analysis. The first stage of data cleaning was conducted in the field by the field teams, using error messages generated by the Survey Solutions application when a response did not fit the rules for a particular question. For questions that flagged an error, the enumerators were expected to record a comment within the questionnaire explaining the reason for the error to their supervisor and confirming that they had double-checked the response with the respondent. The supervisors were expected to sync the enumerator tablets as frequently as possible to avoid having many questionnaires on the tablet and to enable daily checks of questionnaires. Some supervisors preferred to review completed interviews on the tablets, so they would review prior to syncing, but still record the notes in the supervisor account and reject questionnaires accordingly. The second stage of data cleaning was also done in the field and resulted from the additional error reports generated in Stata, which were in turn sent to the field teams via email or Dropbox. The field supervisors collected the reports for their assignments and, in coordination with the enumerators, reviewed, investigated, and corrected errors. Due to the quick turnaround in error reporting, it was possible to conduct call-backs while the team was still operating in the EA when required. Corrections to the data were entered in the rejected questionnaires and sent back to headquarters.
The data cleaning process was done in several stages over the course of the fieldwork and through preliminary analyses. The first stage was during the interview itself. Because CAPI software was used, as enumerators asked the questions and recorded information, error messages were provided immediately when the information recorded did not match previously defined rules for that variable. For example, if the education level for a 12 year old respondent was given as post graduate. The second stage occurred during the review of the questionnaire by the Field Supervisor. The Survey Solutions software allows errors to remain in the data if the enumerator does not make a correction. The enumerator can write a comment to explain why the data appears to be incorrect. For example, if the previously mentioned 12 year old was, in fact, a genius who had completed graduate studies. The next stage occurred when the data were transferred to headquarters where the NSO staff would again review the data for errors and verify the comments from the
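The age-versus-education rule mentioned above can be expressed as a simple cross-variable check. The sketch below is a generic pandas illustration with invented column names and thresholds; the NSO's actual checks were built into Survey Solutions and Stata, not into this code.

```python
import pandas as pd

# Hypothetical roster extract; column names and values are invented for illustration.
roster = pd.DataFrame({
    "case_id":   ["A-01", "A-02", "B-01"],
    "age":       [12, 34, 52],
    "education": ["postgraduate", "secondary", "primary"],
})

# Flag implausible combinations, e.g. a 12-year-old reported as a postgraduate.
MIN_AGE = {"postgraduate": 20, "secondary": 12, "primary": 6}
roster["flag_age_education"] = roster["age"] < roster["education"].map(MIN_AGE)

# Flagged cases would be sent back to the field team with a comment rather than
# silently corrected, mirroring the review workflow described above.
print(roster[roster["flag_age_education"]])
```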
The war Israel launched on the Gaza Strip led to the total destruction of the infrastructure of the Gaza Strip and turned the whole area into a disaster zone. More than 1,300 people were killed from the start of the war on December 27, 2008 until its end, more than 5,400 people were wounded, more than 3,000 establishments and thousands of homes were destroyed, and tens of thousands of people were displaced.
The reconstruction of what the war destroyed requires accurate and reliable data in all fields, including the socioeconomic, environmental, health, and construction fields. The purpose of conducting the Impact of the War and Siege on Gaza Strip Survey is to produce findings that can serve as a tool for planning the reconstruction and for reviewing the situation of Palestinian households in the Gaza Strip from the economic, health, education, and environmental perspectives.
The survey was conducted in cooperation with a number of international organizations, including the WFP and FAO, in order to assess the impact of the war on the socioeconomic situation of households in the Gaza Strip and to establish the plans and policies necessary to improve the socioeconomic, environmental, health, and educational conditions of the Gaza Strip.
GAZA STRIP
Household
The study population of the survey consists of all Palestinian households living in the Gaza Strip in the aftermath of the last war (December 27, 2008 - January 17, 2009).
Sample survey data [ssd]
Sampling frame
The sampling frame was established from the data of the Population, Housing, and Establishment Census, which PCBS conducted in late 2007. The frame is a list of enumeration areas; these areas are used as Primary Sampling Units (PSUs) in the first stage of the sample selection process.
Sample design strata
The study population is divided according to the 33 localities of the Gaza Strip.
Sample design
The sample is a stratified cluster random sample selected in two stages:
Stage one: A stratified random sample of 207 enumeration areas was selected from the Gaza Strip localities.
Stage two: A random sample of 35, 50, or 100 households was selected from each enumeration area chosen in stage one.
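As a rough illustration of this two-stage selection logic (not PCBS's actual procedure), the following sketch first samples enumeration areas within locality strata and then samples households within the selected areas; the data, cluster sizes, and column names are entirely hypothetical.

```python
# Toy illustration of a two-stage stratified cluster sample.
import pandas as pd

# Frame of enumeration areas (PSUs), stratified by locality.
frame = pd.DataFrame({
    "ea_id": range(1, 11),
    "locality": ["A"] * 5 + ["B"] * 5,
})

# Stage one: select enumeration areas within each locality stratum.
psus = frame.groupby("locality").sample(n=2, random_state=1)

# Toy household listing: three households per enumeration area.
households = pd.DataFrame({
    "ea_id": [ea for ea in frame["ea_id"] for _ in range(3)],
    "hh_id": list(range(1, 4)) * len(frame),
})

# Stage two: select a fixed number of households within each selected EA.
sample = (households[households["ea_id"].isin(psus["ea_id"])]
          .groupby("ea_id")
          .sample(n=2, random_state=1))
print(sample)
```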
Face-to-face [f2f]
Survey's questionnaire
The survey questionnaire of The Impact of the War and Siege on Gaza Strip Survey, 2009 is the main instrument for data collection, and its design therefore took into consideration the standard technical specifications that facilitate the collection, processing, and analysis of data. Because this type of specialized survey is new to PCBS, relevant experiences of other countries and international best practices were thoroughly reviewed to ensure that the contents and design of the survey's instruments meet international standards. The survey's questionnaire includes the following basic components:
Identification data:
The identification data constitutes the key that uniquely identifies each questionnaire.
The key consists of the questionnaire's serial number and the household number.
Data quality controls:
A set of quality controls were developed and incorporated into the different phases of The Impact of the War and Siege on Gaza Strip Survey, 2009, including field operations, office editing, office coding, data processing, and survey documentation.
Data processing went through a number of stages, from start to the preparation of the final files. The stages include:
1. Programming stage: This stage included preparation of the data entry programs using the ACCESS package, setting up entry rules to ensure good entry of questionnaires, and setting up cleaning inquiries to examine the data after entry. Such inquiries examine the variables at the questionnaire level.
2. Receiving and controlling questionnaires stage: This stage included receiving questionnaires from the field work coordinator using the format especially prepared for this purpose, and checking against it that all questionnaires were received.
3. Entry stage: The data entry process started on June 24, 2009 and ended on July 29, 2009. The number of questionnaires entered at the PCBS main office was 7,543.
4. Data auditing stage: This stage included checking after entry, by comparing the entered data with the original questionnaires and correcting entry errors, if any.
5. Data cleansing: Comprehensive cleansing rules were set up among the questions of the questionnaire to ensure consistency and to identify out-of-context or illogical answers. This was done using a special program applied to the data. The errors were either corrected or the questionnaires were returned to the survey manager to correct the errors.
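As an illustration of the kind of cleansing inquiries described in stages 1 and 5 (the survey itself used an ACCESS-based system), a comparable check could be written as follows; the column names and thresholds are hypothetical.

```python
# A minimal sketch, in Python/pandas, of post-entry cleansing inquiries:
# uniqueness of the identification key and a simple plausibility rule.
import pandas as pd

def cleansing_report(df: pd.DataFrame) -> pd.DataFrame:
    issues = []
    # The identification key (questionnaire serial number + household number)
    # must be unique.
    dups = df[df.duplicated(subset=["questionnaire_serial", "household_number"],
                            keep=False)]
    for _, row in dups.iterrows():
        issues.append((row["questionnaire_serial"], "duplicate identification key"))
    # Example consistency rule: household size must be positive and plausible.
    bad_size = df[(df["household_size"] < 1) | (df["household_size"] > 30)]
    for _, row in bad_size.iterrows():
        issues.append((row["questionnaire_serial"], "implausible household size"))
    return pd.DataFrame(issues, columns=["questionnaire_serial", "issue"])
```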
The sample size of the survey was 7,500 households. The response rate was 100%.
The data of this survey are of high quality.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Wombat Value of Information (VOI) Analysis
This repository contains the code and data for the manuscript "Natural History Collections at the Crossroads: Shifting Priorities and Data-Driven Opportunities"
Owen Forbes1*, Peter H. Thrall1, Andrew G. Young1, Cheng Soon Ong2
1 CSIRO National Research Collections Australia, Canberra, ACT 2601, Australia
2 CSIRO Data61, Canberra, ACT 2601, Australia
*Corresponding author: owen.forbes@csiro.au
## Repository Overview
This project applies a Value of Information (VOI) analytical framework to wombat occurrence data from GBIF. It demonstrates a novel research prioritisation approach that integrates:
- **Value of Information (VOI)**: Expected information gain from additional observations
- **Need for Information (NFI)**: Habitat condition/loss metrics
- **Cost of Information (COI)**: Remoteness areas affecting sampling feasibility
The analysis uses a binomial model as a simple species distribution model to evaluate the research value of new wombat observations across different locations in NSW and ACT, Australia.
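As a hedged sketch of how an expected information gain can be computed for a Beta-binomial cell model (this illustrates the general idea and is not necessarily the exact formulation implemented in `Wombats_VOI.qmd`, which is written in R):

```python
# Expected KL information gain from one additional survey-year in a grid
# cell, under a Beta-binomial detection model with a uniform Beta(1,1) prior.
import numpy as np
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    """KL divergence KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def expected_info_gain(k, n):
    """Expected KL gain from one more observation year in a cell with
    k detection-years out of n."""
    a, b = k + 1, n - k + 1            # current posterior
    p_detect = a / (a + b)             # posterior predictive detection probability
    gain_detect = kl_beta(a + 1, b, a, b)
    gain_absent = kl_beta(a, b + 1, a, b)
    return p_detect * gain_detect + (1 - p_detect) * gain_absent

print(expected_info_gain(k=2, n=10))   # sparsely observed cell
print(expected_info_gain(k=8, n=10))   # well-observed cell
```

Cells with few observation-years yield larger expected gains, which is the intuition behind prioritising under-sampled locations.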
## Repository Structure
The repository consists of one main Quarto (.qmd) script and associated data files:
- `Wombats_VOI.qmd`: Primary analysis script containing data cleaning, modelling, and visualisation
## Data Files
### GBIF Wombat Occurrence Data
- `0014630-250127130748423.csv` - GBIF wombat occurrence export (download available from: https://www.gbif.org/occurrence/download/0014630-250127130748423 )
### Environmental and Administrative Data
- `HCAS31_AHC_2020_2022_NSW_50km_3577.tif` - Habitat condition raster for NSW (from CSIRO Habitat Condition Assessment System - subset of full dataset available from https://data.csiro.au/collection/csiro%3A63571v7 )
- `RA_2021_AUST_GDA2020` - ABS Remoteness Areas 2021 shapefile (folder containing multiple files - available from https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files
Specific shapefile URL: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/RA_2021_AUST_GDA2020.zip )
## Requirements
- R (version 4.3.2 or later)
- Required R packages:
- tidyverse (v2.0.0) - for data manipulation and visualization
- sf (v1.0-16) - for spatial data handling
- lubridate (v1.9.3) - for date handling
- raster (v3.6-26) - for raster data handling
- ozmaps (v0.4.5) - for Australian state boundaries
- terra (v1.7-71) - for efficient raster processing
- exactextractr (v0.10.0) - for exact extraction from raster to polygons
- patchwork (v1.2.0) - for combining plots
- corrplot (v0.92) - for correlation matrix visualization
- RColorBrewer (v1.1-3) - for color palettes
- gridExtra (v2.3) - for arranging multiple plots
- grid - for low-level graphics (base R)
Install these packages before running the script.
## Analysis Process
The analysis follows these key steps:
1. **Data Cleaning**
- Filter to NSW and ACT wombat records
- Remove records with high uncertainty (>10km)
- Create a 50km grid system across NSW and ACT
2. **Temporal Analysis**
- Calculate presence/absence of wombats in each grid cell by year
- Determine empirical observation probability for each grid cell
- Analyse temporal stability using 5-year sliding windows
3. **Value of Information Calculation**
- Calculate expected information gain through KL divergence
- Simulate adding new observations using the binomial model
- Map VOI across NSW and ACT
4. **Need for Information Integration**
- Import and process habitat condition data
- Convert to habitat loss metric (NFI)
- Integrate with VOI analysis
5. **Cost of Information**
- Process remoteness area data as a proxy for sampling cost
- Map remoteness areas across the study region
6. **Combined Analysis**
- Create quadrant analysis combining VOI and NFI
- Generate comprehensive visualisations of all three metrics
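A rough Python sketch of the quadrant classification in step 6 (the actual analysis is implemented in R in `Wombats_VOI.qmd`; the `voi` and `nfi` column names are hypothetical):

```python
# Classify grid cells into quadrants by whether VOI and NFI exceed their medians.
import pandas as pd

def quadrant(df: pd.DataFrame) -> pd.Series:
    high_voi = df["voi"] >= df["voi"].median()
    high_nfi = df["nfi"] >= df["nfi"].median()
    labels = pd.Series("low VOI / low NFI", index=df.index)
    labels[high_voi & high_nfi] = "high VOI / high NFI"   # priority cells
    labels[high_voi & ~high_nfi] = "high VOI / low NFI"
    labels[~high_voi & high_nfi] = "low VOI / high NFI"
    return labels
```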
## How to Use
1. Download this repository to your local machine.
2. Set your working directory to the location of the script.
3. Ensure all required R packages are installed.
4. Run the script in RStudio or your preferred R environment.
## Outputs
The primary output is `combined_quadrant_analysis_100425.png`, which displays four panels:
1. Value of Information (VOI) - Expected information gain across the study area
2. Need for Information (NFI) - Habitat loss percentiles
3. Cost of Information (COI) - Remoteness areas as a proxy for sampling cost
4. VOI vs NFI Quadrant Analysis - Relationship between information value and need
## Citation
If you use this code or methodology, please cite:
Forbes, O., Thrall, P.H., Young, A.G., Ong, C.S. (2025). Natural History Collections at the Crossroads: Shifting Priorities and Data-Driven Opportunities. [Journal information pending]
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:
Continental Europe
Great Britain
Nordic
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].
Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].
Content of the repository
A) Scripts
In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.
In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).
The python scripts run with Python 3.7 and with the packages found in "requirements.txt".
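For illustration only, a cleansing step of the kind performed by "clean_corrupted_data.py" might look like the sketch below; the thresholds and heuristics are assumptions, not the values used in the actual scripts (see the supplementary material of [1] for the documented procedure).

```python
# Illustrative cleansing step: mark corrupted recordings as NaN and fill
# at most one sample per data hole. Thresholds and heuristics are assumptions.
import numpy as np
import pandas as pd

def clean_frequency(series: pd.Series) -> pd.Series:
    s = series.copy()
    # Mark physically implausible values as corrupted.
    s[(s < 45.0) | (s > 55.0)] = np.nan
    # Mark long constant runs (e.g. a stuck recording) as corrupted.
    constant = s.diff().abs() < 1e-6
    s[constant & constant.shift(fill_value=False)] = np.nan
    # Fill at most one sample per data hole; leave longer gaps as NaN.
    return s.interpolate(limit=1, limit_area="inside")
```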
B) Yearly converted and cleansed data
The folders "_converted" contain the output of "convert_data_format.py" and "_cleansed" contain the output of "clean_corrupted_data.py".
File type: The files are zipped csv-files, where each file comprises one year.
Data format: The files contain two columns. The first column contains the time stamps in the format Year-Month-Day Hour-Minute-Second, given as naive local time; the second column contains the frequency values in Hz. The local time refers to the following time zones and includes daylight saving time (python time zone in brackets):
TransnetBW: Continental European Time (CE)
Nationalgrid: Great Britain (GB)
Fingrid: Finland (Europe/Helsinki)
NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
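As an example of loading one yearly file and attaching the documented time zone, the sketch below uses the Fingrid zone; the file name, header layout, and exact timestamp format are assumptions and should be adjusted to the actual files.

```python
# Load one yearly zipped csv and localize the naive timestamps.
import pandas as pd

# Placeholder file name; assumes the zipped csv has no header row.
df = pd.read_csv("2018_cleansed.zip", header=None,
                 names=["time", "frequency_hz"], na_values=["NaN"])

# The documented format is Year-Month-Day Hour-Minute-Second; adjust the
# format string if the actual separator differs.
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H-%M-%S")

# Timestamps are naive local time; localize to the documented zone,
# letting pandas resolve daylight saving transitions from the data order.
df["time"] = df["time"].dt.tz_localize("Europe/Helsinki",
                                       ambiguous="infer",
                                       nonexistent="shift_forward")
print(df.head())
```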
Use cases
We point out that this repository can be used in two different ways:
Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both data sets include segments of NaN-values due to missing and corrupted recordings. Only a very small part of the NaN-values was eliminated in the cleansed data, so as not to alter the data too much.
Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "_converted".
License
This work is licensed under multiple licenses, which are located in the "LICENSES" folder.
We release the code in the folder "Scripts" under the MIT license.
The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.
TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Changelog
Version 2:
Add time zone information to description
Include new frequency data
Update references
Change folder structure to yearly folders
Version 3:
Correct TransnetBW files for missing data in May 2016
The Wound Cleansing Spray market has gained significant traction in recent years, driven by growing awareness of wound care and the importance of maintaining hygiene to prevent infection. These sprays are specifically formulated to cleanse wounds, providing a crucial first step in the healing process.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE DEPARTMENT OF STATISTICS OF THE HASHEMITE KINGDOM OF JORDAN
The Department of Statistics (DOS) carried out four rounds of the 2016 Employment and Unemployment Survey (EUS). The survey rounds covered a sample of about forty-nine thousand households nationwide. The sampled households were selected using a stratified multi-stage cluster sampling design.
It is worth mentioning that the DOS employed new technology in data collection and data processing: data were collected using an electronic questionnaire on a handheld device (PDA) instead of a hard copy.
The survey's main objectives are:
- To identify the demographic, social, and economic characteristics of the population and manpower.
- To identify the occupational structure and economic activity of employed persons, as well as their employment status.
- To identify the reasons behind the desire of employed persons to search for a new or additional job.
- To measure the economic activity participation rate (the number of economically active persons divided by the population aged 15+).
- To identify the different characteristics of unemployed persons.
- To measure unemployment rates (the number of unemployed persons divided by the number of economically active persons aged 15+) according to the various characteristics of the unemployed, and the changes that might take place in this regard.
- To identify the most important ways and means used by unemployed persons to get a job, in addition to measuring the duration of unemployment for such persons.
- To identify the changes over time that might take place regarding the above-mentioned variables.
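For clarity, the two headline rates defined above can be computed as in the sketch below; the data are made up and the column names are hypothetical.

```python
# Toy illustration of the participation rate and the unemployment rate.
import pandas as pd

people = pd.DataFrame({
    "age": [17, 25, 34, 41, 63, 70],
    "status": ["student", "employed", "unemployed", "employed", "unemployed", "retired"],
})

working_age = people[people["age"] >= 15]
active = working_age[working_age["status"].isin(["employed", "unemployed"])]

participation_rate = len(active) / len(working_age)          # active / population 15+
unemployment_rate = (active["status"] == "unemployed").mean()  # unemployed / active
print(f"participation rate: {participation_rate:.2f}, "
      f"unemployment rate: {unemployment_rate:.2f}")
```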
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve, and disseminate micro data from existing labor force surveys in several Arab countries.
The sample is representative at the national level (Kingdom), the governorate level, and the level of the three regions (Central, North, and South).
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
Computer Assisted Personal Interview [capi]
----> Raw Data
A tabulation plan was set based on the previous Employment and Unemployment Surveys, and the required programs were prepared and tested. When all prior data processing steps were completed, the actual survey results were tabulated using an ORACLE package. The tabulations were then thoroughly checked for consistency. The final report was then prepared, containing detailed tabulations as well as the methodology of the survey.
----> Harmonized Data