At Thomson Data, we help businesses clean up and manage messy B2B databases so they stay accurate, up to date, and complete. We believe your sales development representatives and marketers should focus on building meaningful relationships with prospects, not scrubbing through bad data.
Here are the key steps involved in our B2B data cleansing process:
Data Auditing: We begin with a thorough audit of the database to identify errors, gaps, and inconsistencies, focusing primarily on outdated, incomplete, and duplicate information.
Data Standardization: Ensuring consistency across data records is one of our core services; it includes standardizing job titles, addresses, and company names so that records can be easily shared and used by different teams.
Data Deduplication: Another way we improve efficiency is by removing duplicate records. Deduplication matters in large B2B datasets because multiple records for the same company often exist in the database.
Data Enrichment: After the first three steps, we enrich your data by filling in missing details and updating records, ensuring the database delivers complete, actionable insights.
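To make the standardization and deduplication steps concrete, here is a minimal, hypothetical pandas sketch. The column names, suffix list, and matching rule are illustrative assumptions, not Thomson Data's actual pipeline.

```python
import pandas as pd

# Hypothetical B2B contact records; column names are illustrative assumptions.
records = pd.DataFrame({
    "company":   ["Acme Corp.", "ACME Corporation", "Globex Inc"],
    "job_title": ["VP Sales", "V.P. of Sales", "CTO"],
    "email":     ["jane@acme.com", "jane@acme.com", "ed@globex.com"],
})

# Standardization: normalize case, punctuation, and common legal suffixes.
def standardize_company(name: str) -> str:
    name = name.lower().strip().rstrip(".")
    for suffix in (" corporation", " corp", " inc", " ltd"):
        name = name.removesuffix(suffix)
    return name.title()

records["company_std"] = records["company"].map(standardize_company)

# Deduplication: treat records sharing the same email as one contact, keep the first.
deduped = records.drop_duplicates(subset=["email"], keep="first")
print(deduped)
```

A real cleansing workflow would add fuzzy matching and enrichment from reference sources; this sketch only shows the basic normalize-then-deduplicate pattern.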
What are the Key Benefits of Keeping Your Data Clean with Thomson Data’s B2B Data Cleansing Service?
Understanding the benefits of our data cleansing service will encourage you to optimize your data management practices and help you stay competitive in today’s data-driven market.
Here are some advantages of maintaining a clean database with Thomson Data:
Better ROI for your Sales and Marketing Campaigns: Clean data sharpens your targeting, enabling more effective campaigns, higher conversion rates, and better ROI.
Compliant with Data Regulations: The B2B data cleansing services we provide comply with global data regulations.
Streamline Operations: Your efforts are directed to the right channels when your data is clean and accurate, as your team doesn’t have to spend valuable time fixing errors.
To summarize, accurate data is essential for driving sales and marketing in a B2B environment; it strengthens decision-making and customer relationships. A proactive approach to B2B data cleansing, supported by our services, helps you stay competitive by unlocking the full potential of your data.
Send us a request and we will be happy to assist you.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Sample data for exercises in Further Adventures in Data Cleaning.
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection.
Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments:
Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sentiment.
OpenRefine (formerly Google Refine) is a powerful free and open source tool for data cleaning, enabling you to correct errors in the data, and make sure that the values and formatting are consistent. In addition, OpenRefine records your processing steps, enabling you to apply the same cleaning procedure to other data, and enhancing the reproducibility of your analysis. This workshop will teach you to use OpenRefine to clean and format data and automatically track any changes that you make.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:

- ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment, and ensure you activate the environment before running the code.
- pythonProject: This folder contains several .py files and subfolders, each with specific functionality as described below.
  - A .png file is generated for each column of the raw gaze and IMU recordings, color-coded with logged events, along with .csv files.
  - overlapping_sliding_window_loop.py: The function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out; if you want to see visually what has been changed compared to the original data, you can uncomment this line. Results are saved as .csv files in the results folder.
  - This part contains three main code blocks, the third (iii) being the XGBoost code with correct hyperparameter tuning.
  - Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
  - A .csv file containing the inferred labels is produced as output.

The data is licensed under CC-BY; the code is licensed under MIT.
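For readers unfamiliar with the third block mentioned above, the sketch below shows, in generic form, what XGBoost hyperparameter tuning followed by a probability ("score") threshold on unseen data can look like. This is not the authors' code: the parameter grid, data, and thresholding rule are illustrative assumptions only; the paper derives its confidence threshold empirically in lines 361 to 380 of the actual block.

```python
# Generic illustration of XGBoost tuning + score thresholding; not the authors' code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_unseen, y_train, y_unseen = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameter tuning via cross-validated grid search (grid values are placeholders).
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    cv=3,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Classification "scores" for unseen data; keep only confident predictions.
scores = grid.best_estimator_.predict_proba(X_unseen)[:, 1]
confidence_threshold = 0.8          # placeholder value, not the paper's empirical threshold
confident = np.abs(scores - 0.5) * 2 >= confidence_threshold
print(f"{confident.mean():.0%} of unseen samples exceed the confidence threshold")
```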
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing to the LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on quantifying the meaning of research texts and to make it available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing
Step 1: Downloading of the Data Online
The dataset is collected manually by exporting documents as tab-delimited files online. All documents are available online.

Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents are extracted to R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts or without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words is done by sampling medicine-related publications with human intervention. Detected concatenated words are split into two words; for instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings in such abstracts are listed below:
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
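Before moving on to the Version 2 steps, here is a minimal sketch of the Step 4 correction, assuming a simple rule that splits a known heading from the word fused to it. The corpus authors' actual procedure relied on sampling with human intervention, so this is only illustrative.

```python
import re

# A few known structured-abstract headings (subset of the list above).
HEADINGS = ["Conclusions", "Conclusion", "Results", "Methods", "Background"]

def split_concatenated_headings(text: str) -> str:
    """Insert a space between a section heading and the word fused to it,
    e.g. 'ConclusionHigher' -> 'Conclusion Higher'."""
    for heading in HEADINGS:
        # heading immediately followed by another capitalised word
        text = re.sub(rf"\b({heading})(?=[A-Z][a-z])", r"\1 ", text)
    return text

print(split_concatenated_headings("ConclusionHigher rates were observed."))
# -> "Conclusion Higher rates were observed."
```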
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can include a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors’ rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as ‘Published by Elsevier Ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licences and permission policies identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words). 474 texts were removed.

Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official VHFPS interviews, and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, and the supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps (an illustrative sketch follows the list):
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are advised to choose the most appropriate one and write down the respondent’s answer in detail, so that the survey management team can decide which code is most suitable for that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values lying below the 5th percentile or above the 95th percentile, by listening to interview recordings.
• Final check on matching the main dataset with the different sections; sections where information is collected at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
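As an illustration of the duplicate-removal and outlier-screening steps above, here is a small pandas sketch. The variable names and toy values are hypothetical; the actual VHFPS cleaning was carried out on the survey team's own systems.

```python
import pandas as pd

# Hypothetical submitted forms; column names and values are illustrative only.
forms = pd.DataFrame({
    "household_id": [101, 101, 102, 103, 104],
    "submission_time": pd.to_datetime(
        ["2020-08-01", "2020-08-02", "2020-08-01", "2020-08-03", "2020-08-03"]),
    "monthly_income": [5.0, 5.0, 7.2, 150.0, 6.1],
})

# Remove household duplicates where the same form was submitted more than once,
# keeping the latest submission.
forms = (forms.sort_values("submission_time")
              .drop_duplicates(subset="household_id", keep="last"))

# Flag outliers outside the 5th-95th percentile range for manual review
# (e.g. by listening to the interview recording).
low, high = forms["monthly_income"].quantile([0.05, 0.95])
forms["income_outlier"] = ~forms["monthly_income"].between(low, high)
print(forms)
```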
Palestinian society's access to information and communication technology tools is one of the main inputs for achieving social development and economic change, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics (PCBS) to provide official Palestinian statistics on various areas of life, PCBS implemented the household survey on information and communications technology for the year 2019. The main objective of this report is to present trends in access to and use of information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.
Palestine, West Bank, Gaza strip
Household, Individual
All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.
Sample survey data [ssd]
Sampling Frame The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of the sample selection.
Sample size The estimated sample size is 8,040 households.
Sample Design The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages:
Stage 1: Selection of a stratified sample of 536 enumeration areas using the PPS method.
Stage 2: Selection of a stratified random sample of 15 households from each enumeration area selected in the first stage.
Stage 3: Selection of one person from the age group (10 years and above) at random using Kish tables.
Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, refugee camps).
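The three-stage selection logic can be sketched as follows. The frame, sizes, and column names below are invented for illustration and do not reproduce PCBS's actual frame or selection software (in particular, the Kish-table step is reduced to a plain random draw).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)

# Hypothetical frame of enumeration areas (PSUs) with household counts.
frame = pd.DataFrame({
    "ea_id": range(1, 4001),
    "households": rng.integers(120, 180, size=4000),
})

# Stage 1: select 536 EAs with probability proportional to size (PPS).
p = frame["households"] / frame["households"].sum()
selected_eas = frame.sample(n=536, weights=p, random_state=1)

# Stage 2: select 15 households at random within each selected EA.
def sample_households(row, n=15):
    return rng.choice(row["households"], size=n, replace=False)  # household indices

selected_eas["sampled_households"] = selected_eas.apply(sample_households, axis=1)

# Stage 3: within each sampled household, one person aged 10+ is chosen at random
# (PCBS used Kish tables; a plain random draw stands in for that here).
def pick_respondent(n_eligible):
    return rng.integers(1, n_eligible + 1)

print(selected_eas.head())
print("Example respondent line number in a household with 4 eligible members:",
      pick_respondent(4))
```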
Computer Assisted Personal Interview [capi]
Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.
Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.
Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.
Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once by project management before the training course was conducted; notes and modifications were incorporated into the program by the Data Processing Department, and it was confirmed to be free of errors before going to the field.
Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.
In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.
Data Cleaning After the completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS to identify and correct errors and discrepancies, producing clean and accurate data ready for tabulation and publishing.
Tabulation After the data were checked and cleaned of any errors, tables were extracted according to the prepared list of tables.
The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.
Sampling Errors The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration; therefore, certain differences from the true values obtained through censuses are expected. Variances were calculated for the most important indicators, and the results can be disseminated at the national level and at the level of the West Bank and Gaza Strip.
Non-Sampling Errors Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.
The implementation of the survey encountered non-response; the most common case was households not being present at home during the fieldwork visit. The total non-response rate reached 17.5%. The refusal rate reached 2.9%, which is relatively low compared to other household surveys conducted by PCBS, largely because the survey questionnaire is clear.
The National Labor Force Survey (SAKERNAS) is a survey designed to observe the general situation of the workforce and to detect changes in workforce structure between enumeration periods. Since the survey was initiated in 1976, it has undergone a series of changes affecting its coverage, the frequency of enumeration, the number of households sampled and the type of information collected. It is the largest and most representative source of employment data in Indonesia. For each selected household, general information about each household member was collected, including name, relationship to the head of household, sex, and age. Household members aged 10 years and over were then asked about their marital status, education and employment.
SAKERNAS aims to gather information that meets three objectives: 1. Employment by education, working hours, industrial classification and employment status; 2. Unemployment and underemployment by different characteristics and efforts in looking for work; 3. The working-age population not in the labor force (e.g. attending school, doing housekeeping and others).
The data for the quarterly SAKERNAS gathered in 1989 covered all provinces in Indonesia, with 65,440 households in both rural and urban areas, and are representative down to the provincial level. The main household data are taken from the core questionnaire SAK89-AK.
National coverage*, including urban and rural areas, representative down to the provincial level.
*) Although SAKERNAS covers all of Indonesia, in some years not all provinces were covered. For example, in 2000 the Province of Maluku was excluded from SAKERNAS because of the conflict occurring there. The separation of East Timor from Indonesia in 1999 also changed the scope of SAKERNAS in subsequent years. Later, as a consequence of the expansion of regional autonomy, the proportion of samples per province also changed, as in 2006 when there were already 33 provinces. However, these differences affect only the scope and level of coverage, not the overall pattern. On the other hand, changes in methodology (including sample size) over time are likely to affect the outcomes; for example, in 2000 and 2001, when the sample sizes were only 32,384 and 34,176 households, the data were representative only at the island level (the sample size was insufficient to be representative at the provincial level).
Individual
The survey covered all de jure household members (usual residents) aged 10 years and over residing in the household. However, diplomatic corps households, households in specific enumeration areas, and specific households in regular enumeration areas are not chosen as part of the sample.
Sample survey data
Quarterly SAKERNAS 1989 was implemented in the whole territory of the Republic of Indonesia, with a total sample of about 65,440 households in both rural and urban areas, representative down to the provincial level. Diplomatic corps households, households in specific enumeration areas, and specific households in regular enumeration areas are not chosen as part of the sample. The dataset combines the results of the four quarterly SAKERNAS rounds in 1989, i.e. quarter I, quarter II, quarter III, and quarter IV.
The implementation of SAKERNAS 1989 includes samples from previous enumeration activities (rotation method). The sampling method* is the same as that used for SAKERNAS from 1986 to 1989: part of the households selected in the previous quarter are re-enumerated, and part are drawn from households selected in other earlier quarters, so there is no need to enrol entirely new households. The procedure for selecting sample households is described in more detail in the enumerators'/supervisors' manual document.
*) The sampling method varies across years. For example, in the 1986-1989 period SAKERNAS used the rotation method, in which most of the households selected in one period were selected again in the following period; this was typical of the quarterly SAKERNAS of that period. Other periods used a multi-stage sampling method (two or three stages, depending on whether census sub-blocks / segment groups were included) or a combination of multi-stage sampling and rotation (e.g. SAKERNAS 2006-2010).
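A toy sketch of the rotation idea follows, assuming purely for illustration that half of the sampled households carry over from one quarter to the next; the real SAKERNAS rotation fractions and frame are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1989)
frame = np.arange(100_000)              # hypothetical household identifiers
quarter_size = 16_360                   # illustrative per-quarter sample size

# Quarter I: draw a fresh sample.
q1 = rng.choice(frame, size=quarter_size, replace=False)

# Quarter II: keep half of Quarter I (the rotating panel) and replace the rest.
carried_over = rng.choice(q1, size=quarter_size // 2, replace=False)
fresh = rng.choice(np.setdiff1d(frame, q1),
                   size=quarter_size - len(carried_over), replace=False)
q2 = np.concatenate([carried_over, fresh])

overlap = np.intersect1d(q1, q2).size / quarter_size
print(f"Share of Quarter I households re-enumerated in Quarter II: {overlap:.0%}")  # ~50%
```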
Face-to-face
In SAKERNAS, the questionnaire has been designed in a simple and concise way, so that respondents understand the aim of each question and so that memory lapses and respondent disengagement during data collection are minimized. Furthermore, the design of the SAKERNAS questionnaire is kept stable in order to maintain data comparability.
A household questionnaire was administered in each selected household, which collected general information of household members that includes name, relationship with head of the household, sex and age. Household members aged 10 years and over were then asked about their marital status, education and occupation.
Data processing in SAKERNAS goes through the following stages:
- Batching
- Editing
- Coding
- Data Entry
- Validation
- Tabulation
Sampling error results are presented at the end of the publication The State of Labor Force in Indonesia and in the publication The State of Workers in Indonesia.
MIT License: https://opensource.org/licenses/MIT
Overview
This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.
Content
The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.
Description
This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.
The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:
Studying the similarities between different types of wine.
Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.
Key Features
Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.
Usage
This dataset is useful for:
Data cleaning and preprocessing exercises.
Duplicate detection and handling techniques.
Exploring the impact of duplicates on data analysis and machine learning models.
Educational purposes for understanding the importance of data quality.
Studying similarities between different types of wine and their characteristics.
File Structure
1dd.json: red wine duplicate records.
1ddw.json: white wine duplicate records.
ddrw.json: A file containing information about 100% identical characteristics of red and white wines.
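If you want to reproduce a basic duplicate check yourself, a short pandas sketch along these lines works on the original Wine Quality features. The file name, the "type" column, and the decision to ignore the quality rating are assumptions for illustration, not a description of how the published JSON files were built.

```python
import pandas as pd

# Hypothetical path to a combined red/white Wine Quality csv.
wines = pd.read_csv("winequality.csv")

# Flag rows whose chemical properties are 100% identical, ignoring quality and wine type.
feature_cols = [c for c in wines.columns if c not in ("quality", "type")]
wines["is_duplicate"] = wines.duplicated(subset=feature_cols, keep=False)

print(wines["is_duplicate"].value_counts())
print(wines[wines["is_duplicate"]].sort_values(feature_cols).head())
```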
Acknowledgements
This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Overview and Contents
This replication package was assembled in January of 2025. The code in this repository generates the 13 figures and content of the 3 tables for the paper “All Forecasters Are Not the Same: Systematic Patterns in Predictive Performance”. It also generates the 2 figures and content of the 5 tables in the appendix to this paper. The main contents of the repository are the following:
- Code/: folder of scripts to prepare and clean data as well as generate tables and figures.
- Functions/: folder of subroutines for use with MATLAB scripts.
- Data/: data folder.
  - Raw/: ECB SPF forecast data, realizations of target variables, and start and end bins for density forecasts.
  - Intermediate/: data used at intermediate steps in the cleaning process. These datasets are generated with x01_Raw_Data_Shell.do, x02a_Individual_Uncertainty_GDP.do, x02b_Individual_Uncertainty_HICP.do, x02c_Individual_Uncertainty_Urate.do, x03_Pull_Data.do, x04_Data_Clean_And_Merge, and x05_Drop_Low_Counts.do in the Code/ folder.
  - Ready/: data used to conduct regressions, statistical tests, and generate figures.
- Output/: folder of results.
  - Figures/: .jpg files for each figure used in the paper and its appendix.
  - HL Results/: results from applying the Hounyo and Lahiri (2023) testing procedure for equal predictive performance to ECB SPF forecast data. This folder contains the material for Tables 1A-4A.
  - Regressions/: regression results, as well as material for Tables 3 and 5A.
  - Simulations/: results from the simulation exercise as well as the datasets used to create Figures 9-12.
  - Statistical Tests/: results displayed in Tables 1 and 2.
The repository also contains the manuscript, the appendix, and this read-me file.

Disclaimer
This replication package was produced by the authors and is not an official product of the Federal Reserve Bank of Cleveland. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by the Federal Reserve Bank of Cleveland or the Federal Reserve System.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE CENTRAL AGENCY FOR PUBLIC MOBILIZATION AND STATISTICS (CAPMAS)
In any society, the human element represents the basis of the workforce, which carries out all service and production activities. It is therefore essential to produce labor force statistics and studies related to the growth and distribution of manpower and the distribution of the labor force by its different types and characteristics.
In this context, the Central Agency for Public Mobilization and Statistics conducts "Quarterly Labor Force Survey" which includes data on the size of manpower and labor force (employed and unemployed) and their geographical distribution by their characteristics.
By the end of each year, CAPMAS issues the annual aggregated labor force bulletin publication that includes the results of the quarterly survey rounds that represent the manpower and labor force characteristics during the year.
----> Historical Review of the Labor Force Survey:
1- The first Labor Force Survey was undertaken in 1957, with the first round conducted in November of that year. The survey has continued to be conducted in successive rounds (quarterly, bi-annually, or annually) ever since.
2- Starting with the October 2006 round, the fieldwork of the labor force survey was developed to focus on the following two points: a. The importance of using a panel sample, as part of the survey sample, to monitor the dynamic changes of the labor market. b. Improving the questionnaire to include more questions that help in better defining the relationship to the labor force of each household member (employed, unemployed, out of the labor force, etc.), in addition to re-ordering some existing questions in a more logical way.
3- Starting with the January 2008 round, the methodology was developed to collect a more representative sample throughout the survey year. This is done by distributing the sample of each governorate into five groups; the questionnaires are collected from each group separately every 15 days over 3 months (in the middle and at the end of each month).
----> The survey aims at covering the following topics:
1- Measuring the size of the Egyptian labor force among civilians (for all governorates of the republic) by their different characteristics.
2- Measuring the employment rate at national level and different geographical areas.
3- Measuring the distribution of employed people by the following characteristics: gender, age, educational status, occupation, economic activity, and sector.
4- Measuring unemployment rate at different geographic areas.
5- Measuring the distribution of unemployed people by the following characteristics: gender, age, educational status, unemployment type "ever employed/never employed", occupation, economic activity, and sector for people who have ever worked.
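As a simple illustration of objectives 2-5, such rates and distributions can be computed from harmonized microdata roughly as follows. The file name and column names here are hypothetical and do not correspond to the ERF harmonized variable names.

```python
import pandas as pd

# Hypothetical harmonized microdata; column names are illustrative only.
lfs = pd.read_csv("egypt_lfs_harmonized.csv")

employed = lfs["labor_status"].eq("employed")
unemployed = lfs["labor_status"].eq("unemployed")
in_labor_force = employed | unemployed

# Weighted unemployment rate at the national level.
w = lfs["sample_weight"]
unemployment_rate = (w[unemployed].sum() / w[in_labor_force].sum()) * 100
print(f"Unemployment rate: {unemployment_rate:.1f}%")

# Distribution of the employed by gender and educational status (weighted shares).
dist = (lfs[employed]
        .groupby(["gender", "education"])["sample_weight"].sum()
        .div(w[employed].sum()) * 100)
print(dist.round(1))
```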
The raw survey data provided by the statistical agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data of existing labor force surveys in several Arab countries.
Covering a sample of urban and rural areas in all the governorates.
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE CENTRAL AGENCY FOR PUBLIC MOBILIZATION AND STATISTICS (CAPMAS)
----> Sample Design and Selection
The sample of the LFS 2006 survey is a simple systematic random sample.
----> Sample Size
The sample size varied by quarter (Q1 = 19,429; Q2 = 19,419; Q3 = 19,119; Q4 = 18,835 households), for a total of 76,802 households annually. These households are distributed at the governorate level (urban/rural).
A more detailed description of the different sampling stages and allocation of sample across governorates is provided in the Methodology document available among external resources in Arabic.
Face-to-face [f2f]
The questionnaire design follows the latest International Labor Organization (ILO) concepts and definitions of labor force, employment, and unemployment.
The questionnaire comprises 3 tables in addition to the identification and geographic data of household on the cover page.
----> Table 1- Demographic and employment characteristics and basic data for all household individuals
Including: gender, age, educational status, marital status, residence mobility and current work status
----> Table 2- Employment characteristics table
This table is filled in for individuals who were employed at the time of the survey or engaged in work during the reference week, and provides information on: - Relationship to employer: employer, self-employed, waged worker, and unpaid family worker - Economic activity - Sector - Occupation - Effective working hours - Work place - Average monthly wage
----> Table 3- Unemployment characteristics table
This table is filled in for all unemployed individuals who satisfy the unemployment criteria, and provides information on: - Type of unemployment (unemployed, unemployed ever worked) - Economic activity and occupation in the last job held before becoming unemployed - Duration of the last unemployment spell in months - Main reason for unemployment
----> Raw Data
Office editing is one of the main stages of the survey. It starts once the questionnaires are received from the field and is carried out by selected work groups. It includes: (a) editing for coverage and completeness; (b) editing for consistency.
----> Harmonized Data
This activity provides students with a basic introduction to data science. Students will work their way through the process of downloading online data, data cleaning, analysis, and presentation. Estimated total time is four hours, but the activity can easily be broken into several blocks. Opportunities for formative and summative assessment are provided at the end of each major step. This activity would fit well in courses for ecology, environmental science/policy, and water science. It uses citizen science data from the Chesapeake Water Watch program.
Cyclistic: Google Data Analytics Capstone Project
Cyclistic - Google Data Analytics Certification Capstone Project
Moirangthem Arup Singh

How Does a Bike-Share Navigate Speedy Success?

Background: This project is for the Google Data Analytics Certification capstone project. I am wearing the hat of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.

This project will be completed by using the 6 Data Analytics stages:
Ask: Identify the business task and determine the key stakeholders.
Prepare: Collect the data, identify how it’s organized, determine the credibility of the data.
Process: Select the tool for data cleaning, check for errors and document the cleaning process.
Analyze: Organize and format the data, aggregate the data so that it’s useful, perform calculations and identify trends and relationships.
Share: Use design thinking principles and a data-driven storytelling approach, present the findings with effective visualization. Ensure the analysis has answered the business task.
Act: Share the final conclusion and the recommendations.

Ask: Business Task: Recommend marketing strategies aimed at converting casual riders into annual members by better understanding how annual members and casual riders use Cyclistic bikes differently.
Stakeholders: Lily Moreno: The director of marketing and my manager. Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program. Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.

Prepare: For this project, I will use the public data of Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under the license. I downloaded the ZIP files containing the csv files from the above link, but while uploading the files in Kaggle (as I am using a Kaggle notebook), it gave me a warning that the dataset is already available in Kaggle. So I will be using the cyclistic-bike-share dataset from Kaggle. The dataset has 13 csv files from April 2020 to April 2021. For the purpose of my analysis I will use the csv files from April 2020 to March 2021. The source csv files are in Kaggle so I can rely on their integrity. I am using Microsoft Excel to get a glimpse of the data.
There is one csv file for each month, each containing information about the bike rides: ride id, rideable type, start and end time, start and end station, and latitude and longitude of the start and end stations.

Process: I will use R in Kaggle to import the dataset, check how it’s organized, verify whether all the columns have appropriate data types, find outliers, and check whether any of these data have sampling bias. I will be using the R libraries below.
library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotrix)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4 ✔ tibble 3.1.4 ✔ dplyr 1.0.7 ✔ tidyr 1.1.3 ✔ stringr 1.4.0 ✔ readr 2.0.1 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
setwd("/kaggle/input/cyclistic-bike-share")
r_202004 <- read.csv("202004-divvy-tripdata.csv") r_202005 <- read.csv("20...
The 2016 Integrated Household Panel Survey (IHPS) was launched in April 2016 as part of the Malawi Fourth Integrated Household Survey fieldwork operation. The IHPS 2016 targeted 1,989 households that were interviewed in the IHPS 2013 and that could be traced back to half of the 204 enumeration areas that were originally sampled as part of the Third Integrated Household Survey (IHS3) 2010/11. The 2019 IHPS was launched in April 2019 as part of the Malawi Fifth Integrated Household Survey fieldwork operations targeting the 2,508 households that were interviewed in 2016. The panel sample expanded each wave through the tracking of split-off individuals and the new households that they formed. Available as part of this project is the IHPS 2019 data, the IHPS 2016 data as well as the rereleased IHPS 2010 & 2013 data including only the subsample of 102 EAs with updated panel weights. Additionally, the IHPS 2016 was the first survey that received complementary financial and technical support from the Living Standards Measurement Study – Plus (LSMS+) initiative, which has been established with grants from the Umbrella Facility for Gender Equality Trust Fund, the World Bank Trust Fund for Statistical Capacity Building, and the International Fund for Agricultural Development, and is implemented by the World Bank Living Standards Measurement Study (LSMS) team, in collaboration with the World Bank Gender Group and partner national statistical offices. The LSMS+ aims to improve the availability and quality of individual-disaggregated household survey data, and is, at start, a direct response to the World Bank IDA18 commitment to support 6 IDA countries in collecting intra-household, sex-disaggregated household survey data on 1) ownership of and rights to selected physical and financial assets, 2) work and employment, and 3) entrepreneurship – following international best practices in questionnaire design and minimizing the use of proxy respondents while collecting personal information. This dataset is included here.
National coverage
The IHPS 2016 and 2019 attempted to track all IHPS 2013 households stemming from 102 of the original 204 baseline panel enumeration areas, as well as individuals who moved away from the 2013 dwellings between 2013 and 2016, as long as they were neither servants nor guests at the time of the IHPS 2013, were projected to be at least 12 years of age, and were known to be residing in mainland Malawi, excluding Likoma Island and institutions such as prisons, police compounds, and army barracks.
Sample survey data [ssd]
A sub-sample of IHS3 2010 sample enumeration areas (EAs) (i.e. 204 EAs out of 768 EAs) was selected prior to the start of the IHS3 fieldwork with the intention to (i) track and resurvey these households in 2013 in accordance with the IHS3 fieldwork timeline, as part of the Integrated Household Panel Survey (IHPS 2013), and (ii) visit a total of 3,246 households in these EAs twice to reduce the recall bias associated with different aspects of agricultural data collection. At baseline, the IHPS sample was selected to be representative at the national, regional and urban/rural levels and for each of the following 6 strata: (i) Northern Region - Rural, (ii) Northern Region - Urban, (iii) Central Region - Rural, (iv) Central Region - Urban, (v) Southern Region - Rural, and (vi) Southern Region - Urban. The IHPS 2013 main fieldwork took place during the period of April-October 2013, with residual tracking operations in November-December 2013.
Given budget and resource constraints, for the IHPS 2016 the number of sample EAs in the panel was reduced to 102 out of the 204 EAs. As a result, the domains of analysis are limited to the national, urban and rural areas. Although the results of the IHPS 2016 cannot be tabulated by region, the stratification of the IHPS by region, urban and rural strata was maintained. The IHPS 2019 tracked all individuals 12 years or older from the 2016 households.
Computer Assisted Personal Interview [capi]
Data Entry Platform To ensure data quality and the timely availability of data, the IHPS 2019 was implemented using the World Bank’s Survey Solutions CAPI software. To carry out the IHPS 2019, 1 laptop computer and a wireless internet router were assigned to each team supervisor, and each enumerator had an 8-inch GPS-enabled Lenovo tablet computer provided by the NSO. The use of Survey Solutions allowed for the real-time availability of data: as interviews were completed, they were approved by the supervisor and synced to the headquarters server as frequently as possible. While administering the first module of the questionnaire, the enumerators also used their tablets to record the GPS coordinates of the dwelling units. Geo-referenced household locations from the tablets complemented the GPS measurements taken by the Garmin eTrex 30 handheld devices, and these were linked with publicly available geospatial databases to enable the inclusion of a number of geospatial variables - extensive measures of distance (i.e. distance to the nearest market), climatology, soil and terrain, and other environmental factors - in the analysis.
Data Management The IHPS 2019 Survey Solutions CAPI data entry application was designed to streamline the data collection process in the field. IHPS 2019 interviews were mainly collected in “sample” mode (assignments generated from headquarters) and a few in “census” mode (new interviews created by interviewers from a template), to give the NSO more control over the sample. This hybrid approach was necessary to aid the tracking operations, whereby an enumerator could quickly create a tracking assignment, considering that teams were mostly working in areas with poor network connections and hence could not quickly receive tracking cases from headquarters.
The range and consistency checks built into the application were informed by the LSMS-ISA experience with the IHS3 2010/11, IHPS 2013 and IHPS 2016. Prior programming of the data entry application allowed a wide variety of range and consistency checks to be conducted and reported, and potential issues to be investigated and corrected before closing the assigned enumeration area. Headquarters (the NSO management) assigned work to the supervisors based on their regions of coverage. The supervisors then made assignments to the enumerators linked to their supervisor account. The work assignments and syncing of completed interviews took place through a Wi-Fi connection to the IHPS 2019 server. Because the data were available in real time, they were monitored closely throughout the entire data collection period, and upon receipt at headquarters, data were exported to Stata for additional consistency checks, data cleaning, and analysis.
Data Cleaning The data cleaning process was done in several stages over the course of fieldwork and through preliminary analysis. The first stage of data cleaning was conducted in the field by the field teams, using error messages generated by the Survey Solutions application when a response did not fit the rules for a particular question. For questions that flagged an error, the enumerators were expected to record a comment within the questionnaire explaining the reason for the error to their supervisor and confirming that they had double-checked the response with the respondent. The supervisors were expected to sync the enumerator tablets as frequently as possible to avoid having many questionnaires on the tablet and to enable daily checks of questionnaires. Some supervisors preferred to review completed interviews on the tablets, so they would review prior to syncing, but still record the notes in the supervisor account and reject questionnaires accordingly. The second stage of data cleaning was also done in the field and resulted from the additional error reports generated in Stata, which were in turn sent to the field teams via email or Dropbox. The field supervisors collected the reports for their assignments and, in coordination with the enumerators, reviewed, investigated, and corrected errors. Due to the quick turnaround in error reporting, it was possible to conduct call-backs while the team was still operating in the EA when required. Corrections to the data were entered in the rejected questionnaires and sent back to headquarters.
The data cleaning process was done in several stages over the course of the fieldwork and through preliminary analyses. The first stage was during the interview itself. Because CAPI software was used, as enumerators asked the questions and recorded information, error messages were provided immediately when the information recorded did not match previously defined rules for that variable. For example, if the education level for a 12 year old respondent was given as post graduate. The second stage occurred during the review of the questionnaire by the Field Supervisor. The Survey Solutions software allows errors to remain in the data if the enumerator does not make a correction. The enumerator can write a comment to explain why the data appears to be incorrect. For example, if the previously mentioned 12 year old was, in fact, a genius who had completed graduate studies. The next stage occurred when the data were transferred to headquarters where the NSO staff would again review the data for errors and verify the comments from the
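The age-versus-education rule mentioned above can be expressed as a simple cross-variable check. The sketch below is a generic pandas illustration with invented column names and thresholds; the NSO's actual checks were built into Survey Solutions and Stata, not into this code.

```python
import pandas as pd

# Hypothetical roster extract; column names and values are invented for illustration.
roster = pd.DataFrame({
    "case_id":   ["A-01", "A-02", "B-01"],
    "age":       [12, 34, 52],
    "education": ["postgraduate", "secondary", "primary"],
})

# Flag implausible combinations, e.g. a 12-year-old reported as a postgraduate.
MIN_AGE = {"postgraduate": 20, "secondary": 12, "primary": 6}
roster["flag_age_education"] = roster["age"] < roster["education"].map(MIN_AGE)

# Flagged cases would be sent back to the field team with a comment rather than
# silently corrected, mirroring the review workflow described above.
print(roster[roster["flag_age_education"]])
```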
The war Israel launched on the Gaza Strip led to the total destruction of the infrastructure of the Gaza Strip and turned the whole area into a disaster zone. More than 1,300 people were killed from the start of the war on December 27, 2008 until its end, more than 5,400 people were wounded, more than 3,000 establishments and thousands of homes were destroyed, and tens of thousands of people were displaced.
The reconstruction of what the war destroyed requires accurate and reliable data in all fields, including the socioeconomic, environmental, health, and construction fields. The purpose of conducting the Impact of the War and Siege on Gaza Strip Survey is to produce findings that can serve as a tool for planning the reconstruction and for reviewing the situation of Palestinian households in the Gaza Strip from the economic, health, education, and environmental perspectives.
The survey was conducted in cooperation with a number of international organizations, including the WFP and FAO, in order to assess the impact of the war on the socioeconomic situation of households in the Gaza Strip and to establish the plans and policies necessary to improve the socioeconomic, environmental, health, and educational conditions of the Gaza Strip.
GAZA STRIP
Household
The study population of the survey consists of all Palestinian households living in the Gaza Strip in the aftermath of the last war (December 27, 2008 - January 17, 2009).
Sample survey data [ssd]
Sampling frame
The sampling frame was established from the data of the Population, Housing, and Establishment Census, which PCBS conducted in late 2007. The frame is a list of enumeration areas; these areas are used as Primary Sampling Units (PSUs) in the first stage of the sample selection process.
Sample design strata
The study population is divided according to the 33 localities of the Gaza Strip.
Sample design
The sample is a stratified cluster random sample selected in two stages:
Stage one: A stratified random sample of 207 enumeration areas was selected from the Gaza Strip localities.
Stage two: A random sample of 35, 50, or 100 households was selected from each enumeration area chosen in stage one.
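As a rough illustration of this two-stage selection logic (not PCBS's actual procedure), the following sketch first samples enumeration areas within locality strata and then samples households within the selected areas; the data, cluster sizes, and column names are entirely hypothetical.

```python
# Toy illustration of a two-stage stratified cluster sample.
import pandas as pd

# Frame of enumeration areas (PSUs), stratified by locality.
frame = pd.DataFrame({
    "ea_id": range(1, 11),
    "locality": ["A"] * 5 + ["B"] * 5,
})

# Stage one: select enumeration areas within each locality stratum.
psus = frame.groupby("locality").sample(n=2, random_state=1)

# Toy household listing: three households per enumeration area.
households = pd.DataFrame({
    "ea_id": [ea for ea in frame["ea_id"] for _ in range(3)],
    "hh_id": list(range(1, 4)) * len(frame),
})

# Stage two: select a fixed number of households within each selected EA.
sample = (households[households["ea_id"].isin(psus["ea_id"])]
          .groupby("ea_id")
          .sample(n=2, random_state=1))
print(sample)
```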
Face-to-face [f2f]
Survey's questionnaire
The survey questionnaire of The Impact of the War and Siege on Gaza Strip Survey, 2009 is the main instrument for data collection, and its design therefore took into consideration the standard technical specifications that facilitate the collection, processing, and analysis of data. Because this type of specialized survey is new to PCBS, relevant experiences of other countries and international best practices were thoroughly reviewed to ensure that the contents and design of the survey's instruments meet international standards. The survey's questionnaire includes the following basic components:
Identification data:
The identification data constitutes the key that uniquely identifies each questionnaire.
The key consists of the questionnaire's serial number and the household number.
Data quality controls:
A set of quality controls were developed and incorporated into the different phases of The Impact of the War and Siege on Gaza Strip Survey, 2009, including field operations, office editing, office coding, data processing, and survey documentation.
Data processing went through a number of stages, from start to the preparation of the final files. The stages include:
1. Programming stage: This stage included preparation of the data entry programs using the ACCESS package, setting up entry rules to ensure good entry of questionnaires, and setting up cleaning inquiries to examine the data after entry. Such inquiries examine the variables at the questionnaire level.
2. Receiving and controlling questionnaires stage: This stage included receiving questionnaires from the field work coordinator using the format especially prepared for this purpose, and checking against it that all questionnaires were received.
3. Entry stage: The data entry process started on June 24, 2009 and ended on July 29, 2009. The number of questionnaires entered at the PCBS main office was 7,543.
4. Data auditing stage: This stage included checking after entry, by comparing the entered data with the original questionnaires and correcting entry errors, if any.
5. Data cleansing: Comprehensive cleansing rules were set up among the questions of the questionnaire to ensure consistency and to identify out-of-context or illogical answers. This was done using a special program applied to the data. The errors were either corrected or the questionnaires were returned to the survey manager to correct the errors.
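As an illustration of the kind of cleansing inquiries described in stages 1 and 5 (the survey itself used an ACCESS-based system), a comparable check could be written as follows; the column names and thresholds are hypothetical.

```python
# A minimal sketch, in Python/pandas, of post-entry cleansing inquiries:
# uniqueness of the identification key and a simple plausibility rule.
import pandas as pd

def cleansing_report(df: pd.DataFrame) -> pd.DataFrame:
    issues = []
    # The identification key (questionnaire serial number + household number)
    # must be unique.
    dups = df[df.duplicated(subset=["questionnaire_serial", "household_number"],
                            keep=False)]
    for _, row in dups.iterrows():
        issues.append((row["questionnaire_serial"], "duplicate identification key"))
    # Example consistency rule: household size must be positive and plausible.
    bad_size = df[(df["household_size"] < 1) | (df["household_size"] > 30)]
    for _, row in bad_size.iterrows():
        issues.append((row["questionnaire_serial"], "implausible household size"))
    return pd.DataFrame(issues, columns=["questionnaire_serial", "issue"])
```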
The sample size of the survey was 7,500 households. The response rate was 100%.
The data of this survey are of high quality.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Wombat Value of Information (VOI) Analysis
This repository contains the code and data for the manuscript "Natural History Collections at the Crossroads: Shifting Priorities and Data-Driven Opportunities"
Owen Forbes1*, Peter H. Thrall1, Andrew G. Young1, Cheng Soon Ong2
1 CSIRO National Research Collections Australia, Canberra, ACT 2601, Australia
2 CSIRO Data61, Canberra, ACT 2601, Australia
*Corresponding author: owen.forbes@csiro.au
## Repository Overview
This project applies a Value of Information (VOI) analytical framework to wombat occurrence data from GBIF. It demonstrates a novel research prioritisation approach that integrates:
- **Value of Information (VOI)**: Expected information gain from additional observations
- **Need for Information (NFI)**: Habitat condition/loss metrics
- **Cost of Information (COI)**: Remoteness areas affecting sampling feasibility
The analysis uses a binomial model as a simple species distribution model to evaluate the research value of new wombat observations across different locations in NSW and ACT, Australia.
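As a hedged sketch of how an expected information gain can be computed for a Beta-binomial cell model (this illustrates the general idea and is not necessarily the exact formulation implemented in `Wombats_VOI.qmd`, which is written in R):

```python
# Expected KL information gain from one additional survey-year in a grid
# cell, under a Beta-binomial detection model with a uniform Beta(1,1) prior.
import numpy as np
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    """KL divergence KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def expected_info_gain(k, n):
    """Expected KL gain from one more observation year in a cell with
    k detection-years out of n."""
    a, b = k + 1, n - k + 1            # current posterior
    p_detect = a / (a + b)             # posterior predictive detection probability
    gain_detect = kl_beta(a + 1, b, a, b)
    gain_absent = kl_beta(a, b + 1, a, b)
    return p_detect * gain_detect + (1 - p_detect) * gain_absent

print(expected_info_gain(k=2, n=10))   # sparsely observed cell
print(expected_info_gain(k=8, n=10))   # well-observed cell
```

Cells with few observation-years yield larger expected gains, which is the intuition behind prioritising under-sampled locations.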
## Repository Structure
The repository consists of one main Quarto (.qmd) script and associated data files:
- `Wombats_VOI.qmd`: Primary analysis script containing data cleaning, modelling, and visualisation
## Data Files
### GBIF Wombat Occurrence Data
- `0014630-250127130748423.csv` - GBIF wombat occurrence export (download available from: https://www.gbif.org/occurrence/download/0014630-250127130748423 )
### Environmental and Administrative Data
- `HCAS31_AHC_2020_2022_NSW_50km_3577.tif` - Habitat condition raster for NSW (from CSIRO Habitat Condition Assessment System - subset of full dataset available from https://data.csiro.au/collection/csiro%3A63571v7 )
- `RA_2021_AUST_GDA2020` - ABS Remoteness Areas 2021 shapefile (folder containing multiple files - available from https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files
Specific shapefile URL: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/RA_2021_AUST_GDA2020.zip )
## Requirements
- R (version 4.3.2 or later)
- Required R packages:
- tidyverse (v2.0.0) - for data manipulation and visualization
- sf (v1.0-16) - for spatial data handling
- lubridate (v1.9.3) - for date handling
- raster (v3.6-26) - for raster data handling
- ozmaps (v0.4.5) - for Australian state boundaries
- terra (v1.7-71) - for efficient raster processing
- exactextractr (v0.10.0) - for exact extraction from raster to polygons
- patchwork (v1.2.0) - for combining plots
- corrplot (v0.92) - for correlation matrix visualization
- RColorBrewer (v1.1-3) - for color palettes
- gridExtra (v2.3) - for arranging multiple plots
- grid - for low-level graphics (base R)
Install these packages before running the script.
## Analysis Process
The analysis follows these key steps:
1. **Data Cleaning**
- Filter to NSW and ACT wombat records
- Remove records with high uncertainty (>10km)
- Create a 50km grid system across NSW and ACT
2. **Temporal Analysis**
- Calculate presence/absence of wombats in each grid cell by year
- Determine empirical observation probability for each grid cell
- Analyse temporal stability using 5-year sliding windows
3. **Value of Information Calculation**
- Calculate expected information gain through KL divergence
- Simulate adding new observations using the binomial model
- Map VOI across NSW and ACT
4. **Need for Information Integration**
- Import and process habitat condition data
- Convert to habitat loss metric (NFI)
- Integrate with VOI analysis
5. **Cost of Information**
- Process remoteness area data as a proxy for sampling cost
- Map remoteness areas across the study region
6. **Combined Analysis**
- Create quadrant analysis combining VOI and NFI
- Generate comprehensive visualisations of all three metrics
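A rough Python sketch of the quadrant classification in step 6 (the actual analysis is implemented in R in `Wombats_VOI.qmd`; the `voi` and `nfi` column names are hypothetical):

```python
# Classify grid cells into quadrants by whether VOI and NFI exceed their medians.
import pandas as pd

def quadrant(df: pd.DataFrame) -> pd.Series:
    high_voi = df["voi"] >= df["voi"].median()
    high_nfi = df["nfi"] >= df["nfi"].median()
    labels = pd.Series("low VOI / low NFI", index=df.index)
    labels[high_voi & high_nfi] = "high VOI / high NFI"   # priority cells
    labels[high_voi & ~high_nfi] = "high VOI / low NFI"
    labels[~high_voi & high_nfi] = "low VOI / high NFI"
    return labels
```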
## How to Use
1. Download this repository to your local machine.
2. Set your working directory to the location of the script.
3. Ensure all required R packages are installed.
4. Run the script in RStudio or your preferred R environment.
## Outputs
The primary output is `combined_quadrant_analysis_100425.png`, which displays four panels:
1. Value of Information (VOI) - Expected information gain across the study area
2. Need for Information (NFI) - Habitat loss percentiles
3. Cost of Information (COI) - Remoteness areas as a proxy for sampling cost
4. VOI vs NFI Quadrant Analysis - Relationship between information value and need
## Citation
If you use this code or methodology, please cite:
Forbes, O., Thrall, P.H., Young, A.G., Ong, C.S. (2025). Natural History Collections at the Crossroads: Shifting Priorities and Data-Driven Opportunities. [Journal information pending]
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:
Continental Europe
Great Britain
Nordic
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].
Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].
Content of the repository
A) Scripts
In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.
In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).
The python scripts run with Python 3.7 and with the packages found in "requirements.txt".
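For illustration only, a cleansing step of the kind performed by "clean_corrupted_data.py" might look like the sketch below; the thresholds and heuristics are assumptions, not the values used in the actual scripts (see the supplementary material of [1] for the documented procedure).

```python
# Illustrative cleansing step: mark corrupted recordings as NaN and fill
# at most one sample per data hole. Thresholds and heuristics are assumptions.
import numpy as np
import pandas as pd

def clean_frequency(series: pd.Series) -> pd.Series:
    s = series.copy()
    # Mark physically implausible values as corrupted.
    s[(s < 45.0) | (s > 55.0)] = np.nan
    # Mark long constant runs (e.g. a stuck recording) as corrupted.
    constant = s.diff().abs() < 1e-6
    s[constant & constant.shift(fill_value=False)] = np.nan
    # Fill at most one sample per data hole; leave longer gaps as NaN.
    return s.interpolate(limit=1, limit_area="inside")
```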
B) Yearly converted and cleansed data
The folders "_converted" contain the output of "convert_data_format.py" and "_cleansed" contain the output of "clean_corrupted_data.py".
File type: The files are zipped csv-files, where each file comprises one year.
Data format: The files contain two columns. The first column contains the time stamps in the format Year-Month-Day Hour-Minute-Second, given as naive local time; the second column contains the frequency values in Hz. The local time refers to the following time zones and includes daylight saving time (python time zone in brackets):
TransnetBW: Continental European Time (CE)
Nationalgrid: Great Britain (GB)
Fingrid: Finland (Europe/Helsinki)
NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
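As an example of loading one yearly file and attaching the documented time zone, the sketch below uses the Fingrid zone; the file name, header layout, and exact timestamp format are assumptions and should be adjusted to the actual files.

```python
# Load one yearly zipped csv and localize the naive timestamps.
import pandas as pd

# Placeholder file name; assumes the zipped csv has no header row.
df = pd.read_csv("2018_cleansed.zip", header=None,
                 names=["time", "frequency_hz"], na_values=["NaN"])

# The documented format is Year-Month-Day Hour-Minute-Second; adjust the
# format string if the actual separator differs.
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H-%M-%S")

# Timestamps are naive local time; localize to the documented zone,
# letting pandas resolve daylight saving transitions from the data order.
df["time"] = df["time"].dt.tz_localize("Europe/Helsinki",
                                       ambiguous="infer",
                                       nonexistent="shift_forward")
print(df.head())
```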
Use cases
We point out that this repository can be used in two different ways:
Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both data sets include segments of NaN-values due to missing and corrupted recordings. Only a very small part of the NaN-values was eliminated in the cleansed data, so as not to alter the data too much.
Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "_converted".
License
This work is licensed under multiple licenses, which are located in the "LICENSES" folder.
We release the code in the folder "Scripts" under the MIT license.
The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.
TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Changelog
Version 2:
Add time zone information to description
Include new frequency data
Update references
Change folder structure to yearly folders
Version 3:
Correct TransnetBW files for missing data in May 2016
The Wound Cleansing Spray market has gained significant traction in recent years, driven by growing awareness of wound care and the importance of maintaining hygiene to prevent infection. These sprays are specifically formulated to cleanse wounds, providing a crucial first step in the healing process.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE DEPARTMENT OF STATISTICS OF THE HASHEMITE KINGDOM OF JORDAN
The Department of Statistics (DOS) carried out four rounds of the 2016 Employment and Unemployment Survey (EUS). The survey rounds covered a sample of about forty-nine thousand households nationwide. The sampled households were selected using a stratified multi-stage cluster sampling design.
It is worth mentioning that the DOS employed new technology in data collection and data processing: data were collected using an electronic questionnaire on a handheld device (PDA) instead of a hard copy.
The survey's main objectives are:
- To identify the demographic, social, and economic characteristics of the population and manpower.
- To identify the occupational structure and economic activity of employed persons, as well as their employment status.
- To identify the reasons behind the desire of employed persons to search for a new or additional job.
- To measure the economic activity participation rate (the number of economically active persons divided by the population aged 15+).
- To identify the different characteristics of unemployed persons.
- To measure unemployment rates (the number of unemployed persons divided by the number of economically active persons aged 15+) according to the various characteristics of the unemployed, and the changes that might take place in this regard.
- To identify the most important ways and means used by unemployed persons to get a job, in addition to measuring the duration of unemployment for such persons.
- To identify the changes over time that might take place regarding the above-mentioned variables.
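For clarity, the two headline rates defined above can be computed as in the sketch below; the data are made up and the column names are hypothetical.

```python
# Toy illustration of the participation rate and the unemployment rate.
import pandas as pd

people = pd.DataFrame({
    "age": [17, 25, 34, 41, 63, 70],
    "status": ["student", "employed", "unemployed", "employed", "unemployed", "retired"],
})

working_age = people[people["age"] >= 15]
active = working_age[working_age["status"].isin(["employed", "unemployed"])]

participation_rate = len(active) / len(working_age)          # active / population 15+
unemployment_rate = (active["status"] == "unemployed").mean()  # unemployed / active
print(f"participation rate: {participation_rate:.2f}, "
      f"unemployment rate: {unemployment_rate:.2f}")
```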
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve, and disseminate micro data from existing labor force surveys in several Arab countries.
The sample is representative at the national level (Kingdom), the governorate level, and the level of the three regions (Central, North, and South).
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
Computer Assisted Personal Interview [capi]
----> Raw Data
A tabulation plan was set based on the previous Employment and Unemployment Surveys, and the required programs were prepared and tested. When all prior data processing steps were completed, the actual survey results were tabulated using an ORACLE package. The tabulations were then thoroughly checked for consistency. The final report was then prepared, containing detailed tabulations as well as the methodology of the survey.
----> Harmonized Data