97 datasets found

Data Cleaning Sample
borealisdata.ca
Updated Jul 13, 2023
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Data for A Conceptual Model for Transparent, Reusable, and Collaborative...
databank.illinois.edu
Updated Jul 12, 2023
Cite
Nikolaus Parulian (2023). Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning [Dataset]. http://doi.org/10.13012/B2IDB-6827044_V1
Unique identifier
https://doi.org/10.13012/B2IDB-6827044_V1
Dataset updated
Jul 12, 2023
Authors
Nikolaus Parulian
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The dissertation_demo.zip archive contains the base code and demonstrations for the dissertation A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning. Each chapter has a demo folder that demonstrates provenance queries or tools. The Airbnb dataset used for demonstration and simulation is not included in this demo but can be accessed directly from the reference website. Updates to the demonstrations and examples can be found online at: https://github.com/nikolausn/dissertation_demo

Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

technavio.com

Updated Feb 15, 2025

Cite

Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis

Dataset updated

Feb 15, 2025

Dataset provided by

TechNavio

Authors

Technavio

Time period covered

2021 - 2025

Area covered

Global, Canada, United States

Description

Data Science Platform Market Size 2025-2029

The data science platform market size is forecast to increase by USD 763.9 million at a CAGR of 40.2% between 2024 and 2029.

The market is experiencing significant growth, driven by the integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to gain valuable insights from their data more efficiently and effectively, leading to improved decision-making and operational efficiency. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. These technologies offer increased flexibility, scalability, and ease of deployment, making it simpler for businesses to implement and manage their data science initiatives. However, the market is not without challenges. Data privacy and security remain critical concerns, as the use of data science platforms involves handling large volumes of sensitive data.
Ensuring security measures and adhering to data protection regulations are essential for companies seeking to capitalize on the opportunities presented by this dynamic market. Companies must navigate these challenges while staying abreast of emerging trends and technologies to remain competitive and deliver value to their customers.

What will be the Size of the Data Science Platform Market during the forecast period?

The market encompasses a range of software applications that facilitate various stages of the data science workflow, from data acquisition and preprocessing to machine learning model development, training, and distribution. This market is driven by the increasing demand for data exploration and analysis across industries, fueled by the proliferation of machine data from IoT devices and the availability of big data from various sources, including multimedia, business, and consumer data. Data scientists require comprehensive tools to manage the complete life cycle of their projects, from data preparation and cleaning to visualization and modeling. Cloud-based solutions have gained significant traction due to their flexibility and scalability, enabling users to process and analyze large volumes of unstructured and structured data using relational databases and artificial intelligence (AI) and machine learning (ML) techniques.
The market is expected to grow substantially due to the rising adoption of ML models and the need for efficient model development, training, and deployment. Preprocessing, data cleaning, and model distribution are critical components of this market, ensuring the accuracy and reliability of ML models and their seamless integration into various applications. Overall, the market is a dynamic and evolving landscape, offering numerous opportunities for businesses to leverage AI and ML technologies for data-driven insights and decision-making.

How is this Data Science Platform Industry segmented?

The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Deployment

  On-premises
  Cloud


Component

  Platform
  Services


End-user

  BFSI
  Retail and e-commerce
  Manufacturing
  Media and entertainment
  Others


Sector

  Large enterprises
  SMEs


Application

  Data Preparation
  Data Visualization
  Machine Learning
  Predictive Analytics
  Data Governance
  Others


Geography

  North America

    US
    Canada


  Europe

    France
    Germany
    UK


  APAC

    China
    India
    Japan


  South America

    Brazil


  Middle East and Africa

    UAE


  Rest of World (ROW)

By Deployment Insights

The on-premises segment is estimated to witness significant growth during the forecast period. In today's data-driven business landscape, organizations are continually seeking innovative solutions to manage and leverage their structured and unstructured data. While cloud-based solutions have gained popularity for their scalability and cost-effectiveness, on-premises deployment remains a preferred choice for enterprise types with stringent data security requirements. On-premises deployment offers several advantages, including quick adaptation to corporate needs, data security, and the elimination of third-party data maintenance and security concerns. With on-premises software, businesses can avoid data transfer over the internet, ensuring data privacy and confidentiality. Moreover, on-premises solutions enable easy and rapid data access, allowing employees to make data-driven decisions in real-time.

However, on-premises deployment comes with its challenges, such as a lack of workforce with the necessary data skills and technical expertise for model development, deployment, and integration. To address thes

B2B Intent Data - ABM Data - 152M+ Profiles - 13M+ Companies - 150+ Data...
datarade.ai
.csv, .xls
Updated Nov 16, 2024
Cite
Thomson Data (2024). B2B Intent Data - ABM Data - 152M+ Profiles - 13M+ Companies - 150+ Data points - Updated monthly [Dataset]. https://datarade.ai/data-products/b2b-data-cleansing-services-thomson-data
Available download formats: .csv, .xls
Dataset updated
Nov 16, 2024
Dataset authored and provided by
Thomson Data
Area covered
Peru, Malawi, Kenya, Guadeloupe, Brazil, Saudi Arabia, Vietnam, Western Sahara, Panama, Virgin Islands (U.S.)
Description
What is Account-Based-Marketing? Account-based marketing, or ABM, is a business strategy that focuses your resources on a specific segment of customer accounts. It's all about understanding your customers on a personal level and delivering personalized campaigns that resonate with their needs and preferences.

Why should you use Thomson Data’s data solution for Account-Based Marketing (ABM)? Using account-based marketing data for your marketing campaign might seem like a long-drawn-out approach, but it is worth the effort.

Here are some of the benefits you will definitely be interested in.

Boost Lead Generation: Our database is designed for effective account-based marketing that will boost lead generation. We enable you to target specific accounts, and our data insights will help you tailor the messages according to their needs and pain points.

Retain Email Subscribers: Retaining your subscribers is also a challenge. Using our database for account-based marketing will help you connect with your clients on a personal level. Keeping them engaged will encourage these clients to consider your products and services whenever they need them.

Increases profits: As Thomson Data’s records heighten the tone for personalization, you can connect with your prospective clientele on a personal level. When you do it in the right way, it is significantly reflected in your sales figures.

Gain Insights: Get 100+ insights from our data to make better decisions and apply them in your account-based marketing strategies.

Our ABM data can be used to improve your conversions by 3x.

Our account-based marketing data can be used by: 1. B2B companies 2. Sales teams 3. Marketing teams 4. C-suite executives 5. Agencies and service providers 6. Enterprise-level organizations and more.

Thomson Data is well suited to ABM and will help you run campaigns that target customer acquisition as well as customer retention. We provide you with access to the complete data solution to help you connect with and impress your target audience.

Send us a request to know more details about our Account based marketing data and we will be happy to assist you.
Household Survey on Information and Communications Technology 2023 - West...
pcbs.gov.ps
Updated Feb 19, 2025
Cite
Palestinian Central Bureau of Statistics (2025). Household Survey on Information and Communications Technology 2023 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/733
Dataset updated
Feb 19, 2025
Dataset authored and provided by
Palestinian Central Bureau of Statistics (http://pcbs.gov.ps/)
Time period covered
2023 - 2024
Area covered
West Bank, Gaza Strip, Gaza
Description
Abstract

Access to information and communication technology (ICT) tools is one of the main inputs for achieving social development and economic change in Palestinian society, given the impact of the ICT revolution that has become a feature of this era. Therefore, as part of the efforts of the Palestinian Central Bureau of Statistics (PCBS) to provide official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey on information and communications technology for the year 2023. The main objective of this report is to present trends in access to and use of information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.

Geographic coverage

Palestine, West Bank, Gaza strip

Analysis unit

Household, Individual

Universe

All Palestinian households and individuals (10 years and above) whose usual place of residence in 2023 was in the state of Palestine.

Kind of data

Sample survey data [ssd]

Sampling procedure

Sampling Frame The sampling frame consists of the master sample of enumeration areas that were enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of the sampling selection.

Sample Size The sample size is 8,040 households.

Sampling Design The sample is a three-stage stratified cluster sample selected with probability proportional to size (PPS). The design comprised three stages: Stage (1): Selection of a stratified sample of 536 enumeration areas using the PPS method. Stage (2): Selection of a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage (3): Random selection of one person from the (10 years and above) age group using Kish tables.

Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, camps).
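The first two stages described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the frame of enumeration-area sizes is made up, stratification is ignored, systematic selection is assumed as the PPS variant (the description only says "pps"), and stage 3 (one person per household via Kish tables) is left out. The function names are hypothetical. The counts 536 and 15 come from the description, and 536 x 15 gives the stated sample size of 8,040 households.

```python
import random

N_EAS = 536      # enumeration areas selected in stage 1 (from the description)
HH_PER_EA = 15   # households selected per area in stage 2 (from the description)


def pps_systematic(sizes, n, rng):
    """Select n indices with probability proportional to size (systematic PPS)."""
    step = sum(sizes) / n
    start = rng.random() * step
    picks, cum, i = [], 0, 0
    for k in range(n):
        point = start + k * step
        # Advance to the unit whose cumulative size interval contains `point`.
        while cum + sizes[i] <= point and i < len(sizes) - 1:
            cum += sizes[i]
            i += 1
        picks.append(i)
    return picks


def draw_sample(ea_sizes, seed=0):
    """Stage 1: PPS selection of areas. Stage 2: simple random households."""
    rng = random.Random(seed)
    sample = []
    for ea in pps_systematic(ea_sizes, N_EAS, rng):
        for hh in rng.sample(range(ea_sizes[ea]), HH_PER_EA):
            sample.append((ea, hh))
    return sample


# Hypothetical frame: 5,000 areas with 100 to 200 households each.
frame_rng = random.Random(1)
ea_sizes = [frame_rng.randint(100, 200) for _ in range(5000)]
sample = draw_sample(ea_sizes)
print(len(sample))  # 536 * 15 = 8040
```

Systematic selection is used here because it is a common way to implement PPS without replacement; the survey documentation does not say which variant was applied.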

Mode of data collection

Computer Assisted Personal Interview [capi]

Research instrument

Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.

Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

Section III: Data on Individuals (10 years and above) about computer use, access to the Internet, possession of a mobile phone, information threats, and E-commerce.

Cleaning operations

Field Editing and Supervising

• Data collection and coordination were carried out in the field according to the pre-prepared plan, with instructions, models and tools available for fieldwork.
• Auditing on the PC-tablet was done through automated and office audit rules built into the program, covering all the required controls according to the specified criteria.
• Owing to the particular circumstances of Jerusalem (J1), data there were collected using a paper questionnaire. The supervisor then verified the questionnaire formally and technically according to the pre-prepared audit rules.
• Fieldwork visits were carried out by the project coordinator, supervisors and project management to check edited questionnaires and the performance of fieldworkers.

Data Processing

Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The project management examined the program more than once before the training course was conducted, and the Data Processing Department incorporated the resulting notes and modifications into the program, ensuring that it was free of errors before going to the field.

Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.

In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.

Data Cleaning After the completion of data entry and audit phase, data is cleaned by conducting internal tests for the outlier answers and comprehensive audit rules through using SPSS program to extract and modify errors and discrepancies to prepare clean and accurate data ready for tabulation and publishing.

Response rate

The response rate reached 83.7%.

Sampling error estimates

Sampling Errors The data of this survey are affected by sampling errors because a sample was used rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators, and there is no problem in disseminating results at the national level and at the level of the West Bank and Gaza Strip.

Non-Sampling Errors Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.

The implementation of the survey encountered non-response, with the case "household was not present at home" during the fieldwork visit accounting for the highest percentage of non-response cases. The total non-response rate reached 16.3%.
CNVVE Dataset clean audio samples
darus.uni-stuttgart.de
Updated Feb 13, 2024
Cite
Ramin Hedeshy; Raphael Menges; Steffen Staab (2024). CNVVE Dataset clean audio samples [Dataset]. http://doi.org/10.18419/DARUS-3898
Unique identifier
https://doi.org/10.18419/DARUS-3898
Dataset updated
Feb 13, 2024
Dataset provided by
DaRUS
Authors
Ramin Hedeshy; Raphael Menges; Steffen Staab
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset funded by
BMBF
BMWK/ESF
Description
This CNVVE Dataset contains clean audio samples encompassing six distinct classes of voice expressions, namely “Uh-huh” or “mm-hmm”, “Uh-uh” or “mm-mm”, “Hush” or “Shh”, “Psst”, “Ahem”, and continuous humming, e.g., “hmmm.” Audio samples of each class are found in the respective folders. These audio samples have undergone a thorough cleaning process. The raw samples are published at https://doi.org/10.18419/darus-3897. Initially, we applied the Google WebRTC voice activity detection (VAD) algorithm to the given audio files to remove noise or silence from the collected voice signals. The intensity, which can take a value between "1" and "3", was set to "2". However, because of variations in the data, some files required additional manual cleaning to deal with outliers characterized by sharp click sounds (such as those occurring at the end of recordings). The samples were recorded through a dedicated data-collection website that defines the purpose and type of voice data by providing participants with example recordings as well as the expressions’ written equivalent, e.g., “Uh-huh”. Audio recordings were automatically saved in the .wav format and kept anonymous, with a sampling rate of 48 kHz and a bit depth of 32 bits. For more information, please check the paper or contact the authors.
Synthetic Data for an Imaginary Country, Sample, 2023 - World
microdata.worldbank.org
Updated Jul 7, 2023
Cite
Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
Dataset updated
Jul 7, 2023
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World
Description
Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

Sample survey data [ssd]

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
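The arithmetic behind this procedure can be sketched in Python (the dataset itself ships an R script, which is not reproduced here). With 8,000 households and 25 per enumeration area, 8,000 / 25 = 320 enumeration areas have to be spread across the strata in proportion to their size. The stratum sizes below are invented for illustration, the function name is hypothetical, and the largest-remainder rounding rule is an assumption, since the description does not say how fractional allocations were rounded.

```python
def allocate_eas(stratum_sizes, sample_households=8000, per_ea=25):
    """Allocate enumeration areas to strata proportionally to stratum size."""
    n_eas = sample_households // per_ea          # 8000 / 25 = 320 areas
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    # Start from the integer part of each quota.
    alloc = {s: int(q) for s, q in quotas.items()}
    # Hand the remaining areas to the strata with the largest fractional parts
    # (largest-remainder rounding, an assumed rule).
    leftover = n_eas - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


# Hypothetical strata (geo_1 x urban/rural) with household counts.
strata = {
    ("province_1", "urban"): 520_000,
    ("province_1", "rural"): 310_000,
    ("province_2", "urban"): 120_000,
    ("province_2", "rural"): 50_000,
}
allocation = allocate_eas(strata)
print(sum(allocation.values()))        # 320 enumeration areas
print(sum(allocation.values()) * 25)   # 8000 households
```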

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

Response rate

This is a synthetic dataset; the "response rate" is 100%.
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Available download formats: .json, .xml, .csv, .xls
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
Moldova (Republic of), Andorra, Tunisia, Taiwan, Canada, Isle of Man, Bangladesh, Northern Mariana Islands, Nepal, British Indian Ocean Territory
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

Owler: Gain valuable business insights and competitive intelligence.

AngelList: Receive fresh startup data transformed into actionable insights.

CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.

Craft.co: Make data-informed business decisions with Craft.co's company datasets.

Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
A set of generated Instagram Data Download Packages (DDPs) to investigate...
data.niaid.nih.gov
Updated Jan 28, 2021
Cite
Laura Boeschoten (2021). A set of generated Instagram Data Download Packages (DDPs) to investigate their structure and content [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4472605
Dataset updated
Jan 28, 2021
Dataset provided by
Laura Boeschoten
Ruben van den Goorbergh
Daniel Oberski
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Instagram data-download example dataset

In this repository you can find a data-set consisting of 11 personal Instagram archives, or Data-Download Packages (DDPs).

How the data was generated

These Instagram accounts were all new and were created by a group of researchers who wanted to work out in detail the structure, and the variety in structure, of these Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week. The participants also communicated intensively with each other so that the data can be used as an example of a network.

The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs in particular contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses and URLs. In addition, the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties. However, only content from so-called 'professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.

Obtaining your Instagram DDP

After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.

Go to www.instagram.com and log in

Click on your profile picture, go to Settings and Privacy and Security

Scroll to Data download and click Request download

Enter your email address and click Next

Enter your password and click Request download

Instagram then delivered the data in a compressed zip folder with the format username_YYYYMMDD.zip (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.
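As a small illustration of the naming scheme described above, the following Python sketch splits an archive name of the form username_YYYYMMDD.zip into the handle and the download date. The function name and the example file name are hypothetical. Because Instagram handles may themselves contain underscores, the sketch assumes that the date is the final eight-digit group before the extension.

```python
import re
from datetime import date, datetime

# The username is everything before the final "_YYYYMMDD.zip" suffix.
DDP_NAME = re.compile(r"^(?P<username>.+)_(?P<stamp>\d{8})\.zip$")


def parse_ddp_name(filename):
    """Return (username, download_date) for a DDP archive file name."""
    match = DDP_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a DDP archive name: {filename!r}")
    # strptime also rejects impossible dates such as 20211345.
    day = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    return match.group("username"), day


# Hypothetical archive name, not one of the published packages.
username, downloaded = parse_ddp_name("example_user_20210128.zip")
print(username, downloaded.isoformat())  # example_user 2021-01-28
```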

Data cleaning

To comply with the Instagram user agreement, participants shared their full name, phone number and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements so that the DDPs shared here do not contain any personal data of the participants.

How this data-set can be used

This data-set was generated with the intention of evaluating the performance of de-identification software. We invite other researchers to use this data-set, for example to investigate what type of data can be found in Instagram DDPs or to investigate the structure of Instagram DDPs. The packages can also be used for example data analyses, although no substantive research questions can be answered using this data, as the data does not reflect how research subjects behave 'in the wild'.

Authors

The data collection was carried out by Laura Boeschoten, Ruben van den Goorbergh and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.

Acknowledgments

The researchers would like to thank everyone who participated in this data-generation project.
Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...
researchdata.tuwien.at
html, pdf, zip
Updated Mar 19, 2025
Cite
Negar Alinaghi; Ioannis Giannopoulos (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
Available download formats: html, zip, pdf
Unique identifier
https://doi.org/10.48436/m2ha4-t1v92
Dataset updated
Mar 19, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
How To Cite?

Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

Folder Structure

The folder named “submission” contains the following:

“pythonProject”: This folder contains all the Python files and subfolders needed for analysis.

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

This folder contains the data used for the results reported in the paper.

Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

This folder will be generated when you run the code and will store the output of each step.

The current folder contains results created during code debugging for the submission.

When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

Contains reusable functions used throughout the analysis.

Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

Generates scatter plots like those in Figure 3 of the paper.

Although the code has been run for all 309 trials, it can be used to check the sample data provided.

Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.

Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

Implements overlapping sliding window segmentation and generates plots like those in Figure 4.

Output:

Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.

Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.

A visualization of the segments, similar to Figure 4, will be automatically generated.
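
The segmentation described above can be sketched as follows. This is a minimal illustration, not the code in overlapping_sliding_window_loop.py: it assumes that the default means window lengths of 2 to 10 seconds, each moved forward in 1-second steps, and the function name `sliding_windows` and the (timestamp, value) sample layout are hypothetical.

```python
def sliding_windows(samples, window_s, step_s=1.0):
    """Yield overlapping segments of `samples`, a time-sorted list of (timestamp_s, value)."""
    if not samples:
        return
    start = samples[0][0]
    end = samples[-1][0]
    # Keep only windows that fit entirely inside the recording.
    while start + window_s <= end:
        yield [s for s in samples if start <= s[0] < start + window_s]
        start += step_s


# Toy recording (hypothetical data): one sample every 0.5 s from t=0.0 to t=5.0.
recording = [(i * 0.5, i) for i in range(11)]

# One list of segments per window length, 2 to 10 seconds.
segments = {w: list(sliding_windows(recording, w)) for w in range(2, 11)}
```

In this toy case a 5-second recording yields four 2-second segments and no segments for windows longer than the recording; a real pipeline would additionally write each segment to a .csv file.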

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

These files compute features as explained in Tables 1 and 2 of the paper, respectively.

They process the segmented recordings generated by the overlapping_sliding_window_loop.py.

Usage: To see how the features are calculated, run these files after the sliding-window segmentation; they compute the features from the segmented data.
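
For readers unfamiliar with the term, the sketch below assumes that "IDT" refers to the generic dispersion-threshold fixation identification algorithm (I-DT). It is an illustration of that textbook algorithm only and is not the updated implementation in gaze_features.py; the function names, the (t, x, y) sample layout, and the threshold values are hypothetical.

```python
def dispersion(points):
    """Dispersion of (t, x, y) points: (max x - min x) + (max y - min y)."""
    xs = [p[1] for p in points]
    ys = [p[2] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def idt_fixations(gaze, max_dispersion, min_duration):
    """Return (start_t, end_t, centroid_x, centroid_y) for each detected fixation.

    `gaze` is a time-sorted list of (t, x, y); `min_duration` uses the units of t.
    """
    fixations = []
    i, n = 0, len(gaze)
    while i < n:
        # Grow the initial window until it spans the minimum duration.
        j = i
        while j < n and gaze[j][0] - gaze[i][0] < min_duration:
            j += 1
        if j >= n:
            break  # not enough samples left to cover the minimum duration
        if dispersion(gaze[i:j + 1]) <= max_dispersion:
            # Extend the window while the points stay within the dispersion limit.
            while j + 1 < n and dispersion(gaze[i:j + 2]) <= max_dispersion:
                j += 1
            window = gaze[i:j + 1]
            cx = sum(p[1] for p in window) / len(window)
            cy = sum(p[2] for p in window) / len(window)
            fixations.append((window[0][0], window[-1][0], cx, cy))
            i = j + 1
        else:
            i += 1  # drop the first point and try again
    return fixations


# Hypothetical samples (t in ms): six points near (100, 100), then six near (300, 300).
gaze = [(t * 100, 100 + (t % 2), 100) for t in range(6)]
gaze += [(t * 100, 300, 300 + (t % 2)) for t in range(6, 12)]
fixations = idt_fixations(gaze, max_dispersion=10, min_duration=200)
```

On this toy input the two clusters are reported as two separate fixations, because adding the first point of the second cluster pushes the dispersion far above the limit.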

5. training_prediction.py

This file contains the main machine learning analysis of the paper: all the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.

A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line.

b. Training/Validation/Test Split

Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).

Make sure that you follow the instructions in the comments to the code exactly.

Output: The split data is saved as .csv files in the results folder.
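
As a rough illustration of a training/validation/test split, the sketch below assigns whole trials to one split so that segments from the same trial never appear in two splits. This grouping, the 70/15/15 fractions, the `trial_id` column, and the function name `split_by_group` are all assumptions for illustration; the actual procedure is the one described in Section 5.1.1 and in the code comments.

```python
import random


def split_by_group(rows, group_key, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole groups (e.g. trials) to train/val/test so that no group is split."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_train = round(len(groups) * fractions[0])
    n_val = round(len(groups) * fractions[1])
    split_of = {}
    for index, group in enumerate(groups):
        if index < n_train:
            split_of[group] = "train"
        elif index < n_train + n_val:
            split_of[group] = "val"
        else:
            split_of[group] = "test"
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[split_of[row[group_key]]].append(row)
    return splits


# Hypothetical feature rows: 20 trials with 5 segments each.
rows = [{"trial_id": t, "segment": s} for t in range(20) for s in range(5)]
splits = split_by_group(rows, "trial_id")
```

Writing each split to a .csv file in the results folder is omitted here.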

c. Machine and Deep Learning Experiments

This part contains three main code blocks:

MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, uncomment this block and comment out the following blocks accordingly.

XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.

XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

d. Inference (Monitoring Part)

Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.

Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.

This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY, the code is licensed under MIT.
Agriculture Sample Census Survey 2002-2003 - Tanzania
catalog.ihsn.org
datacatalog.ihsn.org
Updated Mar 29, 2019
Cite
National Bureau of Statistics (2019). Agriculture Sample Census Survey 2002-2003 - Tanzania [Dataset]. https://catalog.ihsn.org/catalog/1086
Explore at:
Dataset updated
Mar 29, 2019
Dataset provided by
National Bureau of Statistics
Office of Chief Government Statistician-Zanzibar
Time period covered
2004
Area covered
Tanzania
Description
Abstract

The 2003 Agriculture Sample Census was designed to meet the data needs of a wide range of users down to district level, including policy makers at local, regional and national levels, rural development agencies, funding institutions, researchers, NGOs, farmer organisations, etc. As a result, the dataset has both a larger sample and a more detailed scope than previous censuses and surveys. To date this is the most detailed Agricultural Census carried out in Africa.

The census was carried out in order to:
· Identify structural changes, if any, in the size of farm household holdings, crop and livestock production, and farm input and implement use. It also seeks to determine if there are any improvements in rural infrastructure and in the level of agriculture household living conditions;
· Provide benchmark data on productivity, production and agricultural practices in relation to policies and interventions promoted by the Ministry of Agriculture and Food Security and other stakeholders;
· Establish baseline data for the measurement of the impact of high level objectives of the Agriculture Sector Development Programme (ASDP), National Strategy for Growth and Reduction of Poverty (NSGRP) and other rural development programs and projects;
· Obtain benchmark data that will be used to address specific issues such as food security, rural poverty, gender, agro-processing, marketing, service delivery, etc.

Geographic coverage

Tanzania Mainland and Zanzibar

Analysis unit

Households

Individuals

Universe

Large scale, small scale and community farms.

Kind of data

Census/enumeration data [cen]

Sampling procedure

The Mainland sample consisted of 3,221 villages. These villages were drawn from the National Master Sample (NMS) developed by the National Bureau of Statistics (NBS) to serve as a national framework for the conduct of household based surveys in the country. The National Master Sample was developed from the 2002 Population and Housing Census. The total Mainland sample was 48,315 agricultural households. In Zanzibar a total of 317 enumeration areas (EAs) were selected and 4,755 agriculture households were covered. Nationwide, all regions and districts were sampled with the exception of three urban districts (two from Mainland and one from Zanzibar).

In both Mainland and Zanzibar, a stratified two-stage sample was used. The number of villages/EAs selected in the first stage was based on a probability proportional to the number of villages in each district. In the second stage, 15 households were selected from a list of farming households in each selected village/EA using systematic random sampling, with the village chairpersons assisting to locate the selected households.
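
The second-stage selection can be sketched as below, together with a check that 15 households per selected village/EA reproduces the reported totals. This is a generic illustration of systematic random sampling, not the census software; the function name, the seed, and the 120-household village list are hypothetical, and the first-stage selection of villages is not shown.

```python
import random


def systematic_sample(units, n, rng):
    """Select n units from an ordered list using systematic random sampling."""
    if n >= len(units):
        return list(units)
    interval = len(units) / n          # sampling interval
    start = rng.random() * interval    # random start within the first interval
    picks = (int(start + k * interval) for k in range(n))
    return [units[min(i, len(units) - 1)] for i in picks]


rng = random.Random(2003)
# Hypothetical list of 120 farming households in one selected village.
village_households = [f"hh{i:03d}" for i in range(1, 121)]
selected = systematic_sample(village_households, 15, rng)

# Reported sample sizes: villages/EAs times 15 households each.
mainland_households = 3221 * 15   # 48,315 agricultural households
zanzibar_households = 317 * 15    # 4,755 agricultural households
```

Both products match the household totals given in the sampling description above.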

Mode of data collection

Face-to-face [f2f]

Research instrument

The census covered agriculture in detail as well as many other aspects of rural development and was conducted using three different questionnaires:
• Small scale questionnaire
• Community level questionnaire
• Large scale farm questionnaire

The small scale farm questionnaire was the main census instrument and it includes questions related to crop and livestock production and practices; population demographics; access to services, resources and infrastructure; and issues on poverty, gender and subsistence versus profit making production unit.

The community level questionnaire was designed to collect village level data such as access and use of common resources, community tree plantation and seasonal farm gate prices.

The large scale farm questionnaire was administered to large farms either privately or corporately managed.

Questionnaire Design

The questionnaires were designed following user meetings to ensure that the questions asked were in line with users' data needs. Several features were incorporated into the design of the questionnaires to increase the accuracy of the data:
• Where feasible all variables were extensively coded to reduce post enumeration coding error.
• The definitions for each section were printed on the opposite page so that the enumerator could easily refer to the instructions whilst interviewing the farmer.
• The responses to all questions were placed in boxes printed on the questionnaire, with one box per character. This feature made it possible to use scanning and Intelligent Character Recognition (ICR) technologies for data entry.
• Skip patterns were used to reduce unnecessary and incorrect coding of sections which do not apply to the respondent.
• Each section was clearly numbered, which facilitated the use of skip patterns and provided a reference for data type coding for the programming of CSPro, SPSS and the dissemination applications.

Cleaning operations

Data processing consisted of the following processes:
· Data entry
· Data structure formatting
· Batch validation
· Tabulation

Data Entry

Scanning and ICR data capture technology for the small holder questionnaire were used on the Mainland. This not only increased the speed of data entry, it also increased the accuracy due to the reduction of keystroke errors. Interactive validation routines were incorporated into the ICR software to track errors during the verification process. The scanning operation was so successful that it is highly recommended for adoption in future censuses/surveys. In Zanzibar all data was entered manually using CSPro.

Prior to scanning, all questionnaires underwent a manual cleaning exercise. This involved checking that the questionnaire had a full set of pages, correct identification and good handwriting. A score was given to each questionnaire based on the legibility and the completeness of enumeration. This score will be used to assess the quality of enumeration and supervision in order to select the best field staff for future censuses/surveys.

CSPro was used for data entry of all Large Scale Farm and community based questionnaires due to the relatively small number of questionnaires. It was also used to enter data from the 2,880 small holder questionnaires that were rejected by the ICR extraction application.

Data Structure Formatting

A program was developed in Visual Basic to automatically alter the structure of the output from the scanning/extraction process in order to harmonise it with the manually entered data. The program automatically checked and changed the number of digits for each variable, the record type code, the number of questionnaires in the village, and the consistency of the Village ID Code, and saved the data for each village in a file named after the village code.
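
The kind of harmonisation described above can be illustrated with a short sketch. The original program was written in Visual Basic and is not reproduced here; the Python below is only an analogue, and the field names, digit widths, and example codes are hypothetical.

```python
from collections import defaultdict

# Hypothetical fixed widths (number of digits) per coded variable.
FIELD_WIDTHS = {"village_id": 8, "record_type": 2, "household_no": 3}


def harmonise(record):
    """Return a copy of `record` with each coded field zero-padded to its fixed width."""
    fixed = {}
    for field, width in FIELD_WIDTHS.items():
        digits = str(record[field]).strip()
        if not digits.isdigit() or len(digits) > width:
            raise ValueError(f"{field}={record[field]!r} does not fit {width} digits")
        fixed[field] = digits.zfill(width)
    return fixed


def group_by_village(records):
    """Group harmonised records by village code (one group per output file)."""
    villages = defaultdict(list)
    for record in records:
        fixed = harmonise(record)
        villages[fixed["village_id"]].append(fixed)
    return dict(villages)


# Hypothetical scanned output with inconsistent digit counts.
scanned = [
    {"village_id": "1203041", "record_type": 1, "household_no": "7"},
    {"village_id": "1203041", "record_type": 1, "household_no": "12"},
    {"village_id": "12030502", "record_type": "01", "household_no": "3"},
]
by_village = group_by_village(scanned)
```

Each key of `by_village` would then name one output file; writing the files is left out of the sketch.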

Batch Validation

A batch validation program was developed in order to identify inconsistencies within a questionnaire. This is in addition to the interactive validation during the ICR extraction process. The procedures varied from simple range checking within each variable to the more complex checking between variables. It took six months to screen, edit and validate the data from the smallholder questionnaires. After the long process of data cleaning, tabulations were prepared based on a pre-designed tabulation plan.
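
The two kinds of checks mentioned above (range checks within a variable and consistency checks between variables) can be sketched as follows. The variable names, limits, and the specific cross-variable rule are hypothetical examples and are not taken from the census validation program.

```python
# Hypothetical range rules: field -> (minimum, maximum).
RANGE_RULES = {"household_size": (1, 50), "area_planted_ha": (0.0, 500.0)}


def validate(record):
    """Return a list of human-readable problems found in one questionnaire record."""
    problems = []
    # Simple range checks within each variable.
    for field, (low, high) in RANGE_RULES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    # Cross-variable check: harvested area should not exceed planted area.
    planted = record.get("area_planted_ha")
    harvested = record.get("area_harvested_ha")
    if planted is not None and harvested is not None and harvested > planted:
        problems.append("area_harvested_ha exceeds area_planted_ha")
    return problems


clean = {"household_size": 6, "area_planted_ha": 2.0, "area_harvested_ha": 1.5}
suspect = {"household_size": 0, "area_planted_ha": 1.0, "area_harvested_ha": 3.0}
```

Running `validate` over every record in a batch gives a list of questionnaires to screen and edit by hand.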

Tabulations

Statistical Package for Social Sciences (SPSS) was used to produce the Census tabulations and Microsoft Excel was used to organize the tables and compute additional indicators. Excel was also used to produce charts while ArcView and Freehand were used for the maps.

Analysis and Report Preparation

The analysis in this report focuses on regional comparisons, time series and national production estimates. Microsoft Excel was used to produce charts; ArcView and Freehand were used for maps, whereas Microsoft Word was used to compile the report.

Data Quality

A great deal of emphasis was placed on data quality throughout the whole exercise from planning, questionnaire design, training, supervision, data entry, validation and cleaning/editing. As a result of this, it is believed that the census is highly accurate and representative of what was experienced at field level during the Census year. With very few exceptions, the variables in the questionnaire are within the norms for Tanzania and they follow expected time series trends when compared to historical data. Standard Errors and Coefficients of Variation for the main variables are presented in the Technical Report (Volume I).

Sampling error estimates

Sampling errors are reported on pages 21 to 22 of the Technical Report for the Agriculture Sample Census Survey 2002-2003.
National Labor Force Survey 1989 - Indonesia
catalog.ihsn.org
Updated Mar 29, 2019
Cite
Subdirectorate of Manpower Statistics (2019). National Labor Force Survey 1989 - Indonesia [Dataset]. http://catalog.ihsn.org/catalog/4871
Explore at:
Dataset updated
Mar 29, 2019
Dataset authored and provided by
Subdirectorate of Manpower Statistics
Time period covered
1989
Area covered
Indonesia
Description
Abstract

The National Labor Force Survey (SAKERNAS) is designed to observe the general situation of the workforce and to determine whether the structure of the workforce changed between enumeration periods. Since the survey was initiated in 1976, it has undergone a series of changes affecting its coverage, the frequency of enumeration, the number of households sampled and the type of information collected. It is the largest and most representative source of employment data in Indonesia. For each selected household, general information about each household member was collected, including name, relationship to head of household, sex, and age. Household members aged 10 years and over were asked about their marital status, education and employment.

SAKERNAS aims to gather information that meets three objectives:
1. Employment by education, working hours, industrial classification and employment status;
2. Unemployment and underemployment by different characteristics and efforts in looking for work;
3. Working age population not in the labor force (e.g. attending school, doing housekeeping and others).

The quarterly SAKERNAS data gathered in 1989 covered all provinces in Indonesia, with 65,440 households in both rural and urban areas, and are representative down to the provincial level. The main household data are taken from the SAK89-AK core questionnaire.

Geographic coverage

National coverage*, including urban and rural areas, representative down to the provincial level.

*) Although SAKERNAS covers all of Indonesia, in some years not all provinces were covered. For example, in 2000 the Province of Maluku was excluded from SAKERNAS because of horizontal conflicts there. The separation of East Timor from Indonesia in 1999 also changed the scope of SAKERNAS in subsequent years. Later, as a consequence of the expansion of regional autonomy, the proportion of samples per province also changed, as in 2006, when there were already 33 provinces. However, these differences affect only the scope and level of coverage, not the pattern. On the other hand, changes in methodology (including sample size) over time are likely to affect the outcome: in 2000 and 2001, for example, when the sample size was only 32,384 and 34,176, the data were representative only at island level (the sample was too small to be representative at provincial level).

Analysis unit

Individual

Universe

The survey covered all de jure household members (usual residents) aged 10 years and over who reside in the household. However, Diplomatic Corps households, households in the specific enumeration area, and specific households in the regular enumeration area are not included in the sample.

Kind of data

Sample survey data

Sampling procedure

Quarterly SAKERNAS 1989 was implemented throughout the territory of the Republic of Indonesia, with a total sample of about 65,440 households in both rural and urban areas, representative down to the provincial level. Diplomatic Corps households, households in the specific enumeration area, and specific households in the regular enumeration area are not included in the sample. The dataset combines the sample data from the four quarterly rounds of SAKERNAS in 1989, i.e. quarters I, II, III, and IV.

SAKERNAS 1989 included households sampled in previous enumerations (rotation method). The sampling method* was the same throughout SAKERNAS 1986 to 1989: part of the households sampled in the previous quarter were re-enumerated, together with part of the households selected in other earlier quarters, so that new households did not need to be enrolled. The procedure for selecting sample households is described in more detail in the enumerators'/supervisors' manual.

*) The sampling method varied across years. For example, SAKERNAS 1986-1989 used the rotation method, in which most of the households selected in one period were selected again in the following period. This was common in the quarterly SAKERNAS of that period. Other periods often used a multi-stage sampling method (two or three stages, depending on whether census sub-blocks/segment groups were included), or a combination of multi-stage sampling and the rotation method (e.g. SAKERNAS 2006-2010).

Mode of data collection

Face-to-face

Research instrument

The SAKERNAS questionnaire is designed to be simple and concise, so that respondents understand the aim of each survey question and so that memory lapses and loss of respondent interest during data collection are avoided. The questionnaire design is also kept stable in order to maintain data comparability.

A household questionnaire was administered in each selected household, which collected general information of household members that includes name, relationship with head of the household, sex and age. Household members aged 10 years and over were then asked about their marital status, education and occupation.

Cleaning operations

Data processing in SAKERNAS consists of the following stages: batching, editing, coding, data entry, validation, and tabulation.

Sampling error estimates

Sampling error results are presented at the end of the publication The State of Labor Force in Indonesia and in the publication The State of Workers in Indonesia.
Quantitative Service Delivery Survey in Health 2000 - Uganda
microdata.ubos.org
dev.ihsn.org
Updated Feb 14, 2018
Cite
Makerere Institute for Social Research, Uganda (2018). Quantitative Service Delivery Survey in Health 2000 - Uganda [Dataset]. https://microdata.ubos.org:7070/index.php/catalog/46
Explore at:
Dataset updated
Feb 14, 2018
Dataset provided by
World Bank (http://worldbank.org/)
Ministry of Health of Uganda (http://www.health.go.ug/)
Makerere Institute for Social Research, Uganda
Ministry of Finance, Planning and Economic Development, Uganda
Time period covered
2000
Area covered
Uganda
Description
Abstract

This study examines various dimensions of primary health care delivery in Uganda, using a baseline survey of public and private dispensaries, the most common lower level health facilities in the country.

The survey was designed and implemented by the World Bank in collaboration with the Makerere Institute for Social Research and the Ugandan Ministries of Health and of Finance, Planning and Economic Development. It was carried out in October - December 2000 and covered 155 local health facilities and seven district administrations in ten districts. In addition, 1617 patients exiting health facilities were interviewed. Three types of dispensaries (both with and without maternity units) were included: those run by the government, by private for-profit providers, and by private nonprofit providers, mainly religious.

This research is a Quantitative Service Delivery Survey (QSDS). It collected microlevel data on service provision and analyzed health service delivery from a public expenditure perspective with a view to informing expenditure and budget decision-making, as well as sector policy.

Objectives of the study included:
1) Measuring and explaining the variation in cost-efficiency across health units in Uganda, with a focus on the flow and use of resources at the facility level;
2) Diagnosing problems with facility performance, including the extent of drug leakage, as well as staff performance and availability;
3) Providing information on pricing and user fee policies and assessing the types of service actually provided;
4) Shedding light on the quality of service across the three categories of service provider - government, for-profit, and nonprofit;
5) Examining the patterns of remuneration, pay structure, and oversight and monitoring and their effects on health unit performance;
6) Assessing the private-public partnership, particularly the program of financial aid to nonprofits.

Geographic coverage

The study districts were Mpigi, Mukono, and Masaka in the central region; Mbale, Iganga, and Soroti in the east; Arua and Apac in the north; and Mbarara and Bushenyi in the west.

Analysis unit

local dispensary with or without maternity unit

Universe

The survey covered government, for-profit and nonprofit private dispensaries with or without maternity units in ten Ugandan districts.

Kind of data

Sample survey data [ssd]

Sampling procedure

The survey covered government, for-profit and nonprofit private dispensaries with or without maternity units in ten Ugandan districts.

The sample design was governed by three principles. First, to ensure a degree of homogeneity across sampled facilities, attention was restricted to dispensaries, with and without maternity units (that is, to the health center III level). Second, subject to security constraints, the sample was intended to capture regional differences. Finally, the sample had to include facilities in the main ownership categories: government, private for-profit, and private nonprofit (religious organizations and NGOs). The sample of government and nonprofit facilities was based on the Ministry of Health facility register for 1999. Since no nationwide census of for-profit facilities was available, these facilities were chosen by asking sampled government facilities to identify the closest private dispensary.

Of the 155 health facilities surveyed, 81 were government facilities, 30 were private for-profit facilities, and 44 were nonprofit facilities. An exit poll of clients covered 1,617 individuals.

The final sample consisted of 155 primary health care facilities drawn from ten districts in the central, eastern, northern, and western regions of the country. It included government, private for-profit, and private nonprofit facilities. The nonprofit sector includes facilities owned and operated by religious organizations and NGOs. Approximately one third of the surveyed facilities were dispensaries without maternity units; the rest provided maternity care. The facilities varied considerably in size, from units run by a single individual to facilities with as many as 19 staff members.

The Ministry of Health facility register for 1999 was used to design the sampling frame. Ten districts were randomly selected. From the selected districts, a sample of government and private nonprofit facilities and a reserve list of replacement facilities were randomly drawn. Because of the unreliability of the register for private for-profit facilities, it was decided that for-profit facilities would be identified on the basis of information from the government facilities sampled. The administrative records for facilities in the original sample were first reviewed at the district headquarters, where some facilities that did not meet selection criteria and data collection requirements were dropped from the sample. These were replaced by facilities from the reserve list. Overall, 30 facilities were replaced.

The sample was designed in such a way that the proportion of facilities drawn from different regions and ownership categories broadly mirrors that of the universe of facilities. Because no nationwide census of for-profit health facilities is available, it is difficult to assess the extent to which the sample is representative of this category. A census of health care facilities in selected districts, carried out in the context of the Delivery of Improved Services for Health (DISH) project supported by the U.S. Agency for International Development (USAID), suggests that about 63 percent of all facilities operate on a for-profit basis, while government and nonprofit providers run 26 and 11 percent of facilities, respectively. This would suggest an undersampling of private providers in the survey. It is not clear, however, whether the DISH districts are representative of other districts in Uganda in terms of the market for health care.
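
The undersampling remark can be made concrete by comparing the sample composition reported above (81 government, 30 for-profit, and 44 nonprofit facilities) with the DISH census shares. The short calculation below uses only those published figures; the variable names are illustrative.

```python
# Facilities surveyed, by ownership category (from the sampling description).
sample_counts = {"government": 81, "for-profit": 30, "nonprofit": 44}

# Ownership shares (percent) suggested by the DISH census of selected districts.
dish_share_pct = {"government": 26, "for-profit": 63, "nonprofit": 11}

total = sum(sample_counts.values())
sample_share_pct = {k: 100 * v / total for k, v in sample_counts.items()}

# Positive gap: category is over-represented in the sample relative to DISH.
gap_pct_points = {k: sample_share_pct[k] - dish_share_pct[k] for k in sample_counts}

for category in sample_counts:
    print(f"{category}: sample {sample_share_pct[category]:.1f}%, "
          f"DISH {dish_share_pct[category]}%, gap {gap_pct_points[category]:+.1f}")
```

For-profit facilities make up about 19 percent of the sample against 63 percent in the DISH districts, which is the gap behind the undersampling remark, subject to the caveat that the DISH districts may not be representative.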

For the exit poll, 10 interviews per facility were carried out in approximately 85 percent of the facilities. In the remaining facilities the target of 10 interviews was not met, as a result of low activity levels.

Sampling deviation

In the first stage in the sampling process, eight districts (out of 45) had to be dropped from the sample frame due to security concerns. These districts were Bundibugyo, Gulu, Kabarole, Kasese, Kibaale, Kitgum, Kotido, and Moroto.

Mode of data collection

Face-to-face [f2f]

Research instrument

The following survey instruments are available:

District Health Team Questionnaire;

District Facility Data Sheets;

Uganda Health Facility Survey Questionnaire;

Facility Data Sheets;

Facility Patient Exit Poll Questionnaire.

The survey collected data at three levels: district administration, health facility, and client. In this way it was possible to capture central elements of the relationships between the provider organization, the frontline facility, and the user. In addition, comparison of data from different levels (triangulation) permitted cross-validation of information.

At the district level, a District Health Team Questionnaire was administered to the district director of health services (DDHS), who was interviewed on the role of the DDHS office in health service delivery. Specifically, the questionnaire collected data on health infrastructure, staff training, support and supervision arrangements, and sources of financing.

The District Facility Data Sheet was used at the district level to collect more detailed information on the sampled health units for fiscal 1999-2000, including data on staffing and the related salary structures, vaccine supplies and immunization activity, and basic and supplementary supplies of drugs to the facilities. In addition, patient data, including monthly returns from facilities on total numbers of outpatients, inpatients, immunizations, and deliveries, were reviewed for the period April-June 2000.

At the facility level, the Uganda Health Facility Survey Questionnaire collected a broad range of information related to the facility and its activities. The questionnaire, which was administered to the in-charge, covered characteristics of the facility (location, type, level, ownership, catchment area, organization, and services); inputs (staff, drugs, vaccines, medical and nonmedical consumables, and capital inputs); outputs (facility utilization and referrals); financing (user charges, cost of services by category, expenditures, and financial and in-kind support); and institutional support (supervision, reporting, performance assessment, and procurement). Each health facility questionnaire was supplemented by a Facility Data Sheet (FDS). The FDS was designed to obtain data from the health unit records on staffing and the related salary structure; daily patient records for fiscal 1999-2000; the type of patients using the facility; vaccinations offered; and drug supply and use at the facility.

Finally, at the facility level, an exit poll was used to interview about 10 patients per facility on the cost of treatment, drugs received, perceived quality of services, and reasons for using that unit instead of alternative sources of health care.

Cleaning operations

Detailed information about data editing procedures is available in "Data Cleaning Guide for PETS/QSDS Surveys" in external resources.

STATA cleaning do-files and the data quality reports on the datasets can also be found in external resources.
S1 Data -
plos.figshare.com
zip
Updated Oct 11, 2023
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.s001
Dataset updated
Oct 11, 2023
Dataset provided by
PLOS ONE
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving early warning of customer churn with machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed on a 900,000 telecom customer personal characteristics and historical behavior data set, using Python. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was put forward. These four models and four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)) were each used to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score and other indicators, and the RF-Adaboost dual-ensemble model has the best performance. The recall rates of BPNN, RF, Adaboost and the RF-Adaboost dual-ensemble model on positive samples are 79%, 90%, 89%, and 93%, respectively; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. For the RF-Adaboost dual-ensemble model, the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
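
As a reading aid for the reported metrics, the sketch below recomputes F1 from the precision and recall percentages quoted in the description, using the standard definition F1 = 2PR / (P + R). It uses only the rounded published values, so small differences from the reported F1 scores are to be expected; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# model: (precision, recall, reported F1), all as quoted in the description.
reported = {
    "BPNN": (0.97, 0.79, 0.87),
    "RF": (0.99, 0.90, 0.95),
    "Adaboost": (0.98, 0.89, 0.94),
    "RF-Adaboost": (0.99, 0.93, 0.96),
}

recomputed = {m: round(f1_score(p, r), 2) for m, (p, r, _) in reported.items()}

for model, (_, _, f1) in reported.items():
    print(f"{model}: reported F1 {f1:.2f}, recomputed {recomputed[model]:.2f}")
```

The recomputed values agree with the reported ones for BPNN and RF-Adaboost and come out one point lower for RF and Adaboost, which would be consistent with the published precision and recall having been rounded.
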
🔍 Diverse CSV Dataset Samples
kaggle.com
Updated Nov 6, 2023
Cite
Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection/code
Dataset updated
Nov 6, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Samy Baladram
License
GNU LGPL 3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)
Description
Overview

The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

Files

Below is a brief overview of what each CSV file contains:
- Addresses: Practical examples of string manipulation and address data formatting in CSV.
- Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
- Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
- Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
- Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
- De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
- Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
- Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
- Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
- Grades: Education performance data which can be correlated with demographics or study patterns.
- Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
- Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
- Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
- Height and Weight Measurements: Public health research dataset on anthropometric data.
- Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
- Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
- MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
- MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
- TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
- Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
- Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
- Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
- Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
- Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
- Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
- Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

Format

The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.

Quality Assurance

The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.
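The kinds of checks mentioned above (missing values, duplicate records, inconsistent rows) can be sketched with Python's standard `csv` module. The sample text, column names, and the `check_csv` helper below are hypothetical and stand in for one of the CSV files; this is not the procedure the dataset author used.

```python
import csv
import io

# Hypothetical sample standing in for one of the CSV files.
SAMPLE = """name,height_in,weight_lb
Alex,74,170
Bert,68,
Alex,74,170
Dana,66,124
"""


def check_csv(text: str) -> dict:
    """Count rows, empty cells, ragged rows, and exact duplicate rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first line is assumed to be a header
    rows = [row for row in reader if row]  # skip completely blank lines

    missing = sum(1 for row in rows for cell in row if cell.strip() == "")
    ragged = sum(1 for row in rows if len(row) != len(header))

    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(cell.strip() for cell in row)
        if key in seen:
            duplicates += 1
        seen.add(key)

    return {
        "rows": len(rows),
        "missing_cells": missing,
        "ragged_rows": ragged,
        "duplicate_rows": duplicates,
    }


report = check_csv(SAMPLE)
print(report)
```

Because only some of the files are described as having header lines, the header assumption in `check_csv` would need to be verified per file before relying on the counts.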

Acknowledgements

The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.
US Commercial And Residential Cleaning Services Market Analysis, Size, and...
technavio.com
Updated Jan 15, 2025
Cite
Technavio (2025). US Commercial And Residential Cleaning Services Market Analysis, Size, and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/commercial-and-residential-cleaning-services-market-industry-analysis
Dataset updated
Jan 15, 2025
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
United States
Description
US Commercial and Residential Cleaning Services Market Size 2025-2029

The US commercial & residential cleaning services market size is forecast to increase by USD 37.8 billion at a CAGR of 5.9% between 2024 and 2029.
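To see how the two quoted figures relate, the arithmetic can be worked backwards. The sketch below assumes the 5.9% CAGR compounds annually over five years from a 2024 base; the resulting base and end values are implied by that assumption and are not stated in the report summary.

```python
# Figures quoted in the summary.
increase_usd_bn = 37.8   # forecast increase, USD billion
cagr = 0.059             # compound annual growth rate

# Assumption: 2024 is the base year and growth compounds annually to 2029.
years = 5

growth_factor = (1 + cagr) ** years
# increase = base * (growth_factor - 1), so solve for the base.
implied_base_usd_bn = increase_usd_bn / (growth_factor - 1)
implied_end_usd_bn = implied_base_usd_bn * growth_factor

print(f"growth factor over {years} years: {growth_factor:.4f}")
print(f"implied 2024 market size: USD {implied_base_usd_bn:.1f} billion")
print(f"implied 2029 market size: USD {implied_end_usd_bn:.1f} billion")
```

If the report defines the period or the base differently, the implied sizes change accordingly, so these numbers are best read as a sanity check rather than as market estimates.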

The US Commercial and Residential Cleaning Services Market is experiencing significant growth, driven by the increasing demand for professional cleaning services in both sectors. One key trend is the rising popularity of multifamily dwellings, which present a substantial opportunity for market expansion. Additionally, strategic alliances between industry players are increasingly common, enabling companies to broaden their reach and enhance their offerings. However, the market is not without challenges, most notably the fluctuations in labor wages, which can impact profitability and operational efficiency. To capitalize on market opportunities and navigate challenges effectively, companies must stay informed of industry trends and adapt to the evolving landscape. By focusing on innovation, strategic partnerships, and cost management, they can differentiate themselves and maintain a competitive edge.

What will be the size of the US Commercial And Residential Cleaning Services Market during the forecast period?

The cleaning services market encompasses both commercial and residential properties, catering to the essential duties of maintaining hygiene, health, and cleanliness. In the US, this market exhibits significant activity, driven by the varying cleaning needs of diverse facility types. Commercial properties, including offices, cleanrooms, and industrial spaces, prioritize general cleaning, deep cleaning, and specialized technology to meet stringent sanitary requirements. Residential properties require equally important cleaning services, focusing on customer experience and trained cleaners. Cleaning methods and techniques continue to evolve, with an emphasis on advanced sanitizing and disinfection processes. Cleaning companies invest in innovative cleaning equipment and supplies to meet the demands of their clients. Industrial cleaning services ensure the highest cleaning standards in large-scale facilities, while specialized cleaning companies cater to unique needs, such as medical and healthcare facilities. The cleaning services market is a critical component of maintaining a clean and healthy environment, ensuring businesses and homes operate efficiently and effectively. The market's continued growth is a testament to the importance of cleanliness and the ongoing demand for professional cleaning services.

How is this market segmented?

The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Sector
- Commercial
- Residential
Service Type
- Janitorial services
- Carpet and upholstery cleaning services
- Outdoor areas
- Others
Technique
- Traditional techniques
- Eco-friendly techniques
End-User
- Households
- Offices
- Healthcare Facilities
- Retail
Service Mode
- One-Time
- Recurring (Daily, Weekly, Monthly)
- Seasonal
Geography
- North America: US

By Sector Insights

The commercial segment is estimated to witness significant growth during the forecast period.

The commercial and residential cleaning services market in the US caters to various facility types, including offices, cleanrooms, medical facilities, schools, commercial kitchens, and residential properties. The commercial segment is driven by the need for maintaining hygiene and health in workplaces and healthcare establishments. The cleaning duties for commercial properties involve general cleaning, deep cleaning, sanitizing, and disinfection using industrial-grade equipment and cleaning supplies. The residential segment focuses on household cleaning tools for domestic dwellings, ensuring the quality of cleaning, dependability, and customer experience. The cleaning frequency and intensity vary based on the facility type and cleaning needs.

Trained cleaners employ specialized cleaning techniques and methods to preserve cleanliness and prevent property damage. The effectiveness of cleaning is crucial, and many services offer eco-friendly, or 'green,' cleaning solutions. Sanitizing and disinfection, including electrostatic spray disinfection, are essential for maintaining hygienic conditions in healthcare facilities and commercial kitchens. Bonded and insured cleaning services ensure a reliable and trustworthy cleaning experience for clients.

The Commercial segment was valued at USD 78.20 billion in 2019 and showed a gradual increase during the forecast period.

Market Dynamics

Our researchers analyzed the data with 2024 as the base year,
CleanPatrick
huggingface.co
Cite
Digital Dermatology Lab @ UniBas, CleanPatrick [Dataset]. https://huggingface.co/datasets/Digital-Dermatology/CleanPatrick
Dataset authored and provided by
Digital Dermatology Lab @ UniBas
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
CleanPatrick: A Benchmark for Data Cleaning

Welcome to CleanPatrick, the first large-scale benchmark designed for data cleaning in the image domain. Built on the Fitzpatrick17k dermatology dataset, CleanPatrick is a dataset for measuring the performance in detecting three major data quality issues: off-topic samples, near-duplicates, and label errors.

Overview

CleanPatrick consists of dermatological images annotated with over 500,000 binary labels across three data… See the full description on the dataset page: https://huggingface.co/datasets/Digital-Dermatology/CleanPatrick.
Data to Impact on various cleaning procedures on p-GaN surfaces
rodare.hzdr.de
Updated Feb 23, 2023
Cite
Schaber, Jana; Xiang, Rong; Arnold, André; Ryzhov, Anton; Teichert, Jochen; Murcek, Petr; Zwartek, Paul; Ma, Shuai; Michel, Peter (2023). Data to Impact on various cleaning procedures on p-GaN surfaces [Dataset]. http://doi.org/10.14278/rodare.2168
Unique identifier
https://doi.org/10.14278/rodare.2168
Dataset updated
Feb 23, 2023
Authors
Schaber, Jana; Xiang, Rong; Arnold, André; Ryzhov, Anton; Teichert, Jochen; Murcek, Petr; Zwartek, Paul; Ma, Shuai; Michel, Peter
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The folder "XPS data" contains original and evaluated XPS data (.vms) for a p-GaN sample that was treated at various temperatures and underwent Ar+ irradiation.

Furthermore, the folder "REM Images" contains REM images (.tif) and EDX data (.xlsx) for the excessively treated sample that was used.

All images that are published in the main manuscript are collected as .tif files in the folder "images".
Data from: Evaluation of alternative-design cotton gin lint cleaning...
catalog.data.gov
agdatacommons.nal.usda.gov
+1 more
Updated Jun 5, 2025
Cite
Agricultural Research Service (2025). Data from: Evaluation of alternative-design cotton gin lint cleaning machines on fiber length uniformity index [Dataset]. https://catalog.data.gov/dataset/data-from-evaluation-of-alternative-design-cotton-gin-lint-cleaning-machines-on-fiber-leng
Dataset updated
Jun 5, 2025
Dataset provided by
Agricultural Research Service (https://www.ars.usda.gov/)
Description
This is USDA-ARS data from the publication: "Evaluation of Alternative-Design Cotton Gin Lint Cleaning Machines on Fiber Length Uniformity Index". The study was conducted in 2018 & 2019 with continued sample and data analysis through August 2023. Developing cotton ginning methods that improve fiber length uniformity index to levels that are compatible with the newer and more efficient spinning technologies would expand market share, increase the demand for cotton products, and give U.S. cotton a competitive edge over synthetic fibers. Older studies on lint cleaning machines showed that the most widely used feed mechanism that places fiber on the cleaning cylinder damages fiber and reduces uniformity. The present study evaluates how conventional and experimental feed mechanisms affect uniformity. The lint cleaners were used with both saw and roller gin stands. Four diverse cotton cultivars from the Far West, Southwest, and Mid-South were used in the test. The data included ginning process variables, raw seed cotton characteristics, raw lint High Volume Instrument (HVI), Advanced Fiber Information System (AFIS), and micro-dust and trash analyzer (MDTA3) measurements.
codeparrot-clean
huggingface.co
Updated Dec 7, 2021
Cite
CodeParrot (2021). codeparrot-clean [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-clean
Dataset updated
Dec 7, 2021
Dataset authored and provided by
CodeParrot
Description
CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

Deduplication
- Remove exact matches

Filtering
- Average line length < 100
- Maximum line length < 1000
- Alphanumeric characters fraction > 0.25
- Remove auto-generated files (keyword search)
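The cleaning steps listed above could be sketched roughly as follows. This is an illustration and not the CodeParrot implementation: the keyword list, the choice to scan only the first few lines for keywords, and the helper names `passes_filters` and `clean` are all assumptions.

```python
import hashlib

# Hypothetical keywords; the description only says "keyword search".
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def passes_filters(source: str) -> bool:
    """Apply the line-length, alphanumeric-fraction, and keyword filters."""
    lines = source.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= 100:   # keep average line length < 100
        return False
    if max(lengths) >= 1000:                 # keep maximum line length < 1000
        return False
    alnum_fraction = sum(ch.isalnum() for ch in source) / len(source)
    if alnum_fraction <= 0.25:               # keep alphanumeric fraction > 0.25
        return False
    # Assumption: auto-generation notices appear near the top of the file.
    head = "\n".join(lines[:5]).lower()
    return not any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def clean(files):
    """Drop exact duplicates (by content hash), then apply the filters."""
    seen = set()
    kept = []
    for source in files:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if passes_filters(source):
            kept.append(source)
    return kept


files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",                    # exact duplicate
    "# This file is auto-generated. Do not edit.\nx = 1\n",  # keyword match
    "x = '" + "a" * 1200 + "'\n",                            # line too long
    "#" * 40 + "\n" + "-" * 40 + "\n",                       # mostly symbols
]
cleaned = clean(files)
print(len(cleaned))
```

Hashing the full file content only catches exact matches, which is all the description claims; near-duplicate detection would need a different technique.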

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.

Cite

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

151 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.5683/SP3/ZCN177

Dataset updated

Jul 13, 2023

Dataset provided by

Borealis

Authors

Rong Luo

License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.

