Introduction
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Characters and teams
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
The trip data contains the following columns:
- ride_id: A distinct identifier assigned to each individual ride.
- rideable_type: The type of bike used for the ride.
- started_at: The timestamp when the ride began.
- ended_at: The timestamp when the ride concluded.
- start_station_name: The name of the station where the ride originated.
- start_station_id: The unique identifier of the station where the ride originated.
- end_station_name: The name of the station where the ride concluded.
- end_station_id: The unique identifier of the station where the ride concluded.
- start_lat: The latitude coordinate of the ride's starting point.
- start_lng: The longitude coordinate of the ride's starting point.
- end_lat: The latitude coordinate of the ride's ending point.
- end_lng: The longitude coordinate of the ride's ending point.
- member_casual: Whether the rider is an annual member or a casual user.
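To make the schema concrete, here is a minimal, hypothetical pandas sketch (not part of the original case study) that loads one month of trip data with these columns, derives a ride-length field, and compares members with casual riders; the file name is an assumption.

```python
# Hypothetical sketch (not from the original case study): load one month of
# Cyclistic-style trip data and compare ride lengths by rider type.
# The file name below is an assumption; point it at your own trip-data export.
import pandas as pd

trips = pd.read_csv("cyclistic_trips_2019_q1.csv", parse_dates=["started_at", "ended_at"])

# Derive ride length in minutes and the weekday on which each ride started.
trips["ride_length_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips["day_of_week"] = trips["started_at"].dt.day_name()

# Drop obviously invalid rows (zero or negative durations) before summarizing.
trips = trips[trips["ride_length_min"] > 0]

# Mean ride length and ride count for annual members versus casual riders.
print(trips.groupby("member_casual")["ride_length_min"].agg(["mean", "count"]))
```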
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million at a CAGR of 40.2% between 2024 and 2029.
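As a rough illustration of how a CAGR relates to an absolute increase, the sketch below treats the quoted USD 763.9 million as the total increase over a 2024 base across five years; the implied base is purely arithmetic, not a figure from the report.

```python
# Back-of-the-envelope CAGR arithmetic (illustrative only): assumes the quoted
# USD 763.9 million is the absolute increase over a 2024 base across five years.
cagr = 0.402        # 40.2% compound annual growth rate
years = 5           # 2024 -> 2029
increase = 763.9    # USD million, as quoted above

growth_factor = (1 + cagr) ** years          # total multiplicative growth over the period
implied_base = increase / (growth_factor - 1)
implied_2029 = implied_base * growth_factor

print(f"Implied 2024 base: USD {implied_base:.1f} million")
print(f"Implied 2029 size: USD {implied_2029:.1f} million")
```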
The market is experiencing significant growth, driven by the integration of artificial intelligence (AI) and machine learning (ML). This enhancement enables more advanced data analysis and prediction capabilities, making data science platforms an essential tool for businesses seeking to gain insights from their data. Another trend shaping the market is the emergence of containerization and microservices in platforms. This development offers increased flexibility and scalability, allowing organizations to efficiently manage their projects.
However, the use of platforms also presents challenges, particularly in the area of data privacy and security. Ensuring the protection of sensitive data is crucial for businesses, and platforms must provide strong security measures to mitigate risks. In summary, the market is witnessing substantial growth due to the integration of AI and ML technologies, containerization, and microservices, while data privacy and security remain key challenges.
What will be the Size of the Data Science Platform Market During the Forecast Period?
The market is experiencing significant growth due to the increasing demand for advanced data analysis capabilities in various industries. Cloud-based solutions are gaining popularity as they offer scalability, flexibility, and cost savings. The market encompasses the entire project life cycle, from data acquisition and preparation to model development, training, and distribution. Big data, IoT, multimedia, machine data, consumer data, and business data are prime sources fueling this market's expansion. Unstructured data, previously challenging to process, is now being effectively managed through tools and software. Relational databases and machine learning models are integral components of platforms, enabling data exploration, preprocessing, and visualization.
Moreover, artificial intelligence (AI) and machine learning (ML) technologies are essential for handling complex workflows, including data cleaning, model development, and model distribution. Data scientists benefit from these platforms by streamlining their tasks, improving productivity, and ensuring accurate and efficient model training. The market is expected to continue its growth trajectory as businesses increasingly recognize the value of data-driven insights.
How is this Data Science Platform Industry segmented and which is the largest segment?
The industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment
- On-premises
- Cloud

Component
- Platform
- Services

End-user
- BFSI
- Retail and e-commerce
- Manufacturing
- Media and entertainment
- Others

Sector
- Large enterprises
- SMEs

Geography
- North America (Canada, US)
- Europe (Germany, UK, France)
- APAC (China, India, Japan)
- South America (Brazil)
- Middle East and Africa
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period.
On-premises deployment is a traditional method for implementing technology solutions within an organization. This approach involves purchasing software with a one-time license fee and a service contract. On-premises solutions offer enhanced security, as they keep user credentials and data within the company's premises. They can be customized to meet specific business requirements, allowing for quick adaptation. On-premises deployment eliminates the need for third-party providers to manage and secure data, ensuring data privacy and confidentiality. Additionally, it enables rapid and easy data access, and keeps IP addresses and data confidential. This deployment model is particularly beneficial for businesses dealing with sensitive data, such as those in manufacturing and large enterprises. While cloud-based solutions offer flexibility and cost savings, on-premises deployment remains a popular choice for organizations prioritizing data security and control.
The on-premises segment was valued at USD 38.70 million in 2019 and is expected to show a gradual increase during the forecast period.
Regional Analysis
North America is estimated to contribute 48% to the growth of the global market during the forecast period.
Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.
https://creativecommons.org/publicdomain/zero/1.0/
Hello! Welcome to the Capstone project I have completed to earn my Data Analytics certificate through Google. I chose to complete this case study in RStudio Desktop, because R is the primary new concept I learned throughout this course and I wanted to embrace my curiosity and learn more about it through this project. At the beginning of this report I will provide the scenario of the case study I was given. After this I will walk you through my data analysis process based on the steps I learned in this course: Ask, Prepare, Process, Analyze, Share, and Act.
The data I used for this analysis comes from this FitBit data set: https://www.kaggle.com/datasets/arashnic/fitbit
" This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Scientific investigation is of value only insofar as relevant results are obtained and communicated, a task that requires organizing, evaluating, analysing and unambiguously communicating the significance of data. In this context, working with ecological data, reflecting the complexities and interactions of the natural world, can be a challenge. Recent innovations for statistical analysis of multifaceted interrelated data make obtaining more accurate and meaningful results possible, but key decisions of the analyses to use, and which components to present in a scientific paper or report, may be overwhelming. We offer a 10-step protocol to streamline analysis of data that will enhance understanding of the data, the statistical models and the results, and optimize communication with the reader with respect to both the procedure and the outcomes. The protocol takes the investigator from study design and organization of data (formulating relevant questions, visualizing data collection, data exploration, identifying dependency), through conducting analysis (presenting, fitting and validating the model) and presenting output (numerically and visually), to extending the model via simulation. Each step includes procedures to clarify aspects of the data that affect statistical analysis, as well as guidelines for written presentation. Steps are illustrated with examples using data from the literature. Following this protocol will reduce the organization, analysis and presentation of what may be an overwhelming information avalanche into sequential and, more to the point, manageable, steps. It provides guidelines for selecting optimal statistical tools to assess data relevance and significance, for choosing aspects of the analysis to include in a published report and for clearly communicating information.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Prior to statistical analysis of mass spectrometry (MS) data, quality control (QC) of the identified biomolecule peak intensities is imperative for reducing process-based sources of variation and extreme biological outliers. Without this step, statistical results can be biased. Additionally, liquid chromatography–MS proteomics data present inherent challenges due to large amounts of missing data that require special consideration during statistical analysis. While a number of R packages exist to address these challenges individually, there is no single R package that addresses all of them. We present pmartR, an open-source R package, for QC (filtering and normalization), exploratory data analysis (EDA), visualization, and statistical analysis robust to missing data. Example analysis using proteomics data from a mouse study comparing smoke exposure to control demonstrates the core functionality of the package and highlights the capabilities for handling missing data. In particular, using a combined quantitative and qualitative statistical test, 19 proteins whose statistical significance would have been missed by a quantitative test alone were identified. The pmartR package provides a single software tool for QC, EDA, and statistical comparisons of MS data that is robust to missing data and includes numerous visualization capabilities.
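pmartR itself is an R package; the following Python sketch is only a generic illustration of the kind of QC it describes (filtering sparsely observed peptides, then median-normalizing per sample) and does not use pmartR's API. The data and thresholds are invented.

```python
# Generic illustration of QC steps of the kind described above (not pmartR's API):
# filter out peptides observed in too few samples, then median-normalize per sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical log-scale intensity matrix: 6 samples x 8 peptides, with missing values.
data = pd.DataFrame(rng.normal(20, 2, size=(6, 8)),
                    columns=[f"pep_{i}" for i in range(8)])
data[data < 18] = np.nan  # crude stand-in for low-abundance values missing not at random

# Filter: keep peptides quantified in at least half of the samples.
keep = data.notna().sum() >= data.shape[0] / 2
filtered = data.loc[:, keep]

# Normalize: subtract each sample's median so sample-level shifts are removed.
normalized = filtered.sub(filtered.median(axis=1, skipna=True), axis=0)
print(normalized.round(2))
```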
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data analysis can be accurate and reliable only if the underlying assumptions of the used statistical method are validated. Any violations of these assumptions can change the outcomes and conclusions of the analysis. In this study, we developed Smart Data Analysis V2 (SDA-V2), an interactive and user-friendly web application, to assist users with limited statistical knowledge in data analysis, and it can be freely accessed at https://jularatchumnaul.shinyapps.io/SDA-V2/. SDA-V2 automatically explores and visualizes data, examines the underlying assumptions associated with the parametric test, and selects an appropriate statistical method for the given data. Furthermore, SDA-V2 can assess the quality of research instruments and determine the minimum sample size required for a meaningful study. However, while SDA-V2 is a valuable tool for simplifying statistical analysis, it does not replace the need for a fundamental understanding of statistical principles. Researchers are encouraged to combine their expertise with the software’s capabilities to achieve the most accurate and credible results.
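SDA-V2 is a Shiny web application, so the sketch below is not its code; it is a SciPy illustration of the general decision logic described (check normality and variance homogeneity, then choose a parametric or non-parametric two-sample test). The 0.05 cut-offs and synthetic data are assumptions.

```python
# Illustrative decision logic of the kind SDA-V2 automates (not its actual code):
# test the assumptions first, then choose a two-sample test accordingly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.lognormal(2.3, 0.4, size=30)   # deliberately skewed

normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05
equal_var = stats.levene(group_a, group_b).pvalue > 0.05

if normal_a and normal_b:
    # Parametric: Student's or Welch's t-test depending on variance homogeneity.
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
    chosen = "t-test" if equal_var else "Welch's t-test"
else:
    # Non-parametric fallback when normality is doubtful.
    result = stats.mannwhitneyu(group_a, group_b)
    chosen = "Mann-Whitney U"

print(f"{chosen}: statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
```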
https://www.verifiedmarketresearch.com/privacy-policy/
Global Manufacturing Analytics Market size was valued at USD 10.44 Billion in 2024 and is projected to reach USD 44.76 Billion by 2031, growing at a CAGR of 22.01% from 2024 to 2031.
Global Manufacturing Analytics Market Drivers
Growing Adoption of Industrial Internet of Things (IIoT): As more sensors and connected devices are used in manufacturing processes, massive volumes of data are generated. This increases the demand for analytics solutions in order to extract useful insights from the data.
Demand for Operational Efficiency: In order to increase output, cut expenses, and minimize downtime, manufacturers strive to improve their operations. Real-time operational data analysis is made possible by analytics systems, which promote proactive decision-making and process enhancements.
Growing Complexity in Production Processes: With numerous steps, variables, and dependencies, modern production processes are getting more and more complicated. These intricate processes can be analyzed and optimized with the help of analytics technologies to increase productivity and quality.
Emphasis on Predictive Maintenance: To reduce downtime and prevent equipment breakdowns, manufacturers are implementing predictive maintenance procedures. By using machine learning algorithms to evaluate equipment data and forecast maintenance requirements, manufacturing analytics systems can optimize maintenance schedules and minimize unscheduled downtime.
Quality Control and Compliance Requirements: The use of analytics solutions in manufacturing is influenced by strict quality control guidelines and legal compliance obligations. Manufacturers may ensure compliance with quality standards and laws by using these technologies to monitor and evaluate product quality metrics in real-time.
Demand for Supply Chain Optimization: In an effort to increase productivity, save expenses, and boost customer happiness, manufacturers are putting more and more emphasis on supply chain optimization. Analytics tools give manufacturers insight into the workings of their supply chains, allowing them to spot bottlenecks, maximize inventory, and enhance logistical procedures.
Technological Developments in Big Data and Analytics: Advances in machine learning, artificial intelligence, and big data analytics are driving innovation in analytics solutions. Thanks to these developments, manufacturers can now analyze massive amounts of data in real time, derive insights that can be put into practice, and improve their operations continuously.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Main steps in the thermal refurbishment process ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/719932f5-e477-4f19-9219-716609a540d7 on 13 January 2022.
--- Dataset description provided by original source is as follows ---
Main steps in the thermal refurbishment process
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The folder named “submission” contains the following:

- ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment, and ensure you activate the environment before running the code.
- pythonProject: This folder contains several .py files and subfolders, each with specific functionality as described below. Among the outputs are a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events, and .csv files.
- overlapping_sliding_window_loop.py: The function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out; if you want to see visually what has been changed compared to the original data, you can uncomment this line. The outputs are written to .csv files in the results folder.

This part contains three main code blocks:

iii. One for the XGBoost code with correct hyperparameter tuning.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

A .csv file containing inferred labels is also produced. The data is licensed under CC-BY; the code is licensed under MIT.
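For readers unfamiliar with the approach in the third block, here is a generic, hypothetical sketch of XGBoost hyperparameter tuning and thresholded scoring; it is not the authors' code, and the grid values and 0.8/0.2 thresholds are illustrative assumptions.

```python
# Illustrative sketch only: a generic XGBoost hyperparameter search and a fixed
# confidence threshold. This is NOT the repository's code; the data, grid values,
# and thresholds are assumptions for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

# Scores for unseen data, analogous to the classification scores described above.
scores = search.best_estimator_.predict_proba(X_test)[:, 1]

# Apply a confidence threshold (hypothetical values; the paper derives its own empirically).
confident_labels = np.where(scores >= 0.8, 1, np.where(scores <= 0.2, 0, -1))
print(search.best_params_, (confident_labels != -1).mean())
```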
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Recent technological advances have made it possible to carry out high-throughput metabonomics studies using gas chromatography coupled with time-of-flight mass spectrometry. Large volumes of data are produced from these studies and there is a pressing need for algorithms that can efficiently process and analyze data in a high-throughput fashion as well. We present an Automated Data Analysis Pipeline (ADAP) that has been developed for this purpose. ADAP consists of peak detection, deconvolution, peak alignment, and library search. It allows data to flow seamlessly through the analysis steps without any human intervention and features two novel algorithms in the analysis. Specifically, clustering is successfully applied in deconvolution to resolve coeluting compounds that are very common in complex samples and a two-phase alignment process has been implemented to enhance alignment accuracy. ADAP is written in standard C++ and R and uses parallel computing via Message Passing Interface for fast peak detection and deconvolution. ADAP has been applied to analyze both mixed standards samples and serum samples and identified and quantified metabolites successfully. ADAP is available at http://www.du-lab.org.
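ADAP is implemented in C++ and R; the snippet below is only a generic SciPy illustration of the peak-detection step on a synthetic chromatogram, not ADAP's algorithm.

```python
# Generic illustration of the peak-detection step in a chromatographic pipeline
# (not ADAP's implementation): find peaks in a synthetic total-ion chromatogram.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(7)
time = np.linspace(0, 10, 2000)                      # retention time, minutes
signal = (1.0 * np.exp(-((time - 2.0) ** 2) / 0.01)  # three Gaussian "compounds"
          + 0.6 * np.exp(-((time - 5.5) ** 2) / 0.02)
          + 0.8 * np.exp(-((time - 7.3) ** 2) / 0.015)
          + rng.normal(0, 0.02, time.size))          # detector noise

# Require a minimum height and prominence so noise is not reported as a peak.
peaks, props = find_peaks(signal, height=0.3, prominence=0.2)
print("Apex retention times (min):", np.round(time[peaks], 2))
```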
The role of Data Science and AI for predicting the decline of professionals in the recruitment process: augmenting decision-making in human resources management.

Feature descriptions:
- Declined: Variable to be predicted, where value 0 means that the candidate continued in the recruitment process until hiring, and value 1 means the candidate declined during the recruitment process.
- ValueClient: The total amount the customer plans to pay the hired candidate. The value 0 means that the client has not yet defined a value to pay the candidate. Values must be greater than or equal to 0.
- ExtraCost: Extra cost the customer has to pay to hire the candidate. Values must be greater than or equal to 0.
- ValueResources: Value requested by the candidate to work. The value 0 means that the candidate has not yet requested a salary amount and this value will be negotiated later. Values must be greater than or equal to 0.
- Net: The difference between “ValueClient”, yearly taxes and “ValueResources”. Negative values mean that the amount the client plans to pay the candidate has not yet been defined and is still open for negotiation.
- DaysOnContact: Number of days that the candidate is in the “Contact” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnInterview: Number of days that the candidate is in the “Interview” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnSendCV: Number of days that the candidate is in the “Send CV” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnReturn: Number of days that the candidate is in the “Return” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnCSchedule: Number of days that the candidate is in the “C. Schedule” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnCRealized: Number of days that the candidate is in the “C. Realized” step of the recruitment process. Values must be greater than or equal to 0.
- ProcessDuration: Duration of the entire recruitment process in days. Values must be greater than or equal to 0.
The U.S. Geological Survey (USGS) Water Resources Mission Area (WMA) is working to address a need to understand where the Nation is experiencing water shortages or surpluses relative to the demand for water by delivering routine assessments of water supply and demand and an understanding of the natural and human factors affecting the balance between supply and demand. A key part of these national assessments is identifying long-term trends in water availability, including groundwater and surface water quantity, quality, and use. This data release contains Mann-Kendall monotonic trend analyses for 18 observed annual and monthly streamflow metrics at 6,347 U.S. Geological Survey streamgages located in the conterminous United States, Alaska, Hawaii, and Puerto Rico. Streamflow metrics include annual mean flow, maximum 1-day and 7-day flows, minimum 7-day and 30-day flows, and the date of the center of volume (the date on which 50% of the annual flow has passed by a gage), along with the mean flow for each month of the year. Annual streamflow metrics are computed from mean daily discharge records at U.S. Geological Survey streamgages that are publicly available from the National Water Information System (NWIS). Trend analyses are computed using annual streamflow metrics computed through climate year 2022 (April 2022 - March 2023) for low-flow metrics and water year 2022 (October 2021 - September 2022) for all other metrics. Trends at each site are available for up to four different periods: (i) the longest possible period that meets completeness criteria at each site, (ii) 1980-2020, (iii) 1990-2020, (iv) 2000-2020. Annual metric time series analyzed for trends must have 80 percent complete records during fixed periods. In addition, each of these time series must have 80 percent complete records during their first and last decades. All longest possible period time series must be at least 10 years long and have annual metric values for at least 80% of the years running from 2013 to 2022. This data release provides the following five CSV output files along with a model archive: (1) streamflow_trend_results.csv - contains test results of all trend analyses with each row representing one unique combination of (i) NWIS streamgage identifiers, (ii) metric (computed using Oct 1 - Sep 30 water years except for low-flow metrics computed using climate years (Apr 1 - Mar 31)), (iii) trend periods of interest (longest possible period through 2022, 1980-2020, 1990-2020, 2000-2020) and (iv) records containing either the full trend period or only a portion of the trend period following substantial increases in cumulative upstream reservoir storage capacity. This is an output from the final process step (#5) of the workflow. (2) streamflow_trend_trajectories_with_confidence_bands.csv - contains annual trend trajectories estimated using Theil-Sen regression, which estimates the median of the probability distribution of a metric for a given year, along with 90 percent confidence intervals (5th and 95th percentile values). This is an output from the final process step (#5) of the workflow. (3) streamflow_trend_screening_all_steps.csv - contains the screening results of all 7,873 streamgages initially considered as candidate sites for trend analysis and identifies the screens that prevented some sites from being included in the Mann-Kendall trend analysis. (4) all_site_year_metrics.csv - contains annual time series values of streamflow metrics computed from mean daily discharge data at 7,873 candidate sites.
This is an output of Process Step 1 in the workflow. (5) all_site_year_filters.csv - contains information about the completeness and quality of daily mean discharge at each streamgage during each year (water year, climate year, and calendar year). This is also an output of Process Step 1 in the workflow and is combined with all_site_year_metrics.csv in Process Step 2. In addition, a .zip file contains a model archive for reproducing the trend results using R 4.4.1 statistical software. See the README file contained in the model archive for more information. Caution must be exercised when utilizing monotonic trend analyses conducted over periods of up to several decades (and in some places longer ones) due to the potential for confounding deterministic gradual trends with multi-decadal climatic fluctuations. In addition, trend results are available for post-reservoir construction periods within the four trend periods described above to avoid including abrupt changes arising from the construction of larger reservoirs in periods for which gradual monotonic trends are computed. Other abrupt changes, such as changes to water withdrawals and wastewater return flows, or episodic disturbances with multi-year recovery periods, such as wildfires, are not evaluated. Sites with pronounced abrupt changes or other non-monotonic trajectories of change may require more sophisticated trend analyses than those presented in this data release.
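As a minimal illustration of the trend statistics named above (and not the USGS model-archive code), the sketch below runs a Mann-Kendall-style test via Kendall's tau against time and a Theil-Sen slope on a synthetic annual flow series.

```python
# Illustrative trend test on a synthetic annual streamflow metric (not the USGS
# model-archive code): Mann-Kendall-style monotonic trend via Kendall's tau
# against time, plus a Theil-Sen slope estimate with a 90% confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
years = np.arange(1980, 2021)
annual_mean_flow = 100 + 0.4 * (years - years[0]) + rng.normal(0, 5, years.size)

tau, p_value = stats.kendalltau(years, annual_mean_flow)
slope, intercept, lo, hi = stats.theilslopes(annual_mean_flow, years, alpha=0.90)

print(f"Kendall tau = {tau:.2f}, p = {p_value:.4f}")
print(f"Theil-Sen slope = {slope:.2f} units/yr (90% CI: {lo:.2f} to {hi:.2f})")
```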
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set was generated in accordance with the semiconductor industry and contains data of certain process flows in Failure Analysis (FA) laboratories focusing on the identification and analysis of anomalies or malfunctions in semiconductor devices. It comprises logistic data about the processing steps for the so-called Internal Physical Inspection (IPI).
A so-called IPI job is given as a sequence of tasks that must be performed to complete the job they belong to. It has an assigned unique ID and timestamps indicating the submission, the end, and the deadline to be met. A job also has an IPI classification assigned to it, providing general guidelines on the operations to be performed.
Every task within a job has its own type and working time, as well as the assigned resources. There are two main resources involved:
- the equipment: the machine used to perform the task;
- the operator: the person who performed the task.
In addition, general information about the type of the device to be analyzed is also available, such as the given (anonymized) package and basictype. Data also include the number of stressed samples within a device and the samples a task is performed on.
The dataset includes data from 4 years, specifically from January 2020 to December 2022.
Finally, the exact column structure is given as follows (python 3.9.5 datatype):
The STEP (Skills Toward Employment and Productivity) Measurement program is the first-ever initiative to generate internationally comparable data on skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills and the demand for skills in the labor markets of low-income countries.
The uniquely designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adult Skills instruments. Modules also gather information about family, health and language.
13 major metropolitan areas: Bogota, Medellin, Cali, Barranquilla, Bucaramanga, Cucuta, Cartagena, Pasto, Ibague, Pereira, Manizales, Monteria, and Villavicencio.
The units of analysis are the individual respondents and households. A household roster is undertaken at the start of the survey and the individual respondent is randomly selected among all household members aged 15 to 64, inclusive. The random selection process was designed by the STEP team and compliance with the procedure is carefully monitored during fieldwork.
The target population for the Colombia STEP survey is all non-institutionalized persons 15 to 64 years old (inclusive) living in private dwellings in urban areas of the country at the time of data collection. This includes all residents except foreign diplomats and non-nationals working for international organizations.
The following groups are excluded from the sample: - residents of institutions (prisons, hospitals, etc.) - residents of senior homes and hospices - residents of other group dwellings such as college dormitories, halfway homes, workers' quarters, etc. - persons living outside the country at the time of data collection.
Sample survey data [ssd]
A stratified 7-stage sample design was used in Colombia. The stratification variable is city-size category.
First Stage Sample The primary sample unit (PSU) is a metropolitan area. A sample of 9 metropolitan areas was selected from the 13 metropolitan areas on the sample frame. The metropolitan areas were grouped according to city-size; the five largest metropolitan areas are included in Stratum 1 and the remaining 8 metropolitan areas are included in Stratum 2. The five metropolitan areas in Stratum 1 were selected with certainty; in Stratum 2, four metropolitan areas were selected with probability proportional to size (PPS), where the measure of size was the number of persons aged 15 to 64 in a metropolitan area.
Second Stage Sample The second stage sample unit is a Section. At the second stage of sample selection, a PPS sample of 267 Sections was selected from the sampled metropolitan areas; the measure of size was the number of persons aged 15 to 64 in a Section. The sample of 267 Sections consisted of 243 initial Sections and 24 reserve Sections to be used in the event of complete non-response at the Section level.
Third Stage Sample The third stage sample unit is a Block. Within each selected Section, a PPS sample of 4 blocks was selected; the measure of size was the number of persons aged 15 to 64 in a Block. Two sample Blocks were initially activated while the remaining two sample Blocks were reserved for use in cases where there was a refusal to cooperate at the Block level or cases where the block did not belong to the target population (e.g., parks, and commercial and industrial areas).
Fourth Stage Sample The fourth stage sample unit is a Block Segment. Regarding the Block segmentation strategy, the Colombia document 'FINAL SAMPLING PLAN (ARD-397)' states "According to the 2005 population and housing census conducted by DANE, the average number of dwellings per block in the 13 large cities or metropolitan areas was approximately 42 dwellings. Based on this finding, the defined protocol was to report those cases in which 80 or more dwellings were present in a given block in order to partition block using a random selection algorithm." At the fourth stage of sample selection, 1 Block Segment was selected in each selected Block using a simple random sample (SRS) method.
Fifth Stage Sample The fifth stage sample unit is a dwelling. At the fifth stage of sample selection, 5582 dwellings were selected from the sampled Blocks/Block Segments using a simple random sample (SRS) method. According to the Colombia document 'FINAL SAMPLING PLAN (ARD-397)', the selection of dwellings within a participant Block "was performed differentially amongst the different socioeconomic strata that the Colombian government uses for the generation of cross-subsidies for public utilities (in this case, the socioeconomic stratum used for the electricity bill was used). Given that it is known from previous survey implementations that refusal rates are highest amongst households of higher socioeconomic status, the number of dwellings to be selected increased with the socioeconomic stratum (1 being the poorest and 6 being the richest) that was most prevalent in a given block".
Sixth Stage Sample The sixth stage sample unit is a household. At the sixth stage of sample selection, one household was selected in each selected dwelling using an SRS method.
Seventh Stage Sample The seventh stage sample unit was an individual aged 15-64 (inclusive). The sampling objective was to select one individual with equal probability from each selected household.
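A toy sketch of PPS (probability proportional to size) selection, as used in the early stages above, is given below; the section names and counts are invented, and real STEP sampling uses systematic PPS with reserve units, which this simplification omits.

```python
# Toy illustration of PPS (probability proportional to size) selection, as used in
# the early sampling stages above. Section names and population counts are made up;
# the actual survey uses systematic PPS and reserve units, omitted here.
import numpy as np

rng = np.random.default_rng(11)
sections = [f"Section_{i:03d}" for i in range(1, 41)]
pop_15_64 = rng.integers(200, 5000, size=len(sections))   # measure of size per Section

probs = pop_15_64 / pop_15_64.sum()                        # selection probability ∝ size
sampled = rng.choice(sections, size=8, replace=False, p=probs)
print(sorted(sampled))
```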
Sampling methodologies are described for each country in two documents and are provided as external resources: (i) the National Survey Design Planning Report (NSDPR) (ii) the weighting documentation (available for all countries)
Face-to-face [f2f]
The STEP survey instruments include: (i) a Background Questionnaire developed by the WB STEP team and (ii) a Reading Literacy Assessment developed by Educational Testing Services (ETS).
All countries adapted and translated both instruments following the STEP technical standards: two independent translators adapted and translated the STEP background questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator.
The survey instruments were piloted as part of the survey pre-test.
The background questionnaire covers such topics as respondents' demographic characteristics, dwelling characteristics, education and training, health, employment, job skill requirements, personality, behavior and preferences, language and family background.
The background questionnaire, the structure of the Reading Literacy Assessment and Reading Literacy Data Codebook are provided in the document "Colombia STEP Skills Measurement Survey Instruments", available in external resources.
STEP data management process:
1) Raw data is sent by the survey firm.
2) The World Bank (WB) STEP team runs data checks on the background questionnaire data. Educational Testing Services (ETS) runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
3) The survey firm reviews comments and questions. When a data entry error is identified, the survey firm corrects the data.
4) The WB STEP team and ETS check if the data files are clean. This might require additional iterations with the survey firm.
5) Once the data has been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
6) ETS scales the Reading Literacy Assessment data.
7) The WB STEP team merges the background questionnaire data with the Reading Literacy Assessment data and computes derived variables.
Detailed information on data processing in STEP surveys is provided in "STEP Guidelines for Data Processing", available in external resources. The template do-file used by the STEP team to check raw background questionnaire data is provided as an external resource, too.
An overall response rate of 48% was achieved in the Colombia STEP Survey.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
The datasets contain the previous 12 months of Cyclistic trip data. The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer business questions.
This data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
Research question: How do annual members and casual riders use Cyclistic bikes differently?
The objective of the fourth Technical Meeting on Fusion Data Processing, Validation and Analysis was to provide a platform during which a set of topics relevant to fusion data processing, validation and analysis are discussed with the view of extrapolating needs to next-step fusion devices such as ITER. The validation and analysis of experimental data obtained from diagnostics used to characterize fusion plasmas are crucial for a knowledge-based understanding of the physical processes governing the dynamics of these plasmas. This paper presents the recent progress and achievements in the domain of plasma diagnostics and synthetic diagnostics data analysis (including image processing, regression analysis, inverse problems, deep learning, machine learning, big data and physics-based models for control) reported at the meeting. The progress in these areas highlights trends observed in current major fusion confinement devices. A special focus is dedicated to data analysis requirements for ITER and DEMO, with particular attention paid to Artificial Intelligence for automatization and improving the reliability of control processes.
The STEP (Skills Toward Employment and Productivity) Measurement program is the first-ever initiative to generate internationally comparable data on skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills and the demand for skills in the labor markets of low-income countries.
The uniquely designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adult Skills instruments. Modules also gather information about family, health and language.
The survey covers the urban areas of the two largest cities of Vietnam, Ha Noi and Ho Chi Minh City (HCMC).
The units of analysis are the individual respondents and households. A household roster is undertaken at the start of the survey and the individual respondent is randomly selected among all household members aged 15 to 64, inclusive. The random selection process was designed by the STEP team and compliance with the procedure is carefully monitored during fieldwork.
The STEP target population is the population aged 15 to 64, inclusive, living in urban areas, as defined by each country's statistical office. In Vietnam, the target population comprised all people aged 15 to 64 living in urban areas in Ha Noi and Ho Chi Minh City (HCMC).
The reasons for selecting these two cities include:
(i) they are the two biggest cities of Vietnam, so they have all the urban characteristics needed for the STEP study, and (ii) it is less costly to conduct the STEP survey in these two cities, compared to all urban areas of Vietnam, given the limitations of the survey budget.
The following are excluded from the sample:
Sample survey data [ssd]
The sample frame includes the list of urban EAs and the count of households for each EA. Changes to the EA list and household list would impact the coverage of the sample frame. In a recent review of Ha Noi, only 3 EAs out of 140 randomly selected EAs (2%) were either new or destroyed. GSO would increase the coverage of the sample frame (>95% as standard) by updating the household list of the selected EAs before selecting households for STEP.
A detailed description of the sample design is available in section 4 of the NSDPR provided with the metadata. On completion of the household listing operation, GSO will deliver to the World Bank a copy of the lists, and an Excel spreadsheet with the total number of households listed in each of the 227 visited PSUs.
Face-to-face [f2f]
The STEP survey instruments include: (i) a Background Questionnaire developed by the WB STEP team (ii) a Reading Literacy Assessment developed by Educational Testing Services (ETS).
All countries adapted and translated both instruments following the STEP Technical Standards: two independent translators adapted and translated the Background Questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator. The WB STEP team and ETS collaborated closely with the survey firms during the process and reviewed the adaptation and translation to Vietnamese (using a back translation).
- The survey instruments were both piloted as part of the survey pretest.
- The adapted Background Questionnaires are provided in English as external resources. The Reading Literacy Assessment is protected by copyright and will not be published.
STEP Data Management Process:
1. Raw data is sent by the survey firm.
2. The WB STEP team runs data checks on the Background Questionnaire data. ETS runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
3. The survey firm reviews comments and questions. When a data entry error is identified, the survey firm corrects the data.
4. The WB STEP team and ETS check the data files are clean. This might require additional iterations with the survey firm.
5. Once the data has been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
6. ETS scales the Reading Literacy Assessment data.
7. The WB STEP team merges the Background Questionnaire data with the Reading Literacy Assessment data and computes derived variables.
Detailed information on data processing in STEP surveys is provided in the 'Guidelines for STEP Data Entry Programs' document provided as an external resource. The template do-file used by the STEP team to check the raw background questionnaire data is provided as an external resource.
The response rate for Vietnam (urban) was 62%. (See STEP Methodology Note Table 4).
Weighting documentation was prepared for each participating country and provides some information on sampling errors. The weighting documentation for each country is provided as an external resource.
The National Dynamic Land Cover Dataset (DLCD) classifies Australian land cover into 34 categories, which conform to the 2007 International Standards Organisation (ISO) Land Cover Standard (19144-2). The DLCD has been developed by Geoscience Australia and the Australian Bureau of Agricultural and Resource Economics and Sciences (ABARES), aiming to provide nationally consistent land cover information to federal and state governments and the general public. This paper describes the machine learning techniques and statistical modeling methods developed to generate the DLCD from earth observation data. MODIS (Moderate Resolution Imaging Spectroradiometer) 250 m EVI (Enhanced Vegetation Index) time series data from year 2000 to year 2008 is the main data source for the modeling process, which consists of three steps. In the first step, noisy and invalid data points are removed from the time series. Secondly, a feature extraction algorithm converts each time series into a set of 12 time series coefficients related to ground phenomena such as average greenness and plant phenology. At the last step, clustering processes based on a tailored support vector clustering algorithm are applied to subsets of the coefficients. The resultant clusters then form the basis of a further modeling process incorporating auxiliary data to generate the final DLCD.
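As a simplified, hypothetical illustration of the three-step flow described above (not Geoscience Australia's code), the sketch below cleans a synthetic EVI series, extracts a few phenology-like coefficients, and clusters them; KMeans stands in for the tailored support vector clustering used in the actual product.

```python
# Simplified illustration of the 3-step DLCD modelling flow described above
# (not Geoscience Australia's code): clean an EVI time series, extract a few
# phenology-like coefficients, then cluster pixels. KMeans is only a stand-in
# for the tailored support vector clustering algorithm.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_pixels, n_obs = 500, 207          # roughly 9 years of 16-day MODIS EVI composites
t = np.arange(n_obs)

# Synthetic EVI series: seasonal cycle + per-pixel greenness + noise and dropouts.
amplitude = rng.uniform(0.05, 0.35, (n_pixels, 1))
baseline = rng.uniform(0.15, 0.55, (n_pixels, 1))
evi = baseline + amplitude * np.sin(2 * np.pi * t / 23) + rng.normal(0, 0.02, (n_pixels, n_obs))
evi[rng.random(evi.shape) < 0.05] = np.nan          # invalid or noisy observations

# Step 1: remove invalid points (here: a simple gap-fill with the per-pixel median).
evi_clean = np.where(np.isnan(evi), np.nanmedian(evi, axis=1, keepdims=True), evi)

# Step 2: extract simple coefficients (stand-ins for the 12 used by the DLCD).
features = np.column_stack([
    evi_clean.mean(axis=1),                          # average greenness
    evi_clean.max(axis=1) - evi_clean.min(axis=1),   # seasonal amplitude
    evi_clean.std(axis=1),                           # variability
])

# Step 3: cluster the coefficient space into candidate land-cover groups.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```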
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
WWF developed a global analysis of the world's most important deforestation areas or deforestation fronts in 2015. This assessment was revised in 2020 as part of the WWF Deforestation Fronts Report.

Emerging Hotspots analysis
The goal of this analysis was to assess the presence of deforestation fronts: areas where deforestation is significantly increasing and is threatening remaining forests. We selected the emerging hotspots analysis to assess spatio-temporal trends of deforestation in the pan-tropics.

Spatial Unit
We selected hexagons as the spatial unit for the hotspots analysis for several reasons. They have a low perimeter-to-area ratio, straightforward neighbor relationships, and reduced distortion due to the curvature of the earth. For the hexagon size we decided on a unit of 1,000 ha; given the resolution of the deforestation data (250 m), this meant that we could aggregate several deforestation events inside units over time. Hexagons that are close to or equal to the size of a deforestation event mean there could be only one event before the forest is gone, which limits statistical analysis. We processed over 13 million hexagons for this analysis and limited the emerging hotspots analysis to hexagons with at least 15% forest cover remaining (from the all-evidence forest map). This prevented including hotspots in agricultural areas or areas where all forest has been converted.

Outputs
This analysis uses the Getis-Ord and Mann-Kendall statistics to identify spatial clusters of deforestation which have a non-parametric significant trend across a time series. The spatial clusters are defined by the spatial unit and a temporal neighborhood parameter. We use a neighborhood parameter of 5 km to include spatial neighbors in the hotspots assessment and time slices for each country described below. Deforestation events are summarized by the spatial unit (hexagons described above) and the results comprise a trends assessment, which defines increasing or decreasing deforestation in the units at three different confidence intervals (90%, 95% and 99%), and the spatio-temporal analysis classifying areas into eight unique hot or cold spot categories. Our analysis identified seven hotspot categories:
- New: A location with a statistically significant increasing hotspot only in the final time step.
- Consecutive: An uninterrupted run of statistically significant hotspots in the final time steps.
- Intensifying: A statistically significant hotspot for >90% of the bins, including the final time step.
- Persistent: A statistically significant hotspot for >90% of the bins with no upward or downward trend in clustering intensity.
- Diminishing: A statistically significant hotspot for >90% of the time steps, where the clustering is decreasing or the most recent time step is not hot.
- Sporadic: An on-again, off-again hotspot where <90% of the time-step intervals have been statistically significant hot spots and none have been statistically significant cold spots.
- Historical: At least ninety percent of the time-step intervals have been statistically significant hot spots, with the exception of the final time steps.

For the evaluation of spatio-temporal trends of tropical deforestation we selected the Terra-i deforestation dataset to define the temporal deforestation patterns. Terra-i is a freely available monitoring system derived from the analysis of MODIS (NDVI) and TRMM (rainfall) data, which are used to assess forest cover changes due to anthropic interventions at a 250 m resolution [ref]. It was first developed for Latin American countries in 2012, and then expanded to pan-tropical countries around the world. Terra-i has generated maps of vegetation loss every 16 days since January 2004. This relatively high temporal resolution of twice-monthly observations allows for a more detailed emerging hotspots analysis, increasing the number of time steps or bins available for assessing spatio-temporal patterns relative to annual datasets. Next, the spatial resolution of 250 m is more relevant for detecting forest loss than changes in individual tree cover or canopies and is better adapted to processing trends on large scales. Finally, the added value of the Terra-i algorithm is that it employs an additional neural-network machine learning step to identify vegetation loss that is due to anthropic causes as opposed to natural events or other causes. Our dataset comprised all Terra-i deforestation events observed between 2004 and 2017.

Temporal unit
The temporal unit or time slice was selected for each country according to the distribution of data. The deforestation data comprised 16-day periods between 2004 and 2017, for a total of 312 potential observation time periods. These were aggregated to time bins to overcome any seasonality in the detection of deforestation events (due to clouds). The temporal unit is combined with the spatial parameter (i.e. 5 km) to create the space-time bins for hotspot analysis. For dense time series or countries with a lot of deforestation events (i.e. Brazil), a smaller time slice was used (i.e. 3 months, n=54) with a neighborhood interval of 8 months, meaning that the previous year and next year together were combined to assess statistical trends. The rule we employed was that the time slice multiplied by the neighborhood interval was equal to 24 months, or 2 years, in order to look at general trends over the entire time period and prevent the hotspots analysis from being biased toward short time intervals of a few months.

Deforestation Fronts
Finally, using trends and hotspots we identify 24 major deforestation fronts: areas of significantly increasing deforestation and the focus of WWF's call for action to slow deforestation.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically