https://creativecommons.org/publicdomain/zero/1.0/
9. Plot the decision tree
Average customer churn is 27%. Churn can occur when tenure is >= 7.5 and there is no internet service.
The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
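A minimal R sketch of this step (not the notebook's own code), assuming the churn data has already been split into train and test data frames with a factor Churn column:

library(rpart)
library(rpart.plot)
# Fit and plot a classification tree for churn
tree_model <- rpart(Churn ~ ., data = train, method = "class")
rpart.plot(tree_model)            # each node shows the predicted class and churn proportion
tree_model$variable.importance    # ranks predictors such as internet service and tenure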
Run library(randomForest). Here we use the default ntree (500) and the default mtry (floor(sqrt(p)) for classification, where p is the number of independent variables).
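A hedged sketch of the default fit described here, using the same assumed train data frame:

library(randomForest)
set.seed(123)
# Default fit: ntree = 500 and, for classification, mtry = floor(sqrt(p))
rf_default <- randomForest(Churn ~ ., data = train)
print(rf_default)   # shows the OOB error rate and the class-wise confusion matrix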
From the confusion matrix, the accuracy is 79.27%, marginally higher than the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".
Plot the model to show which variables reduce the Gini impurity the most and the least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
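A hedged sketch of how these two checks are typically produced in R (not the notebook's own code), assuming the held-out test data frame and the caret package for the confusion matrix:

library(caret)
pred <- predict(rf_default, newdata = test)
confusionMatrix(pred, test$Churn)   # overall accuracy and per-class error rates
varImpPlot(rf_default)              # MeanDecreaseGini: which variables reduce Gini impurity most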
Tune the model: mtry = 2 has the lowest OOB error rate.
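A hedged sketch of the tuning step using randomForest::tuneRF to search mtry by OOB error (predictor and response columns assumed as before):

set.seed(123)
# Search mtry values by OOB error; stepFactor controls how mtry is changed each step
tuned <- tuneRF(x = subset(train, select = -Churn), y = train$Churn,
                stepFactor = 1.5, improve = 0.01, ntreeTry = 200, trace = TRUE)
tuned   # matrix of mtry values and their OOB errors; per the text, mtry = 2 was lowest here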
Use random forest with mtry = 2 and ntree = 200
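A hedged sketch of the refit with the tuned settings (same assumed train/test split):

set.seed(123)
rf_tuned <- randomForest(Churn ~ ., data = train, mtry = 2, ntree = 200)
pred_tuned <- predict(rf_tuned, newdata = test)
confusionMatrix(pred_tuned, test$Churn)   # compare accuracy against the default fit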
From the confusion matrix, the accuracy is 79.71%, marginally higher than the default model's 79.27% (ntree = 500, mtry = 4) and the decision tree's 79.00%. The error rate remains quite low when predicting "No" and much higher when predicting "Yes".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicting the execution time of model transformations can help to understand how a transformation reacts to a given input model without creating and transforming the respective model.
In our previous data set (https://doi.org/10.5281/zenodo.8385957), we have documented our experiments in which we predict the performance of ATL transformations using predictive models obtained from training linear regression, random forest and support vector regression. As input for the prediction, our approach uses a characterization of the input model. In these experiments, we only used data from real models.
However, a common problem is that transformation developers do not have enough models available to use such a prediction approach. Therefore, in a new variant of our experiments, we investigated whether the three considered machine learning approaches can predict the performance of transformations if we use data from generated models for training. We also investigated whether it is possible to achieve good predictions with smaller training data. The dataset provided here offers the corresponding raw data, scripts, and results.
Detailed documentation is available in documentaion.pdf.
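A hedged R sketch of the prediction setup described above (ours, not the experiment scripts); the data frame runs and its feature columns are hypothetical placeholders for the input-model characterization:

library(randomForest)
set.seed(42)
# Regression forest: execution time as a function of input-model characteristics
rf_time <- randomForest(exec_ms ~ n_elements + n_references + depth, data = runs)
predict(rf_time, newdata = data.frame(n_elements = 5000, n_references = 12000, depth = 7))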
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer
This is the first release of the Global Ensemble Digital Terrain Model (GEDTM30). Use for testing purposes only. A publication describing the methods used has been submitted to PeerJ and is currently under review. This work was funded by the European Union. However, the views and opinions expressed are solely those of the author(s) and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. The data is provided "as is." The Open-Earth-Monitor project consortium, along with its suppliers and licensors, hereby disclaims all warranties of any kind, express or implied, including, without limitation, warranties of merchantability, fitness for a particular purpose, and non-infringement. Neither the Open-Earth-Monitor project consortium nor its suppliers and licensors make any warranty that the website will be error-free or that access to it will be continuous or uninterrupted. You understand that you download or otherwise obtain content or services from the website at your own discretion and risk.
Description
GEDTM30 is presented as a 1-arc-second (~30 m) global Digital Terrain Model (DTM) generated using machine-learning-based data fusion. It was trained using a global-to-local Random Forest model with ICESat-2 and GEDI data, incorporating almost 30 billion high-quality points. To see the documentation, please visit the GEDTM30 GitHub (https://github.com/openlandmap/GEDTM30). This dataset covers the entire world and can be used for applications such as topography, hydrology, and geomorphometry analysis.
Dataset Contents
This dataset includes:
- GEDTM30: the predicted terrain height.
- Uncertainty of the GEDTM30 prediction: an uncertainty map of the terrain prediction, derived from the standard deviation of individual tree predictions in the Random Forest model.
Due to Zenodo's storage limitations, the original GEDTM30 dataset and its standard deviation map are provided via external links: GEDTM30 30m, Uncertainty of GEDTM30 prediction 30m.
Related Identifiers
- Landform: Slope in Degree, Geomorphons
- Light and Shadow: Positive Openness, Negative Openness, Hillshade
- Curvature: Minimal Curvature, Maximal Curvature, Profile Curvature, Tangential Curvature, Ring Curvature, Shape Index
- Local Topographic Position: Difference from Mean Elevation, Spherical Standard Deviation of the Normals
- Hydrology: Specific Catchment Area, LS Factor, Topographic Wetness Index
Data Details
- Time period: static
- Type of data: Digital Terrain Model
- How the data was collected or derived: machine learning models
- Statistical methods used: Random Forest
- Limitations or exclusions in the data: the dataset does not include data for Antarctica
- Coordinate reference system: EPSG:4326
- Bounding box (Xmin, Ymin, Xmax, Ymax): (-180, -65, 180, 85)
- Spatial resolution: 120 m
- Image size: 360,000 P x 178,219 L
- File format: Cloud Optimized GeoTIFF (COG)
Layer information:
Layer | Scale | Data Type | No Data
Ensemble Digital Terrain Model | 10 | Int32 | -2,147,483,647
Standard Deviation EDTM | 100 | UInt16 | 65,535
Code Availability
The primary development of GEDTM30 is documented in the GEDTM30 GitHub (https://github.com/openlandmap/GEDTM30). The current version (v1) code is compressed and uploaded as GEDTM30-main.zip. To access up-to-date development, please visit the GitHub page.
Support
If you discover a bug, artifact, or inconsistency, or if you have a question, please raise a GitHub issue here.
Naming convention
To ensure consistency and ease of use across and within the projects, we follow the standard Ai4SoilHealth and Open-Earth-Monitor file-naming convention. The convention uses 10 fields that describe important properties of the data, so users can search files and prepare data analyses without needing to open the files. For example, for edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif, the fields are:
- generic variable name: edtm = ensemble digital terrain model
- variable procedure combination: rf = random forest
- position in the probability distribution / variable type: m = mean | sd = standard deviation
- spatial support: 120m
- depth reference: s = surface
- time reference begin time: 20000101 = 2000-01-01
- time reference end time: 20231231 = 2023-12-31
- bounding box: go = global
- EPSG code: EPSG:4326
- version code: v20250130 = version from 2025-01-30
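A minimal R sketch (ours, not part of the repository) that splits a file name following this 10-field convention into named parts:

fname  <- "edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif"
fields <- strsplit(sub("\\.tif$", "", fname), "_")[[1]]   # drop the extension, split on underscores
names(fields) <- c("variable", "procedure", "statistic", "spatial_support",
                   "depth_reference", "time_begin", "time_end",
                   "bounding_box", "epsg_code", "version")
fields["procedure"]   # "rf"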
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data handling is one of the main problems in modelling, particularly if the missingness is of type missing-not-at-random (MNAR), where missingness occurs due to the actual value of the observation. The focus of the current article is generalized linear modelling of fully observed binary response variables depending on at least one MNAR covariate. For the traditional analysis of such models, an individual model for the probability of missingness is assumed and incorporated in the model framework. However, this probability model is untestable, as the missingness of MNAR data depends on the actual values that would otherwise have been observed. In this article, we consider creating a model space that consists of all possible and plausible models for the probability of missingness and develop a hybrid method in which a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm is combined with Bayesian Model Averaging (BMA). RJMCMC is adopted to obtain posterior estimates of model parameters as well as the probability of each model in the model space. BMA is used to synthesize coefficient estimates from all models in the model space while accounting for model uncertainty. Through a validation study with a simulated data set and a real data application, the performance of the proposed methodology is found to be satisfactory in terms of accuracy and efficiency of estimates.
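A minimal R sketch of the BMA synthesis step (ours, not the article's code), assuming the RJMCMC run has already returned posterior model probabilities and per-model posterior means for a coefficient of interest; the numbers are purely hypothetical:

# hypothetical posterior model probabilities p(M_k | y) from RJMCMC
post_model_prob <- c(M1 = 0.55, M2 = 0.30, M3 = 0.15)
# hypothetical posterior means of the same coefficient under each model
beta_by_model <- c(M1 = 0.82, M2 = 0.74, M3 = 0.91)
# BMA point estimate: per-model estimates weighted by posterior model probabilities
beta_bma <- sum(post_model_prob * beta_by_model)
beta_bma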
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many complex diseases are caused by a variety of both genetic and environmental factors acting in conjunction. To help understand these relationships, nonparametric methods that use aggregate learning have been developed such as random forests and conditional forests. Molinaro et al. (2010) described a powerful, single model approach called partDSA that has the advantage of producing interpretable models. We propose two extensions to the partDSA algorithm called bagged partDSA and boosted partDSA. These algorithms achieve higher prediction accuracies than individual partDSA objects through aggregating over a set of partDSA objects. Further, by using partDSA objects in the ensemble, each base learner creates decision rules using both “and” and “or” statements, which allows for natural logical constructs. We also provide four variable ranking techniques that aid in identifying the most important individual factors in the models. In the regression context, we compared bagged partDSA and boosted partDSA to random forests and conditional forests. Using simulated and real data, we found that bagged partDSA had lower prediction error than the other methods if the data were generated by a simple logic model, and that it performed similarly for other generating mechanisms. We also found that boosted partDSA was effective for a particularly complex case. Taken together these results suggest that the new methods are useful additions to the ensemble learning toolbox. We implement these algorithms as part of the partDSA R package. Supplementary materials for this article are available online.
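A generic R sketch of the bagging idea behind these ensembles (ours; it uses rpart as a stand-in base learner for a numeric response and is not the partDSA package's API):

library(rpart)
# Bagging: fit B base learners on bootstrap resamples and average their predictions
bagged_predict <- function(formula, data, newdata, B = 100) {
  preds <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap resample of the training data
    fit <- rpart(formula, data = boot)                  # one base learner per resample
    predict(fit, newdata = newdata)                     # its predictions on new data
  })
  rowMeans(preds)                                       # aggregate by averaging (regression setting)
}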
https://spdx.org/licenses/CC0-1.0.html
Humans discount delayed relative to more immediate reward. A plausible explanation is that impatience arises partly from uncertainty, or risk, implicit in delayed reward. Existing theories of discounting-as-risk focus on a probability that delayed reward will not materialize. By contrast, we examine how uncertainty in the magnitude of delayed reward contributes to delay discounting. We propose a model wherein reward is discounted proportional to the rate of random change in its magnitude across time, termed volatility. We find evidence to support this model across three experiments (total N=158). Firstly, using a task where participants chose when to sell products, whose price dynamics they previously learned, we show discounting increases in line with price volatility. Secondly, we show that this effect pertains over naturalistic delays of up to four months. Using functional magnetic resonance imaging, we observe a volatility-dependent decrease in functional hippocampal-prefrontal coupling during intertemporal choice. Thirdly, we replicate these effects in a larger online sample, finding that volatility discounting within each task correlates with baseline discounting outside of the task. We conclude that delay discounting partly reflects time-dependent uncertainty about reward magnitude, i.e. volatility. Our model captures how discounting adapts to volatility, thereby partly accounting for individual differences in impatience. Our imaging findings suggest a putative mechanism whereby uncertainty reduces prospective simulation of future outcomes.
Methods
Experiment 1
In Experiment 1 participants were briefed to imagine that they owned a farming business, selling produce to the highest bidder in a marketplace. Participants learned how the prices of three different products (wheat, chicken and beans) evolved week-by-week, where a week corresponded to a trial of the experiment (Figure 2). The three products had different levels of volatility in price evolution. Participants subsequently made intertemporal choices about when to sell each product, either immediately for a guaranteed price or in the marketplace following a delay.
Participant Recruitment and Sample Size
This experiment was designed as a pilot, and thereby focused on testing for larger, within-participant effects. Participants were recruited from the UCL Institute of Cognitive Neuroscience subject database. 20 participants (mean age 27.4 years, s.d. 6.9 years; 9 female) completed the experiment.
Baseline Discounting
Prior to the main task we elicited discount functions for riskless quantities of money. Participants were required to indicate the smallest immediate monetary reward, termed their indifference amount, that they would be willing to accept instead of a larger stated quantity of money (£8, £9, £11 or £12) to be received at a specified delay (1, 2, 4, 26 or 52 weeks). Each delay was presented twice for each larger reward amount, creating 40 choices in total. One choice was selected to be paid for real, at the stated delay, in post-dated Amazon vouchers. To achieve this in an incentive-compatible manner, for the selected choice, we randomly selected an immediate reward from a uniform distribution between £0 and the magnitude of the larger reward (e.g., £12); if this amount was below or equal to the participant’s stated indifference point, they received the delayed reward; if above the indifference point, they received the randomly-drawn immediate reward. Participants were fully briefed on this procedure.
Three participants who answered £0 in response to all baseline questions were excluded from this analysis.
Learning Price Dynamics
During the task, participants observed and predicted the price of each product, displayed on a linear scale ranging from £0 to £25, as it evolved over the course of 240 trials. Each trial of the experiment was described as a ‘week’. After passively observing prices over several ‘weeks’ (trials), participants were asked to predict upcoming prices one week ahead; the task therefore involved both observational and instrumental learning. Participants were instructed about two sources of variability in prices: Gaussian emission noise, applying equally to all products, which we described as ‘variability in bidding’, and changes in the underlying ‘market price’. For one of the three products (‘No Volatility’) the market price was held constant; the market price of the other two products (‘Low Volatility’ and ‘High Volatility’) underwent random changes across time, with the same Gaussian emission noise. We used two predefined sequences of outcomes for each product; participants were then allocated at random to one of the two sequences. We estimated learning rates for the three products separately by fitting a Rescorla-Wagner learning model (Rescorla & Rescorla, 1967) to participants’ price predictions from the first block of 70 prediction trials.
Intertemporal Choice Procedure
At three points during each block, participants were asked to predict the market price further into the future, at delays of 1, 4, 7, 12 or 18 weeks. Participants subsequently chose when to sell the product, either immediately for a fixed price (x), or on the market after a stated delay (1, 4, 7, 12 or 18 weeks). Specifically, they were asked to indicate the smallest fixed price that would just tempt them away from selling on the market. Participants were informed that the future price would evolve according to the same process they had previously observed, and was also subject to the same Gaussian emission noise. By contrast, the immediate price was fixed, with no objective risk. Participants were informed that, after the experiment, we would select one of their choices to be paid out for real. To realise this in an incentive-compatible manner, for the selected choice, we randomly selected an immediate fixed price from a uniform distribution between £0 and £25; if this amount was below the participant’s stated indifference point, they received the simulated future market price for the product as a bonus payment. If the selected price was above the participant’s indifference point they received the randomly-drawn fixed price. All bonus payments were made on the same day, at the end of the experiment.
Trial Structure of Learning Phase
For a ‘No Volatility’ product the market price was held constant. The market price of the other two products (‘Low Volatility’ and ‘High Volatility’) underwent random changes across time. Price trajectories for these two products were simulated by implementing a time-dependent probability that the market price would change to a new value, selected from a uniform distribution between specified bounds. For a ‘Low Volatility’ product, changes in the market price were small, while for a ‘High Volatility’ product, changes were more extreme.
Within each block, participants performed three phases of observation and prediction: the first consisted of 70 observation trials followed by 70 prediction trials, while the subsequent two phases each consisted of 45 observation trials and 5 prediction trials. After each phase the price evolution was paused whilst participants made a set of intertemporal choices. Learning rates were fitted based on the first 70 prediction trials; subsequent prediction phases were included to ensure that participants attended to prices before making intertemporal choices.
Experiment 2
Experiment 2 tested whether the effects observed in Experiment 1 replicated in a larger sample, and also probed neural correlates of volatility discounting. Here, to test whether effects of volatility extend to timescales used in conventional discounting tasks, we superimposed the timescale of the task onto longer delays. Specifically, one actual intertemporal choice was selected to be paid out at the stated delay, in the order of weeks. To further test the veridicality of the model, we measured risk aversion outside the main task, and elicited participants’ subjective estimates of future uncertainty within-task.
Methods
Learning Phase
Participants learned price dynamics according to a similar procedure as described for Experiment 1. Here only two products were used, to simplify the neuroimaging analysis. For one of the two products (‘Stable’) the market price was held constant at £25, and participants were explicitly informed about this; the market price of the other product (‘Volatile’) evolved according to a Gaussian random walk, with zero mean drift and volatility σ=3.5, upper bounded at £50 and lower bounded at £0. We used two predefined sequences of outcomes sampled from a random walk with these properties; participants were then allocated at random to one of the two sequences. Participants first passively observed the price of each product, displayed on a linear scale ranging from £0 to £50, as it evolved over the course of 240 trials. Over a further 240 trials they were asked to predict upcoming prices. Prices for the two products were displayed in randomly ordered mini-blocks of 60 trials in length; at the start of each block the market price was reset to £25. Price predictions followed the same procedure as in Experiment 1. For the Stable product, participants were instructed the future market price would remain constant at £25, whereas for the Volatile product the future market price would drift according to the very same process they had previously observed. In both conditions, future prices were also subject to the same degree of emission noise.
Description of Emission Noise in the Learning Phase
During the learning phase, participants were explicitly instructed about two sources of variability in prices: an irreducible Gaussian noise (σ=2) applying equally to both items, which we described as ‘variability in online bidding’, and drift in the underlying ‘market price’. To facilitate this explanation, in a
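A hedged R sketch (ours, not the authors' task code) of the 'Volatile' price process described above: a Gaussian random walk with zero mean drift and volatility σ = 3.5, bounded between £0 and £50, and observed with Gaussian emission noise (σ = 2):

set.seed(1)
n_trials <- 240
market <- numeric(n_trials)
market[1] <- 25                                   # price is reset to £25 at the start of a block
for (t in 2:n_trials) {
  market[t] <- min(50, max(0, market[t - 1] + rnorm(1, mean = 0, sd = 3.5)))   # bounded random walk
}
observed <- market + rnorm(n_trials, mean = 0, sd = 2)   # 'variability in online bidding'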
https://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for the details of the datasets.
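A hedged R sketch of the random data augmentation idea described in the abstract (ours, not the authors' implementation, which also includes time-warping and rotation); x is a window of tri-axial acceleration with rows as time steps and columns as axes:

augment_window <- function(x) {
  choice <- sample(c("none", "scaling", "jittering", "permutation"), 1)  # pick one technique at random per sample
  switch(choice,
         none        = x,
         scaling     = x * rnorm(1, mean = 1, sd = 0.1),                          # rescale the whole window
         jittering   = x + matrix(rnorm(length(x), sd = 0.05), nrow = nrow(x)),   # add sensor-like noise
         permutation = {                                                          # shuffle the window's segments in time
           segs <- split(seq_len(nrow(x)), cut(seq_len(nrow(x)), 4, labels = FALSE))
           x[unlist(segs[sample(length(segs))]), , drop = FALSE]
         })
}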
Original run configurations: R version = 3.3.3 Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Packages used: 'randomForest' (version 4.6-12) 'caret' (version 6.0-73)
https://spdx.org/licenses/CC0-1.0.html
As biparental care is crucial for breeding success in Procellariiformes seabirds (i.e., albatrosses and petrels), these species are expected to be choosy during pair formation. However, the choice of partners is limited in small-sized populations, which might lead to random pairing. In Procellariiformes, the consequences of such limitations for mating strategies have been examined in a single species. Here, we studied mate choice in another Procellariiforme, Bulwer’s petrel Bulweria bulwerii, in the Azores (ca 70 breeding pairs), where the species has suffered a dramatic population decline. We based our approach on both an 11-year demographic survey (capture-mark-recapture) and a genetic approach (microsatellites, n = 127 individuals). The genetic data suggest that this small population is not inbred and did not experience a genetic bottleneck. Moreover, pairing occurred randomly with respect to genetic relatedness, we detected no extrapair parentage (n = 35 offspring), and pair fecundity was unrelated to relatedness between partners. From our demographic survey, we detected no assortative mating with respect to body measurements and breeding experience and observed very few divorces, most of which were probably forced. This contrasts with the pattern previously observed in the much larger population from the Selvagens archipelago (assortative mating with respect to bill size and high divorce rate). We suggest that the Bulwer’s petrels from the Azores pair with any available partner and retain it as long as possible despite the fact that reproductive performance did not improve with pair common experience, possibly to avoid skipping breeding years in case of divorce. We recommend determining whether decreased choosiness during mate choice also occurs in reduced populations of other Procellariiform species. This might have implications for the conservation of small threatened seabird populations.
Methods
Field work was conducted on Vila islet, Santa Maria island, Azores archipelago, from 2002 to 2012 inclusive. Adults were captured in their nesting burrows each year during incubation, and ringed for identification. Chicks were ringed before fledging. These capture-mark-recapture sessions enabled us to know the life-history of each ringed individual, year after year, that is, the nest it was occupying (nesting cavities were marked with individual numbers), whether or not it was breeding, the outcomes of its breeding attempts, the identity of its social partner(s) and its offspring. Adults were measured (wing length using a stopped ruler to the nearest mm; tarsus length, culmen length and bill depth at the gonys using a vernier calliper to the nearest 0.1 mm).
Blood samples (50-100 µl) were collected from adults upon their first capture in 2002, 2003 and 2004. Chicks were sampled a few days after hatching. We extracted bird DNA using the QIAmp Tissue Kit (QIAGEN). Eleven microsatellite loci (autosomal loci Bb2, Bb3, Bb7, Bb10, Bb12, Bb20, Bb21, Bb22, Bb23, Bb25, plus the sex-linked Bb11, Molecular Ecology Resources Primer Development Consortium 2010) were amplified by Polymerase Chain Reaction (PCR). Genotypes (number of base pairs at each allele for each locus) were analysed using GeneMapper 4.0 (Applied Biosystems). 118 adults (57 males, 61 females), including those that were genotyped, plus the offspring from 2002 to 2004 inclusive, were sexed using molecular methods (Fridolfsson and Ellegren 1999, cited in our MS). The sex of 48 other adults (18 males, 30 females), including some chicks that later recruited into the breeding population, was inferred from that of their partner for which molecular sexing had been conducted.
To check if the demographic bottleneck experienced by Bulwer’s petrels in the Azores was associated with a genetic bottleneck, we used the BOTTLENECK software, which relies on the method of Cornuet and Luikart (1996, cited in our MS). Relatedness between social partners was estimated using MER (Wang 2002; version 3 downloadable from http://www.zoo.cam.ac.uk/ioz), after excluding the sex-linked locus Bb11.
We tested if there was an assortative mating based on body measurements or structural body size (PC1 scores of a Principal Component Analysis conducted on wing length, tarsus length and culmen length). To do this, we used two methods. First, we considered the pairs that were observed each year and we analysed our study years separately, after conducting Generalized Linear Models (GLMs) or Spearman rank correlations, according to whether or not the conditions for GLMs were met (that is, whether or not model residuals were normally distributed, Kéry and Hatfield 2003, cited in our MS). Second, we considered all the sexed pairs that were observed in our study together. In this situation, however, a given individual could be involved in several pair bonds (after e.g., the death of its former partner and/or a divorce). To overcome this problem, we used the MIXED procedure of SAS (with the Kenward-Roger degrees of freedom method, SAS Institute 2020), an equivalent of Generalized Linear Mixed Models which allows accounting for the correlations between observations concerning the same individual, can use data from individuals for which there are missing observations, allows within-individual effects to consist of continuous variables and to vary for the same individual, and analyses the data in their original form. To do this, we considered female (male) identity as a random effect.
To test whether pairing occurred at random with respect to genetic relatedness, we compared the relatedness of pair mates with that of male-female pairs drawn at random using a resampling procedure implemented in RESAMPLING PROCEDURES Version 1.3 (Howell 2001, cited in our MS), to account for non-independence of individual pairs. The procedure was repeated 5000 times.
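A minimal R sketch of this kind of resampling test (ours, not the RESAMPLING PROCEDURES run), assuming a data frame pairs with male_id and female_id columns and a relatedness(male, female) lookup function:

random_pairing_test <- function(pairs, relatedness, n_perm = 5000) {
  obs <- mean(mapply(relatedness, pairs$male_id, pairs$female_id))     # mean relatedness of observed pairs
  null <- replicate(n_perm, {
    mean(mapply(relatedness, pairs$male_id, sample(pairs$female_id)))  # re-pair females at random
  })
  # two-sided p-value: how extreme is the observed mean under random pairing?
  list(observed = obs, p_value = mean(abs(null - mean(null)) >= abs(obs - mean(null))))
}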
To conduct parentage analyses, we compared chick genotypes with those of their social parents, and we excluded paternity (maternity) when the genotype of a chick mismatched that of its social father (mother) at two loci at least. A single mismatch between offspring and parental genotypes was interpreted as a mutation.
Only birds known to have made at least one breeding attempt in the past were used when calculating mate fidelity rates and determining the causes of divorce. Mate fidelity was defined as 1 minus the probability of divorce, the latter parameter being the total number of divorces divided by the total number of pair × years when both previous partners survive from one year to the next during the study period (Black 1996, cited in our MS).
To determine whether (1) reproductive performance (i.e., the probability of fledging a chick) increased with pair common experience and (2) whether the probability of divorce depended on pair common experience and previous reproductive performance, we performed logistic regressions for repeated measures (GENMOD procedure of SAS, binomial distribution, logit link, with the pair as the 'repeated' subject). Results from these logistic regressions were obtained from the models using generalized estimating equations (GEE).
More details are given in the main text of our MS.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. process_cmip_rf.m is a MATLAB script that reads a single file, generates a random forest using the parameters in the associated paper, and computes permutation importance and sensitivities. Note: in order to get process_cmip_rf.m to work as written, you must have the Statistics and Machine Learning Toolbox installed in MATLAB and download the table_modis.asc file below.
Files 2-16 are tabular files containing all datapoints used in the Random Forest analysis for the NCAR CESM2 model. The columns are as follows (see the read-in sketch after this list):
1. Index of point, enabling a mapping back to the model grid if the resolution is known.
2. Longitude
3. Latitude
4. Month
5. Iron in mol/m3.
6. Mixed layer in m.
7. Ammonia in mol/m3
8. Nitrate in mol/m3.
9. Phytoplankton carbon in mol/m3.
10. Phosphate in mol/m3.
11. Shortwave radiation (net solar radiation at ocean surface in W/m2).
12. Silicate in mol/m3.
13. Salinity in PSU
14. Temperature in C.
15. Upwelling velocity in m/s.
If variable is not included in the dataset, the column will be filled with zeros.
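A minimal R sketch for loading one of these tables with the column names listed above (ours; the paper's own script is MATLAB, and the whitespace-delimited layout is an assumption):

cols <- c("index", "lon", "lat", "month", "iron_mol_m3", "mld_m", "ammonia_mol_m3",
          "nitrate_mol_m3", "phyto_carbon_mol_m3", "phosphate_mol_m3", "swrad_W_m2",
          "silicate_mol_m3", "salinity_psu", "temperature_C", "upwelling_m_s")
tab <- read.table("table_modis.asc", col.names = cols)
summary(tab$nitrate_mol_m3)   # quick sanity check of one predictor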
2. table_cesm2.asc: Data created from Danabasoglu, G., 2019, NCAR CESM model output prepared for CMIP6 CMIP esm-pi-control, http://doi.org/10.22033/ESGF/CMIP6.7579. Grid is 360x180x12.
3. table_cems2_fv2.asc: Data created from Danabasoglu, G., 2019, NCAR CESM-FV2 model output prepared for CMIP6 CMIP pi-control, http://doi.org/10.22033/ESGF/CMIP6.11301. Grid is 360x180x12.
4. table_cesm2_waccm.asc: Data created from Danabasoglu, G., 2019, NCAR CESM2-WACCM model output prepared for CMIP6 CMIP piControl http://doi.org/10.22033/ESGF/CMIP6.10094. Grid is 360x180x12
5. table_cesm2_waccm_fv2.asc: Data created from Danabasoglu, G., 2019, NCAR CESM-WACCM-FV2 model output prepared for CMIP CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.11302. Grid is 360x180x12
6. table_gfdl_cm4.asc: Data created from Guo, Huan; John, Jasmin G; Blanton, Chris et al,2018, NOAA-GFDL GFDL-CM4 model output piControl, http://doi.org/10.22033/ESGF/CMIP6.8666. Grid is 360x180x12
7. table_gfdl_esm4.asc: Data created from Krasting, John P.; John, Jasmin G; Blanton, Chris et al., 2018, NOAA-GFDL GFDL-ESM4 model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.8669. Grid is 360x180x12.
8. table_ipsl_cm5a2_inca.asc: Data created from Boucher, Olivier; Denvil, Sébastien; Levavasseur, Guillaume et al., 2021, IPSL IPSL-CM5A2-INCA model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.13683. Grid is 182x149x12.
9. table_ipsl_cm6a_lr.asc: Data created from Boucher, Olivier; Denvil, Sébastien; Levavasseur, Guillaume et al., 2018, IPSL IPSL-CM6A-LR model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.5251. Grid is 362x332x12.
10. table_mpi_esm1-2-ham.asc: Neubauer, David; Ferrachat, Sylvaine; Siegenthaler-Le Drian, Colombe et al., 2019: HAMMOZ-Consortium MPI-ESM1.2-HAM model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.5037. Grid is 256x220x12.
11. table_mpi_esm1-2-hr.asc: Data created from Jungclaus, Johann; Bittner, Matthias; Wieners, Karl-Hermann et al., 2019: MPI-M MPI-ESM1.2-HR model output prepared for CMIP6 CMIP piControl http://doi.org/10.22033/ESGF/CMIP6.6674. Grid is 802x404x12.
12. table_mpi_esm1-2-lr.asc: Data created from Wieners, Karl-Hermann; Giorgetta, Marco; Jungclaus, Johann et al., 2019, MPI-M MPI-ESM1.2-LR model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.6675. Grid is 256x220x12.
13. table_noresm2-lm.asc: Data created from Seland, Øyvind; Bentsen, Mats; Oliviè, Dirk Jan Leo et al., 2019, NCC NorESM2-LM model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.8217. Grid is 360x385x12.
14. table_noresm2-mm.asc: Data created from Bentsen, Mats; Oliviè, Dirk Jan Leo; Seland, Øyvind et al., 2019, NCC NorESM2-MM model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.8221. Grid is 360x385x12.
15-16. table_kostadinov.asc, table_modis.asc: Data is a merger of observational products and model output. Observational climatologies for temperature, salinity, mixed layer depth, silicate, phosphate, and nitrate were downloaded from the World Ocean Atlas (WOA) 2018 (Garcia et al., 2019; Locarnini et al., 2019; Zweng et al., 2019). MODIS-POC was downloaded from oceancolor.nasa.gov. Kostadinov POC is taken from https://doi.org/10.1594/PANGAEA.859005. Grid is 360x180x12.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the data for the Random Lake, WI population pyramid, which represents the Random Lake population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Random Lake Population by Age. You can refer to it here.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a collection of random numbers given by humans to answer the question: is there a pattern to the randomness of human choices? Could AI predict a pattern within a set of a human's random choices of 20 numbers?
It is a relatively small dataset, but it is quite comprehensive.
Social network analysis is a suite of approaches for exploring relational data. Two approaches commonly used to analyse animal social network data are permutation-based tests of significance and exponential random graph models. However, the performance of these approaches when analysing different types of network data has not been simultaneously evaluated. Here we test both approaches to determine their performance when analysing a range of biologically realistic simulated animal social networks. We examined the false positive and false negative error rate of an effect of a two-level explanatory variable (e.g. sex) on the number and combined strength of an individual’s network connections. We measured error rates for two types of simulated data collection methods in a range of network structures, and with/without a confounding effect and missing observations. Both methods performed consistently well in networks of dyadic interactions, and worse on networks constructed using observations...
Bullock et al. (Journal of Ecology 105:6-19, 2017) have suggested that the theory behind the Wald Analytical Long Distance (WALD) model for wind dispersal from a point source needs to be re-examined. This is on the basis that an inverse Gaussian probability density function (pdf) does not provide the best fit to seed shadows around individual source plants known to be dispersed by wind. We present two reasons why we would not necessarily expect any of the standard mechanistically derived pdfs to fit real seed shadows any better than empirical functions. Firstly, the derivation of “off-the-shelf” pdfs such as the Gaussian, exponential and inverse Gaussian involves only one of the processes and factors that together generate a real seed shadow. It is implausible to expect that a single-process model, no matter how sophisticated in detail, will capture the behaviour of an entire, complex system, which may involve a number of sequential random processes, or a superposition of parallel random processes, or both. Secondly, even if there is only one process involved and we have a perfect model for that process, the basic parameters of the model would be difficult to pin down precisely. Moreover, these parameters are unlikely to remain constant over a dispersal season, so that effectively we observe the outcome of a linear combination of dispersal events with different parameter values, constituting a form of averaging over the parameters of the distribution. Simple examples show that averaging a pdf over its parameters can lead to a pdf from an entirely different class. Synthesis. The failure of the inverse Gaussian model to fit seed shadow data is not in itself a reason to doubt the validity of the Wald Analytical Long Distance model for movement of particles through the air under specified environmental conditions. A greater awareness is needed of the differences between the Wald Analytical Long Distance and the inverse Gaussian (or Wald) and the purposes for which they are used. The complexity of dispersing populations of seeds means that any of the standard mechanistically derived pdfs will actually be merely empirical in this context. Shape and flexibility of a pdf is far more important for adequately describing data than some perceived higher status.
Dispersal simulation code: Matlab code that was used in our paper to simulate numbers of seeds landing in quadrats along a transect based on empirical pdfs with randomly varying parameters (Sim_dispresal.m).
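A brief worked illustration of the parameter-averaging point (ours, not taken from the paper): suppose each dispersal event follows an exponential kernel with rate $\lambda$, but $\lambda$ varies across events as a Gamma$(\alpha,\beta)$ random variable. The observed seed shadow is then the marginal density
$$f(x)=\int_0^\infty \lambda e^{-\lambda x}\,\frac{\beta^{\alpha}\lambda^{\alpha-1}e^{-\beta\lambda}}{\Gamma(\alpha)}\,d\lambda=\frac{\alpha\beta^{\alpha}}{(x+\beta)^{\alpha+1}},$$
a Lomax (Pareto type II) density with a power-law tail, i.e. a pdf from an entirely different, heavier-tailed class than the exponential that generated each individual event.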
Disclaimer: This is artificially generated data, created using a Python script based on the arbitrary assumptions listed below.
The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.
----- Version 1 -------
trainingDataV1.csv, testDataV1.csv or trainingData.csv, testData.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. hour: The hour of the day (integer, 0-23)
6. weekend: A boolean indicating whether it is the weekend (True or False)
The data also includes a label for each user indicating whether they are likely to buy a smart watch or not (string, "yes" or "no"). The label is determined based on the following arbitrary conditions:
- If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch)
- If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (i.e., assuming sales are 30% more likely to occur on weekends)
- If the user is male and under 30 with an income over 75,000, the label is "yes"
- If the user is female and 30 or over with an income over 100,000, the label is "yes"
- Otherwise, the label is "no"
The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.
The following Python script was used to generate this dataset:
import random
import csv

# Set the number of examples to generate
numExamples = 100000

# Generate the training data
with open("trainingData.csv", "w", newline="") as csvfile:
    fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])

        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
            buySmartWatch = "no"
        # assuming sales are 30% more likely to occur on weekends.
        elif weekend == True and random.random() < 1.3:
            buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
            buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
            buySmartWatch = "yes"
        else:
            buySmartWatch = "no"

        writer.writerow({
            "age": age,
            "income": income,
            "gender": gender,
            "maritalStatus": maritalStatus,
            "hour": hour,
            "weekend": weekend,
            "buySmartWatch": buySmartWatch
        })
----- Version 2 -------
trainingDataV2.csv, testDataV2.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
7. familySize: The number of people in the user's family (integer, 1-5)
8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
10. hour: The hour of the day when the user was surveyed (integer, 0-23)
11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)
Python script used to generate the data:
import random
import csv
# Set the number of examples to generate
numExamples = 100000
with open("t...
Rates of phenotypic evolution are central to many issues in paleontology, but traditional rate metrics such as darwins or haldanes are seldom used because of their strong dependence on interval length. In this paper, I argue that rates are usefully thought of as model parameters that relate magnitudes of evolutionary divergence to elapsed time. Starting with models of directional evolution, random walks, and stasis, I derive for each a reasonable rate metric. These metrics can be linked to existing approaches in evolutionary biology, and simulations show that they can be estimated accurately at any temporal resolution via maximum likelihood, but only when that metric's underlying model is true. The estimation of generational rates of a random walk under realistic paleontological conditions is compared with simulations to that of a prominent alternative approach, Gingerich's LRI (log-rate, log-interval) method. Generational rates are estimated poorly by LRI; they often reflect sampling e...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
Patients with Parkinson’s disease (PD) have an increased risk of sarcopenia, which is expected to negatively affect gait, leading to poor clinical outcomes including falls. In this study, we investigated the gait patterns of patients with PD with and without sarcopenia (sarcopenia and non-sarcopenia groups, respectively) using an app-derived program and explored if gait parameters could be utilized to predict sarcopenia based on machine learning.
Methods
Clinical and sarcopenia profiles were collected from patients with PD at Hoehn and Yahr (HY) stage ≤ 2. Sarcopenia was defined based on the updated criteria of the Asian Working Group for Sarcopenia. The gait patterns of the patients with and without sarcopenia were recorded and analyzed using a smartphone application. The random forest model was applied to predict sarcopenia in patients with PD.
Results
Data from 38 patients with PD were obtained, among which 9 (23.7%) were with sarcopenia. Clinical parameters were comparable between the sarcopenia and non-sarcopenia groups. Among various clinical and gait parameters, the average range of motion of the hip joint showed the highest association with sarcopenia. Based on the random forest algorithm, the combined difference in knee and ankle angles from standing still before walking to the maximum angle during walking (Kneeankle_diff), the difference between the angle when standing still before walking and the maximum angle during walking for the ankle (Ankle_dif), and the min angle of the hip joint (Hip_min) were the top three features that best predict sarcopenia. The accuracy of this model was 0.949.
Conclusions
Using a smartphone app and machine learning techniques, our study revealed gait parameters that are associated with sarcopenia and that help predict sarcopenia in PD. Our study showed the potential application of advanced technology in clinical research.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Though we dream of the day when humans will first walk on Mars, these dreams remain in the distance. For now, we explore vicariously by sending robotic agents like the Curiosity rover in our stead. Though our current robotic systems are extremely capable, they lack perceptual common sense. This characteristic will be increasingly needed as we create robotic extensions of humanity to reach across the stars, for several reasons. First, robots can go places that humans cannot. If we manage to get a human on Mars by 2035, as predicted by the current NASA timeline, this will still represent a 60 year lag from the time of the first robotic lander. Second, while it is possible to replace common sense in robots with human teleoperated control to some extent, this becomes infeasible as the distance to the base planet and the associated radio signal delay increase. Finally, as we pack more and more sensors onboard, the fraction of data that can be sent back to earth decreases. Data triage (finding the few frames containing a curious object on a planet's surface out of terabytes of data) becomes more important.
In the last few years, research into a class of scalable unsupervised algorithms, also called deep learning algorithms, has blossomed, in part due to state of the art performance in a number of areas. A common thread among many recent deep learning algorithms is that they tend to represent the world in ways similar to how our brains represent the world. For example, thanks to decades of work by neuroscientists, we now know that in the V1 area of the visual cortex, the first region that visual information passes through after the retina, neurons tune themselves to respond to oriented edges and do so in a way that groups them together based on similarity. With this behavior as a goal, researchers set out to devise simple algorithms that reproduce this effect. It turns out that there are several. One, known as Topographic Independent Component Analysis, has each neuron start with random connections and then look for patterns that are statistically out of the ordinary. When it finds one, it locks onto this pattern, discouraging other neurons from duplicating its findings but simultaneously trying to group itself with other neurons that have learned patterns which are similar, but not identical.
My proposed research plan is to develop existing and new unsupervised learning algorithms of this type and apply them to a robotic system. Specifically, I will demonstrate a prototype system capable of (1) learning about itself and its environment and of (2) actively carrying out experiments to learn more about itself and its environment. Research will be kept focused by developing a system aimed at eventual deployment on an unmanned space mission. Key components of the project will include synthetic data experiments, experiments on data recorded from a real robot, and finally experiments with learning in the loop as the robot explores its environment and learns actively.
The unsupervised algorithms in question are applicable not only to a single domain, but to creating models for a wide range of applications. Thus, advances are likely to have far-reaching implications for many areas of autonomous space exploration. Tantalizing though this is, it is equally exciting that unsupervised learning is already finding application with surprisingly impressive performance right now, indicating great promise for near-term application to unmanned space exploration.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of the study was to model annual total nitrogen (TN) and total phosphorus (TP) concentrations at the national level using an ML approach. We used water quality data originating from the Environmental Monitoring Database KESE to train RF models for nutrient concentration prediction in 242 catchments across Estonia. A total of 82 environmental variables were used as predictors in the models. In order to yield the best results, a feature selection strategy along with hyperparameter optimization was performed when building the models. The models are applicable for predicting nutrient loads on an annual level, e.g. for the purpose of reporting national-level water quality statistics in regional projects, such as HELCOM. The results showed that this relatively basic RF modeling approach can achieve performance similar to process-based models. Moreover, these models are easier to reuse and apply on a larger scale, since the required inputs can be derived from freely available datasets (e.g. satellite imagery).
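A hedged R sketch of this kind of RF tuning workflow (ours, not the repository's pipeline), assuming a data frame catchments with a TN target column and the environmental predictors; column names are illustrative:

library(caret)
ctrl <- trainControl(method = "cv", number = 5)                 # cross-validation for tuning
rf_fit <- train(TN ~ ., data = catchments, method = "rf",
                trControl = ctrl,
                tuneGrid = expand.grid(mtry = c(5, 10, 20)),    # simple hyperparameter search
                importance = TRUE)
print(rf_fit)     # cross-validated performance for each mtry
varImp(rf_fit)    # predictor importance, useful for feature selection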
This repository contains the input data used for building the RF models and the files describing the modeling results.
The description of the files is given in the README.txt file.
Virro, H., Kmoch, A., Vainu, M. and Uuemaa, E., 2022. Random forest-based modeling of stream nutrients at national level in a data-scarce region. Science of The Total Environment, 840, p.156613.