https://creativecommons.org/publicdomain/zero/1.0/
9. Plot the decision tree
Average customer churn is 27%. Churn can occur when tenure is >= 7.5 and there is no internet service.
The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
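A minimal R sketch of this step (not the notebook's own code), assuming the churn data has already been split into train and test data frames with a factor Churn column:

library(rpart)
library(rpart.plot)
# Fit and plot a classification tree for churn
tree_model <- rpart(Churn ~ ., data = train, method = "class")
rpart.plot(tree_model)            # each node shows the predicted class and churn proportion
tree_model$variable.importance    # ranks predictors such as internet service and tenure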
Run library(randomForest). Here we use the default ntree (500) and the default mtry (floor(sqrt(p)) for classification, where p is the number of independent variables).
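A hedged sketch of the default fit described here, using the same assumed train data frame:

library(randomForest)
set.seed(123)
# Default fit: ntree = 500 and, for classification, mtry = floor(sqrt(p))
rf_default <- randomForest(Churn ~ ., data = train)
print(rf_default)   # shows the OOB error rate and the class-wise confusion matrix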
From the confusion matrix, the accuracy is 79.27%, marginally higher than the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".
Plot the model to show which variables reduce the Gini impurity the most and the least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
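A hedged sketch of how these two checks are typically produced in R (not the notebook's own code), assuming the held-out test data frame and the caret package for the confusion matrix:

library(caret)
pred <- predict(rf_default, newdata = test)
confusionMatrix(pred, test$Churn)   # overall accuracy and per-class error rates
varImpPlot(rf_default)              # MeanDecreaseGini: which variables reduce Gini impurity most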
Tune the model: mtry = 2 has the lowest OOB error rate.
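A hedged sketch of the tuning step using randomForest::tuneRF to search mtry by OOB error (predictor and response columns assumed as before):

set.seed(123)
# Search mtry values by OOB error; stepFactor controls how mtry is changed each step
tuned <- tuneRF(x = subset(train, select = -Churn), y = train$Churn,
                stepFactor = 1.5, improve = 0.01, ntreeTry = 200, trace = TRUE)
tuned   # matrix of mtry values and their OOB errors; per the text, mtry = 2 was lowest here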
Use random forest with mtry = 2 and ntree = 200
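A hedged sketch of the refit with the tuned settings (same assumed train/test split):

set.seed(123)
rf_tuned <- randomForest(Churn ~ ., data = train, mtry = 2, ntree = 200)
pred_tuned <- predict(rf_tuned, newdata = test)
confusionMatrix(pred_tuned, test$Churn)   # compare accuracy against the default fit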
From the confusion matrix, the accuracy is 79.71%, marginally higher than the default model's 79.27% (ntree = 500, mtry = 4) and the decision tree's 79.00%. The error rate remains quite low when predicting "No" and much higher when predicting "Yes".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicting the execution time of model transformations can help to understand how a transformation reacts to a given input model without creating and transforming the respective model.
In our previous data set (https://doi.org/10.5281/zenodo.8385957), we have documented our experiments in which we predict the performance of ATL transformations using predictive models obtained from training linear regression, random forest and support vector regression. As input for the prediction, our approach uses a characterization of the input model. In these experiments, we only used data from real models.
However, a common problem is that transformation developers do not have enough models available to use such a prediction approach. Therefore, in a new variant of our experiments, we investigated whether the three considered machine learning approaches can predict the performance of transformations if we use data from generated models for training. We also investigated whether it is possible to achieve good predictions with smaller training data. The dataset provided here offers the corresponding raw data, scripts, and results.
Detailed documentation is available in documentaion.pdf.
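A hedged R sketch of the prediction setup described above (ours, not the experiment scripts); the data frame runs and its feature columns are hypothetical placeholders for the input-model characterization:

library(randomForest)
set.seed(42)
# Regression forest: execution time as a function of input-model characteristics
rf_time <- randomForest(exec_ms ~ n_elements + n_references + depth, data = runs)
predict(rf_time, newdata = data.frame(n_elements = 5000, n_references = 12000, depth = 7))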
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer
This is the first release of the Global Ensemble Digital Terrain Model (GEDTM30). Use for testing purposes only. A publication describing the methods used has been submitted to PeerJ and is currently under review. This work was funded by the European Union. However, the views and opinions expressed are solely those of the author(s) and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. The data is provided "as is." The Open-Earth-Monitor project consortium, along with its suppliers and licensors, hereby disclaims all warranties of any kind, express or implied, including, without limitation, warranties of merchantability, fitness for a particular purpose, and non-infringement. Neither the Open-Earth-Monitor project consortium nor its suppliers and licensors make any warranty that the website will be error-free or that access to it will be continuous or uninterrupted. You understand that you download or otherwise obtain content or services from the website at your own discretion and risk.
Description
GEDTM30 is presented as a 1-arc-second (~30 m) global Digital Terrain Model (DTM) generated using machine-learning-based data fusion. It was trained using a global-to-local Random Forest model with ICESat-2 and GEDI data, incorporating almost 30 billion high-quality points. To see the documentation, please visit the GEDTM30 GitHub (https://github.com/openlandmap/GEDTM30). This dataset covers the entire world and can be used for applications such as topography, hydrology, and geomorphometry analysis.
Dataset Contents
This dataset includes:
- GEDTM30: the predicted terrain height.
- Uncertainty of the GEDTM30 prediction: an uncertainty map of the terrain prediction, derived from the standard deviation of individual tree predictions in the Random Forest model.
Due to Zenodo's storage limitations, the original GEDTM30 dataset and its standard deviation map are provided via external links: GEDTM30 30m, Uncertainty of GEDTM30 prediction 30m.
Related Identifiers
- Landform: Slope in Degree, Geomorphons
- Light and Shadow: Positive Openness, Negative Openness, Hillshade
- Curvature: Minimal Curvature, Maximal Curvature, Profile Curvature, Tangential Curvature, Ring Curvature, Shape Index
- Local Topographic Position: Difference from Mean Elevation, Spherical Standard Deviation of the Normals
- Hydrology: Specific Catchment Area, LS Factor, Topographic Wetness Index
Data Details
- Time period: static
- Type of data: Digital Terrain Model
- How the data was collected or derived: machine learning models
- Statistical methods used: Random Forest
- Limitations or exclusions in the data: the dataset does not include data for Antarctica
- Coordinate reference system: EPSG:4326
- Bounding box (Xmin, Ymin, Xmax, Ymax): (-180, -65, 180, 85)
- Spatial resolution: 120 m
- Image size: 360,000 P x 178,219 L
- File format: Cloud Optimized GeoTIFF (COG)
Layer information:
Layer | Scale | Data Type | No Data
Ensemble Digital Terrain Model | 10 | Int32 | -2,147,483,647
Standard Deviation EDTM | 100 | UInt16 | 65,535
Code Availability
The primary development of GEDTM30 is documented in the GEDTM30 GitHub (https://github.com/openlandmap/GEDTM30). The current version (v1) code is compressed and uploaded as GEDTM30-main.zip. To access up-to-date development, please visit the GitHub page.
Support
If you discover a bug, artifact, or inconsistency, or if you have a question, please raise a GitHub issue here.
Naming convention
To ensure consistency and ease of use across and within the projects, we follow the standard Ai4SoilHealth and Open-Earth-Monitor file-naming convention. The convention uses 10 fields that describe important properties of the data, so users can search files and prepare data analyses without needing to open the files. For example, for edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif, the fields are:
- generic variable name: edtm = ensemble digital terrain model
- variable procedure combination: rf = random forest
- position in the probability distribution / variable type: m = mean | sd = standard deviation
- spatial support: 120m
- depth reference: s = surface
- time reference begin time: 20000101 = 2000-01-01
- time reference end time: 20231231 = 2023-12-31
- bounding box: go = global
- EPSG code: EPSG:4326
- version code: v20250130 = version from 2025-01-30
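A minimal R sketch (ours, not part of the repository) that splits a file name following this 10-field convention into named parts:

fname  <- "edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif"
fields <- strsplit(sub("\\.tif$", "", fname), "_")[[1]]   # drop the extension, split on underscores
names(fields) <- c("variable", "procedure", "statistic", "spatial_support",
                   "depth_reference", "time_begin", "time_end",
                   "bounding_box", "epsg_code", "version")
fields["procedure"]   # "rf"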
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data handling is one of the main problems in modelling, particularly if the missingness is of type missing-not-at-random (MNAR), where missingness occurs due to the actual value of the observation. The focus of the current article is generalized linear modelling of fully observed binary response variables depending on at least one MNAR covariate. For the traditional analysis of such models, an individual model for the probability of missingness is assumed and incorporated in the model framework. However, this probability model is untestable, as the missingness of MNAR data depends on the actual values that would otherwise have been observed. In this article, we consider creating a model space that consists of all possible and plausible models for the probability of missingness and develop a hybrid method in which a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm is combined with Bayesian Model Averaging (BMA). RJMCMC is adopted to obtain posterior estimates of model parameters as well as the probability of each model in the model space. BMA is used to synthesize coefficient estimates from all models in the model space while accounting for model uncertainty. Through a validation study with a simulated data set and a real data application, the performance of the proposed methodology is found to be satisfactory in terms of accuracy and efficiency of estimates.
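A minimal R sketch of the BMA synthesis step (ours, not the article's code), assuming the RJMCMC run has already returned posterior model probabilities and per-model posterior means for a coefficient of interest; the numbers are purely hypothetical:

# hypothetical posterior model probabilities p(M_k | y) from RJMCMC
post_model_prob <- c(M1 = 0.55, M2 = 0.30, M3 = 0.15)
# hypothetical posterior means of the same coefficient under each model
beta_by_model <- c(M1 = 0.82, M2 = 0.74, M3 = 0.91)
# BMA point estimate: per-model estimates weighted by posterior model probabilities
beta_bma <- sum(post_model_prob * beta_by_model)
beta_bma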
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many complex diseases are caused by a variety of both genetic and environmental factors acting in conjunction. To help understand these relationships, nonparametric methods that use aggregate learning have been developed such as random forests and conditional forests. Molinaro et al. (2010) described a powerful, single model approach called partDSA that has the advantage of producing interpretable models. We propose two extensions to the partDSA algorithm called bagged partDSA and boosted partDSA. These algorithms achieve higher prediction accuracies than individual partDSA objects through aggregating over a set of partDSA objects. Further, by using partDSA objects in the ensemble, each base learner creates decision rules using both “and” and “or” statements, which allows for natural logical constructs. We also provide four variable ranking techniques that aid in identifying the most important individual factors in the models. In the regression context, we compared bagged partDSA and boosted partDSA to random forests and conditional forests. Using simulated and real data, we found that bagged partDSA had lower prediction error than the other methods if the data were generated by a simple logic model, and that it performed similarly for other generating mechanisms. We also found that boosted partDSA was effective for a particularly complex case. Taken together these results suggest that the new methods are useful additions to the ensemble learning toolbox. We implement these algorithms as part of the partDSA R package. Supplementary materials for this article are available online.
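A generic R sketch of the bagging idea behind these ensembles (ours; it uses rpart as a stand-in base learner for a numeric response and is not the partDSA package's API):

library(rpart)
# Bagging: fit B base learners on bootstrap resamples and average their predictions
bagged_predict <- function(formula, data, newdata, B = 100) {
  preds <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap resample of the training data
    fit <- rpart(formula, data = boot)                  # one base learner per resample
    predict(fit, newdata = newdata)                     # its predictions on new data
  })
  rowMeans(preds)                                       # aggregate by averaging (regression setting)
}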
https://spdx.org/licenses/CC0-1.0.html
Humans discount delayed relative to more immediate reward. A plausible explanation is that impatience arises partly from uncertainty, or risk, implicit in delayed reward. Existing theories of discounting-as-risk focus on a probability that delayed reward will not materialize. By contrast, we examine how uncertainty in the magnitude of delayed reward contributes to delay discounting. We propose a model wherein reward is discounted proportional to the rate of random change in its magnitude across time, termed volatility. We find evidence to support this model across three experiments (total N=158). Firstly, using a task where participants chose when to sell products, whose price dynamics they previously learned, we show discounting increases in line with price volatility. Secondly, we show that this effect pertains over naturalistic delays of up to four months. Using functional magnetic resonance imaging, we observe a volatility-dependent decrease in functional hippocampal-prefrontal coupling during intertemporal choice. Thirdly, we replicate these effects in a larger online sample, finding that volatility discounting within each task correlates with baseline discounting outside of the task. We conclude that delay discounting partly reflects time-dependent uncertainty about reward magnitude, i.e. volatility. Our model captures how discounting adapts to volatility, thereby partly accounting for individual differences in impatience. Our imaging findings suggest a putative mechanism whereby uncertainty reduces prospective simulation of future outcomes.
Methods
Experiment 1
In Experiment 1 participants were briefed to imagine that they owned a farming business, selling produce to the highest bidder in a marketplace. Participants learned how the prices of three different products (wheat, chicken and beans) evolved week-by-week, where a week corresponded to a trial of the experiment (Figure 2). The three products had different levels of volatility in price evolution. Participants subsequently made intertemporal choices about when to sell each product, either immediately for a guaranteed price or in the marketplace following a delay.
Participant Recruitment and Sample Size
This experiment was designed as a pilot, and thereby focused on testing for larger, within-participant effects. Participants were recruited from the UCL Institute of Cognitive Neuroscience subject database. 20 participants (mean age 27.4 years, s.d. 6.9 years; 9 female) completed the experiment.
Baseline Discounting
Prior to the main task we elicited discount functions for riskless quantities of money. Participants were required to indicate the smallest immediate monetary reward, termed their indifference amount, that they would be willing to accept instead of a larger stated quantity of money (£8, £9, £11 or £12) to be received at a specified delay (1, 2, 4, 26 or 52 weeks). Each delay was presented twice for each larger reward amount, creating 40 choices in total. One choice was selected to be paid for real, at the stated delay, in post-dated Amazon vouchers. To achieve this in an incentive-compatible manner, for the selected choice, we randomly selected an immediate reward from a uniform distribution between £0 and the magnitude of the larger reward (e.g., £12); if this amount was below or equal to the participant’s stated indifference point, they received the delayed reward; if above the indifference point, they received the randomly-drawn immediate reward. Participants were fully briefed on this procedure.
Three participants who answered £0 in response to all baseline questions were excluded from this analysis.
Learning Price Dynamics
During the task, participants observed and predicted the price of each product, displayed on a linear scale ranging from £0 to £25, as it evolved over the course of 240 trials. Each trial of the experiment was described as a ‘week’. After passively observing prices over several ‘weeks’ (trials), participants were asked to predict upcoming prices one week ahead; the task therefore involved both observational and instrumental learning. Participants were instructed about two sources of variability in prices: Gaussian emission noise, applying equally to all products, which we described as ‘variability in bidding’, and changes in the underlying ‘market price’. For one of the three products (‘No Volatility’) the market price was held constant; the market price of the other two products (‘Low Volatility’ and ‘High Volatility’) underwent random changes across time, with the same Gaussian emission noise. We used two predefined sequences of outcomes for each product; participants were then allocated at random to one of the two sequences. We estimated learning rates for the three products separately by fitting a Rescorla-Wagner learning model (Rescorla & Rescorla, 1967) to participants’ price predictions from the first block of 70 prediction trials.
Intertemporal Choice Procedure
At three points during each block, participants were asked to predict the market price further into the future, at delays of 1, 4, 7, 12 or 18 weeks. Participants subsequently chose when to sell the product, either immediately for a fixed price (x), or on the market after a stated delay (1, 4, 7, 12 or 18 weeks). Specifically, they were asked to indicate the smallest fixed price that would just tempt them away from selling on the market. Participants were informed that the future price would evolve according to the same process they had previously observed, and was also subject to the same Gaussian emission noise. By contrast, the immediate price was fixed, with no objective risk. Participants were informed that, after the experiment, we would select one of their choices to be paid out for real. To realise this in an incentive-compatible manner, for the selected choice, we randomly selected an immediate fixed price from a uniform distribution between £0 and £25; if this amount was below the participant’s stated indifference point, they received the simulated future market price for the product as a bonus payment. If the selected price was above the participant’s indifference point they received the randomly-drawn fixed price. All bonus payments were made on the same day, at the end of the experiment.
Trial Structure of Learning Phase
For a ‘No Volatility’ product the market price was held constant. The market price of the other two products (‘Low Volatility’ and ‘High Volatility’) underwent random changes across time. Price trajectories for these two products were simulated by implementing a time-dependent probability that the market price would change to a new value, selected from a uniform distribution between specified bounds. For a ‘Low Volatility’ product, changes in the market price were small, while for a ‘High Volatility’ product, changes were more extreme.
Within each block, participants performed three phases of observation and prediction: the first consisted of 70 observation trials followed by 70 prediction trials, while the subsequent two phases each consisted of 45 observation trials and 5 prediction trials. After each phase the price evolution was paused whilst participants made a set of intertemporal choices. Learning rates were fitted based on the first 70 prediction trials; subsequent prediction phases were included to ensure that participants attended to prices before making intertemporal choices.
Experiment 2
Experiment 2 tested whether the effects observed in Experiment 1 replicated in a larger sample, and also probed neural correlates of volatility discounting. Here, to test whether effects of volatility extend to timescales used in conventional discounting tasks, we superimposed the timescale of the task onto longer delays. Specifically, one actual intertemporal choice was selected to be paid out at the stated delay, in the order of weeks. To further test the veridicality of the model, we measured risk aversion outside the main task, and elicited participants’ subjective estimates of future uncertainty within-task.
Methods
Learning Phase
Participants learned price dynamics according to a similar procedure as described for Experiment 1. Here only two products were used, to simplify the neuroimaging analysis. For one of the two products (‘Stable’) the market price was held constant at £25, and participants were explicitly informed about this; the market price of the other product (‘Volatile’) evolved according to a Gaussian random walk, with zero mean drift and volatility σ=3.5, upper bounded at £50 and lower bounded at £0. We used two predefined sequences of outcomes sampled from a random walk with these properties; participants were then allocated at random to one of the two sequences. Participants first passively observed the price of each product, displayed on a linear scale ranging from £0 to £50, as it evolved over the course of 240 trials. Over a further 240 trials they were asked to predict upcoming prices. Prices for the two products were displayed in randomly ordered mini-blocks of 60 trials in length; at the start of each block the market price was reset to £25. Price predictions followed the same procedure as in Experiment 1. For the Stable product, participants were instructed the future market price would remain constant at £25, whereas for the Volatile product the future market price would drift according to the very same process they had previously observed. In both conditions, future prices were also subject to the same degree of emission noise.
Description of Emission Noise in the Learning Phase
During the learning phase, participants were explicitly instructed about two sources of variability in prices: an irreducible Gaussian noise (σ=2) applying equally to both items, which we described as ‘variability in online bidding’, and drift in the underlying ‘market price’. To facilitate this explanation, in a
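A hedged R sketch (ours, not the authors' task code) of the 'Volatile' price process described above: a Gaussian random walk with zero mean drift and volatility σ = 3.5, bounded between £0 and £50, and observed with Gaussian emission noise (σ = 2):

set.seed(1)
n_trials <- 240
market <- numeric(n_trials)
market[1] <- 25                                   # price is reset to £25 at the start of a block
for (t in 2:n_trials) {
  market[t] <- min(50, max(0, market[t - 1] + rnorm(1, mean = 0, sd = 3.5)))   # bounded random walk
}
observed <- market + rnorm(n_trials, mean = 0, sd = 2)   # 'variability in online bidding'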
https://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for the details of the datasets.
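A hedged R sketch of the random data augmentation idea described in the abstract (ours, not the authors' implementation, which also includes time-warping and rotation); x is a window of tri-axial acceleration with rows as time steps and columns as axes:

augment_window <- function(x) {
  choice <- sample(c("none", "scaling", "jittering", "permutation"), 1)  # pick one technique at random per sample
  switch(choice,
         none        = x,
         scaling     = x * rnorm(1, mean = 1, sd = 0.1),                          # rescale the whole window
         jittering   = x + matrix(rnorm(length(x), sd = 0.05), nrow = nrow(x)),   # add sensor-like noise
         permutation = {                                                          # shuffle the window's segments in time
           segs <- split(seq_len(nrow(x)), cut(seq_len(nrow(x)), 4, labels = FALSE))
           x[unlist(segs[sample(length(segs))]), , drop = FALSE]
         })
}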
Original run configurations: R version = 3.3.3 Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Packages used: 'randomForest' (version 4.6-12) 'caret' (version 6.0-73)
https://spdx.org/licenses/CC0-1.0.html
As biparental care is crucial for breeding success in Procellariiformes seabirds (i.e., albatrosses and petrels), these species are expected to be choosy during pair formation. However, the choice of partners is limited in small-sized populations, which might lead to random pairing. In Procellariiformes, the consequences of such limitations for mating strategies have been examined in a single species. Here, we studied mate choice in another Procellariiforme, Bulwer’s petrel Bulweria bulwerii, in the Azores (ca 70 breeding pairs), where the species has suffered a dramatic population decline. We based our approach on both an 11-year demographic survey (capture-mark-recapture) and a genetic approach (microsatellites, n = 127 individuals). The genetic data suggest that this small population is not inbred and did not experience a genetic bottleneck. Moreover, pairing occurred randomly with respect to genetic relatedness, we detected no extrapair parentage (n = 35 offspring), and pair fecundity was unrelated to relatedness between partners. From our demographic survey, we detected no assortative mating with respect to body measurements and breeding experience and observed very few divorces, most of which were probably forced. This contrasts with the pattern previously observed in the much larger population from the Selvagens archipelago (assortative mating with respect to bill size and high divorce rate). We suggest that the Bulwer’s petrels from the Azores pair with any available partner and retain it as long as possible despite the fact that reproductive performance did not improve with pair common experience, possibly to avoid skipping breeding years in case of divorce. We recommend determining whether decreased choosiness during mate choice also occurs in reduced populations of other Procellariiform species. This might have implications for the conservation of small threatened seabird populations.
Methods
Field work was conducted on Vila islet, Santa Maria island, Azores archipelago, from 2002 to 2012 inclusive. Adults were captured in their nesting burrows each year during incubation, and ringed for identification. Chicks were ringed before fledging. These capture-mark-recapture sessions enabled us to know the life-history of each ringed individual, year after year, that is, the nest it was occupying (nesting cavities were marked with individual numbers), whether or not it was breeding, the outcomes of its breeding attempts, the identity of its social partner(s) and its offspring. Adults were measured (wing length using a stopped ruler to the nearest mm; tarsus length, culmen length and bill depth at the gonys using a vernier calliper to the nearest 0.1 mm).
Blood samples (50-100 µl) were collected from adults upon their first capture in 2002, 2003 and 2004. Chicks were sampled a few days after hatching. We extracted bird DNA using the QIAmp Tissue Kit (QIAGEN). Eleven microsatellite loci (autosomal loci Bb2, Bb3, Bb7, Bb10, Bb12, Bb20, Bb21, Bb22, Bb23, Bb25, plus the sex-linked Bb11, Molecular Ecology Resources Primer Development Consortium 2010) were amplified by Polymerase Chain Reaction (PCR). Genotypes (number of base pairs at each allele for each locus) were analysed using GeneMapper 4.0 (Applied Biosystems). 118 adults (57 males, 61 females), including those that were genotyped, plus the offspring from 2002 to 2004 inclusive, were sexed using molecular methods (Fridolfsson and Ellegren 1999, cited in our MS). The sex of 48 other adults (18 males, 30 females), including some chicks that later recruited into the breeding population, was inferred from that of their partner for which molecular sexing had been conducted.
To check if the demographic bottleneck experienced by Bulwer’s petrels in the Azores was associated with a genetic bottleneck, we used the BOTTLENECK software, which relies on the method of Cornuet and Luikart (1996, cited in our MS). Relatedness between social partners was estimated using MER (Wang 2002; version 3 downloadable from http://www.zoo.cam.ac.uk/ioz), after excluding the sex-linked locus Bb11.
We tested if there was an assortative mating based on body measurements or structural body size (PC1 scores of a Principal Component Analysis conducted on wing length, tarsus length and culmen length). To do this, we used two methods. First, we considered the pairs that were observed each year and we analysed our study years separately, after conducting Generalized Linear Models (GLMs) or Spearman rank correlations, according to whether or not the conditions for GLMs were met (that is, whether or not model residuals were normally distributed, Kéry and Hatfield 2003, cited in our MS). Second, we considered all the sexed pairs that were observed in our study together. In this situation, however, a given individual could be involved in several pair bonds (after e.g., the death of its former partner and/or a divorce). To overcome this problem, we used the MIXED procedure of SAS (with the Kenward-Roger degrees of freedom method, SAS Institute 2020), an equivalent of Generalized Linear Mixed Models which allows accounting for the correlations between observations concerning the same individual, can use data from individuals for which there are missing observations, allows within-individual effects to consist of continuous variables and to vary for the same individual, and analyses the data in their original form. To do this, we considered female (male) identity as a random effect.
To test whether pairing occurred at random with respect to genetic relatedness, we compared the relatedness of pair mates with that of male-female pairs drawn at random using a resampling procedure implemented in RESAMPLING PROCEDURES Version 1.3 (Howell 2001, cited in our MS), to account for non-independence of individual pairs. The procedure was repeated 5000 times.
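A minimal R sketch of this kind of resampling test (ours, not the RESAMPLING PROCEDURES run), assuming a data frame pairs with male_id and female_id columns and a relatedness(male, female) lookup function:

random_pairing_test <- function(pairs, relatedness, n_perm = 5000) {
  obs <- mean(mapply(relatedness, pairs$male_id, pairs$female_id))     # mean relatedness of observed pairs
  null <- replicate(n_perm, {
    mean(mapply(relatedness, pairs$male_id, sample(pairs$female_id)))  # re-pair females at random
  })
  # two-sided p-value: how extreme is the observed mean under random pairing?
  list(observed = obs, p_value = mean(abs(null - mean(null)) >= abs(obs - mean(null))))
}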
To conduct parentage analyses, we compared chick genotypes with those of their social parents, and we excluded paternity (maternity) when the genotype of a chick mismatched that of its social father (mother) at two loci at least. A single mismatch between offspring and parental genotypes was interpreted as a mutation.
Only birds known to have made at least one breeding attempt in the past were used when calculating mate fidelity rates and determining the causes of divorce. Mate fidelity was defined as 1 minus the probability of divorce, the latter parameter being the total number of divorces divided by the total number of pair × years when both previous partners survive from one year to the next during the study period (Black 1996, cited in our MS).
To determine whether (1) reproductive performance (i.e., the probability of fledging a chick) increased with pair common experience and (2) whether the probability of divorce depended on pair common experience and previous reproductive performance, we performed logistic regressions for repeated measures (GENMOD procedure of SAS, binomial distribution, logit link, with the pair as the 'repeated' subject). Results from these logistic regressions were obtained from the models using generalized estimating equations (GEE).
More details are given in the main text of our MS.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. process_cmip_rf.m is a MATLAB script that reads a single file, generates a random forest using the parameters in the associated paper, and computes permutation importance and sensitivities. Note: in order to get process_cmip_rf.m to work as written, you must have the Statistics and Machine Learning Toolbox installed in MATLAB and download the table_modis.asc file below.
Files 2-16 are tabular files containing all datapoints used in the Random Forest analysis for the NCAR CESM2 model. The columns are as follows (see the read-in sketch after this list):
1. Index of point, enabling a mapping back to the model grid if the resolution is known.
2. Longitude
3. Latitude
4. Month
5. Iron in mol/m3.
6. Mixed layer in m.
7. Ammonia in mol/m3
8. Nitrate in mol/m3.
9. Phytoplankton carbon in mol/m3.
10. Phosphate in mol/m3.
11. Shortwave radiation (net solar radiation at ocean surface in W/m2).
12. Silicate in mol/m3.
13. Salinity in PSU
14. Temperature in C.
15. Upwelling velocity in m/s.
If variable is not included in the dataset, the column will be filled with zeros.
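A minimal R sketch for loading one of these tables with the column names listed above (ours; the paper's own script is MATLAB, and the whitespace-delimited layout is an assumption):

cols <- c("index", "lon", "lat", "month", "iron_mol_m3", "mld_m", "ammonia_mol_m3",
          "nitrate_mol_m3", "phyto_carbon_mol_m3", "phosphate_mol_m3", "swrad_W_m2",
          "silicate_mol_m3", "salinity_psu", "temperature_C", "upwelling_m_s")
tab <- read.table("table_modis.asc", col.names = cols)
summary(tab$nitrate_mol_m3)   # quick sanity check of one predictor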
2. table_cesm2.asc: Data created from Danabasoglu, G., 2019, NCAR CESM model output prepared for CMIP6 CMIP esm-pi-control, http://doi.org/10.22033/ESGF/CMIP6.7579. Grid is 360x180x12.
3. table_cems2_fv2.asc: Data created from Danabasoglu, G., 2019, NCAR CESM-FV2 model output prepared for CMIP6 CMIP pi-control, http://doi.org/10.22033/ESGF/CMIP6.11301. Grid is 360x180x12.
4. table_cesm2_waccm.asc: Data created from Danabasoglu, G., 2019, NCAR CESM2-WACCM model output prepared for CMIP6 CMIP piControl http://doi.org/10.22033/ESGF/CMIP6.10094. Grid is 360x180x12
5. table_cesm2_waccm_fv2.asc: Data created from Danabasoglu, G., 2019, NCAR CESM-WACCM-FV2 model output prepared for CMIP CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.11302. Grid is 360x180x12
6. table_gfdl_cm4.asc: Data created from Guo, Huan; John, Jasmin G; Blanton, Chris et al,2018, NOAA-GFDL GFDL-CM4 model output piControl, http://doi.org/10.22033/ESGF/CMIP6.8666. Grid is 360x180x12
7. table_gfdl_esm4.asc: Data created from Krasting, John P.; John, Jasmin G; Blanton, Chris et al., 2018, NOAA-GFDL GFDL-ESM4 model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.8669. Grid is 360x180x12.
8. table_ipsl_cm5a2_inca.asc: Data created from Boucher, Olivier; Denvil, Sébastien; Levavasseur, Guillaume et al., 2021, IPSL IPSL-CM5A2-INCA model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.13683. Grid is 182x149x12.
9. table_ipsl_cm6a_lr.asc: Data created from Boucher, Olivier; Denvil, Sébastien; Levavasseur, Guillaume et al., 2018, IPSL IPSL-CM6A-LR model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.5251. Grid is 362x332x12.
10. table_mpi_esm1-2-ham.asc: Neubauer, David; Ferrachat, Sylvaine; Siegenthaler-Le Drian, Colombe et al., 2019: HAMMOZ-Consortium MPI-ESM1.2-HAM model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.5037. Grid is 256x220x12.
11. table_mpi_esm1-2-hr.asc: Data created from Jungclaus, Johann; Bittner, Matthias; Wieners, Karl-Hermann et al., 2019: MPI-M MPI-ESM1.2-HR model output prepared for CMIP6 CMIP piControl http://doi.org/10.22033/ESGF/CMIP6.6674. Grid is 802x404x12.
12. table_mpi_esm1-2-lr.asc: Data created from Wieners, Karl-Hermann; Giorgetta, Marco; Jungclaus, Johann et al., 2019, MPI-M MPI-ESM1.2-LR model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.6675. Grid is 256x220x12.
13. table_noresm2-lm.asc: Data created from Seland, Øyvind; Bentsen, Mats; Oliviè, Dirk Jan Leo et al., 2019, NCC NorESM2-LM model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.8217. Grid is 360x385x12.
14. table_noresm2-mm.asc: Data created from Bentsen, Mats; Oliviè, Dirk Jan Leo; Seland, Øyvind et al., 2019, NCC NorESM2-MM model output prepared for CMIP6 CMIP piControl, http://doi.org/10.22033/ESGF/CMIP6.8221. Grid is 360x385x12.
15-16. table_kostadinov.asc, table_modis.asc: Data is a merger of observational products and model output. Observational climatologies for temperature, salinity, mixed layer depth, silicate, phosphate, and nitrate were downloaded from the World Ocean Atlas (WOA) 2018 (Garcia et al., 2019; Locarnini et al., 2019; Zweng et al., 2019). MODIS-POC was downloaded from oceancolor.nasa.gov. Kostadinov POC is taken from https://doi.org/10.1594/PANGAEA.859005. Grid is 360x180x12.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the data for the Random Lake, WI population pyramid, which represents the Random Lake population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Random Lake Population by Age. You can refer to it here.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a collection of random numbers given by humans to answer the question: is there a pattern to the randomness of human choices? Could AI predict a pattern within a set of a human's random choices of 20 numbers?
It is a relatively small dataset, but it is quite comprehensive.
Social network analysis is a suite of approaches for exploring relational data. Two approaches commonly used to analyse animal social network data are permutation-based tests of significance and exponential random graph models. However, the performance of these approaches when analysing different types of network data has not been simultaneously evaluated. Here we test both approaches to determine their performance when analysing a range of biologically realistic simulated animal social networks. We examined the false positive and false negative error rate of an effect of a two-level explanatory variable (e.g. sex) on the number and combined strength of an individual’s network connections. We measured error rates for two types of simulated data collection methods in a range of network structures, and with/without a confounding effect and missing observations. Both methods performed consistently well in networks of dyadic interactions, and worse on networks constructed using observations...
Bullock et al. (Journal of Ecology 105:6-19, 2017) have suggested that the theory behind the Wald Analytical Long Distance (WALD) model for wind dispersal from a point source needs to be re-examined. This is on the basis that an inverse Gaussian probability density function (pdf) does not provide the best fit to seed shadows around individual source plants known to be dispersed by wind. We present two reasons why we would not necessarily expect any of the standard mechanistically derived pdfs to fit real seed shadows any better than empirical functions. Firstly, the derivation of “off-the-shelf” pdfs such as the Gaussian, exponential and inverse Gaussian involves only one of the processes and factors that together generate a real seed shadow. It is implausible to expect that a single-process model, no matter how sophisticated in detail, will capture the behaviour of an entire, complex system, which may involve a number of sequential random processes, or a superposition of parallel random processes, or both. Secondly, even if there is only one process involved and we have a perfect model for that process, the basic parameters of the model would be difficult to pin down precisely. Moreover, these parameters are unlikely to remain constant over a dispersal season, so that effectively we observe the outcome of a linear combination of dispersal events with different parameter values, constituting a form of averaging over the parameters of the distribution. Simple examples show that averaging a pdf over its parameters can lead to a pdf from an entirely different class. Synthesis. The failure of the inverse Gaussian model to fit seed shadow data is not in itself a reason to doubt the validity of the Wald Analytical Long Distance model for movement of particles through the air under specified environmental conditions. A greater awareness is needed of the differences between the Wald Analytical Long Distance and the inverse Gaussian (or Wald) and the purposes for which they are used. The complexity of dispersing populations of seeds means that any of the standard mechanistically derived pdfs will actually be merely empirical in this context. Shape and flexibility of a pdf is far more important for adequately describing data than some perceived higher status.
Dispersal simulation code: Matlab code that was used in our paper to simulate numbers of seeds landing in quadrats along a transect based on empirical pdfs with randomly varying parameters (Sim_dispresal.m).
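A brief worked illustration of the parameter-averaging point (ours, not taken from the paper): suppose each dispersal event follows an exponential kernel with rate $\lambda$, but $\lambda$ varies across events as a Gamma$(\alpha,\beta)$ random variable. The observed seed shadow is then the marginal density
$$f(x)=\int_0^\infty \lambda e^{-\lambda x}\,\frac{\beta^{\alpha}\lambda^{\alpha-1}e^{-\beta\lambda}}{\Gamma(\alpha)}\,d\lambda=\frac{\alpha\beta^{\alpha}}{(x+\beta)^{\alpha+1}},$$
a Lomax (Pareto type II) density with a power-law tail, i.e. a pdf from an entirely different, heavier-tailed class than the exponential that generated each individual event.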
Disclaimer: This is artificially generated data, created using a Python script based on the arbitrary assumptions listed below.
The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.
----- Version 1 -------
trainingDataV1.csv, testDataV1.csv or trainingData.csv, testData.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. hour: The hour of the day (integer, 0-23)
6. weekend: A boolean indicating whether it is the weekend (True or False)
The data also includes a label for each user indicating whether they are likely to buy a smart watch or not (string, "yes" or "no"). The label is determined based on the following arbitrary conditions:
- If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch)
- If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (i.e., assuming sales are 30% more likely to occur on weekends)
- If the user is male and under 30 with an income over 75,000, the label is "yes"
- If the user is female and 30 or over with an income over 100,000, the label is "yes"
- Otherwise, the label is "no"
The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.
The following Python script was used to generate this dataset:
import random
import csv

# Set the number of examples to generate
numExamples = 100000

# Generate the training data
with open("trainingData.csv", "w", newline="") as csvfile:
    fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])

        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
            buySmartWatch = "no"
        # assuming sales are 30% more likely to occur on weekends.
        elif weekend == True and random.random() < 1.3:
            buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
            buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
            buySmartWatch = "yes"
        else:
            buySmartWatch = "no"

        writer.writerow({
            "age": age,
            "income": income,
            "gender": gender,
            "maritalStatus": maritalStatus,
            "hour": hour,
            "weekend": weekend,
            "buySmartWatch": buySmartWatch
        })
----- Version 2 -------
trainingDataV2.csv, testDataV2.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
7. familySize: The number of people in the user's family (integer, 1-5)
8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
10. hour: The hour of the day when the user was surveyed (integer, 0-23)
11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)
Python script used to generate the data:
import random
import csv
# Set the number of examples to generate
numExamples = 100000
with open("t...
Rates of phenotypic evolution are central to many issues in paleontology, but traditional rate metrics such as darwins or haldanes are seldom used because of their strong dependence on interval length. In this paper, I argue that rates are usefully thought of as model parameters that relate magnitudes of evolutionary divergence to elapsed time. Starting with models of directional evolution, random walks, and stasis, I derive for each a reasonable rate metric. These metrics can be linked to existing approaches in evolutionary biology, and simulations show that they can be estimated accurately at any temporal resolution via maximum likelihood, but only when that metric's underlying model is true. The estimation of generational rates of a random walk under realistic paleontological conditions is compared with simulations to that of a prominent alternative approach, Gingerich's LRI (log-rate, log-interval) method. Generational rates are estimated poorly by LRI; they often reflect sampling e...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
Patients with Parkinson’s disease (PD) have an increased risk of sarcopenia, which is expected to negatively affect gait, leading to poor clinical outcomes including falls. In this study, we investigated the gait patterns of patients with PD with and without sarcopenia (sarcopenia and non-sarcopenia groups, respectively) using an app-derived program and explored if gait parameters could be utilized to predict sarcopenia based on machine learning.
Methods
Clinical and sarcopenia profiles were collected from patients with PD at Hoehn and Yahr (HY) stage ≤ 2. Sarcopenia was defined based on the updated criteria of the Asian Working Group for Sarcopenia. The gait patterns of the patients with and without sarcopenia were recorded and analyzed using a smartphone application. The random forest model was applied to predict sarcopenia in patients with PD.
Results
Data from 38 patients with PD were obtained, among which 9 (23.7%) were with sarcopenia. Clinical parameters were comparable between the sarcopenia and non-sarcopenia groups. Among various clinical and gait parameters, the average range of motion of the hip joint showed the highest association with sarcopenia. Based on the random forest algorithm, the combined difference in knee and ankle angles from standing still before walking to the maximum angle during walking (Kneeankle_diff), the difference between the angle when standing still before walking and the maximum angle during walking for the ankle (Ankle_dif), and the min angle of the hip joint (Hip_min) were the top three features that best predict sarcopenia. The accuracy of this model was 0.949.
Conclusions
Using a smartphone app and machine learning techniques, our study revealed gait parameters that are associated with sarcopenia and that help predict sarcopenia in PD. Our study showed the potential application of advanced technology in clinical research.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Though we dream of the day when humans will first walk on Mars, these dreams remain in the distance. For now, we explore vicariously by sending robotic agents like the Curiosity rover in our stead. Though our current robotic systems are extremely capable, they lack perceptual common sense. This characteristic will be increasingly needed as we create robotic extensions of humanity to reach across the stars, for several reasons. First, robots can go places that humans cannot. If we manage to get a human on Mars by 2035, as predicted by the current NASA timeline, this will still represent a 60 year lag from the time of the first robotic lander. Second, while it is possible to replace common sense in robots with human teleoperated control to some extent, this becomes infeasible as the distance to the base planet and the associated radio signal delay increase. Finally, as we pack more and more sensors onboard, the fraction of data that can be sent back to earth decreases. Data triage (finding the few frames containing a curious object on a planet's surface out of terabytes of data) becomes more important.
In the last few years, research into a class of scalable unsupervised algorithms, also called deep learning algorithms, has blossomed, in part due to state of the art performance in a number of areas. A common thread among many recent deep learning algorithms is that they tend to represent the world in ways similar to how our brains represent the world. For example, thanks to decades of work by neuroscientists, we now know that in the V1 area of the visual cortex, the first region that visual information passes through after the retina, neurons tune themselves to respond to oriented edges and do so in a way that groups them together based on similarity. With this behavior as a goal, researchers set out to devise simple algorithms that reproduce this effect. It turns out that there are several. One, known as Topographic Independent Component Analysis, has each neuron start with random connections and then look for patterns that are statistically out of the ordinary. When it finds one, it locks onto this pattern, discouraging other neurons from duplicating its findings but simultaneously trying to group itself with other neurons that have learned patterns which are similar, but not identical.
My proposed research plan is to develop existing and new unsupervised learning algorithms of this type and apply them to a robotic system. Specifically, I will demonstrate a prototype system capable of (1) learning about itself and its environment and of (2) actively carrying out experiments to learn more about itself and its environment. Research will be kept focused by developing a system aimed at eventual deployment on an unmanned space mission. Key components of the project will include synthetic data experiments, experiments on data recorded from a real robot, and finally experiments with learning in the loop as the robot explores its environment and learns actively.
The unsupervised algorithms in question are applicable not only to a single domain, but to creating models for a wide range of applications. Thus, advances are likely to have far-reaching implications for many areas of autonomous space exploration. Tantalizing though this is, it is equally exciting that unsupervised learning is already finding application with surprisingly impressive performance right now, indicating great promise for near-term application to unmanned space exploration.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of the study was to model annual total nitrogen (TN) and total phosphorus (TP) concentrations at the national level using an ML approach. We used water quality data originating from the Environmental Monitoring Database KESE to train RF models for nutrient concentration prediction in 242 catchments across Estonia. A total of 82 environmental variables were used as predictors in the models. In order to yield the best results, a feature selection strategy along with hyperparameter optimization was performed when building the models. The models are applicable for predicting nutrient loads on an annual level, e.g. for the purpose of reporting national-level water quality statistics in regional projects, such as HELCOM. The results showed that this relatively basic RF modeling approach can achieve performance similar to process-based models. Moreover, these models are easier to reuse and apply on a larger scale, since the required inputs can be derived from freely available datasets (e.g. satellite imagery).
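A hedged R sketch of this kind of RF tuning workflow (ours, not the repository's pipeline), assuming a data frame catchments with a TN target column and the environmental predictors; column names are illustrative:

library(caret)
ctrl <- trainControl(method = "cv", number = 5)                 # cross-validation for tuning
rf_fit <- train(TN ~ ., data = catchments, method = "rf",
                trControl = ctrl,
                tuneGrid = expand.grid(mtry = c(5, 10, 20)),    # simple hyperparameter search
                importance = TRUE)
print(rf_fit)     # cross-validated performance for each mtry
varImp(rf_fit)    # predictor importance, useful for feature selection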
This repository contains the input data used for building the RF models and the files describing the modeling results.
The description of the files is given in the README.txt file.
Virro, H., Kmoch, A., Vainu, M. and Uuemaa, E., 2022. Random forest-based modeling of stream nutrients at national level in a data-scarce region. Science of The Total Environment, 840, p.156613.