35 datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
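The contrast between the two selection schemes can be sketched in a few lines of Python; the registration dates, activity values, and cutoff below are hypothetical stand-ins, not the paper's data.

```python
import random
from datetime import date

# Hypothetical registry: (registration_date, activity) pairs standing in for compounds.
records = [(date(2020, month, 1), float(month)) for month in range(1, 13)]

# Time-split selection: compounds registered before a cutoff date form the
# training set; later compounds are predicted "prospectively".
cutoff = date(2020, 10, 1)
train_time = [r for r in records if r[0] < cutoff]
test_time = [r for r in records if r[0] >= cutoff]

# Random selection: the conventional cross-validation split, for comparison.
random.seed(42)
shuffled = random.sample(records, len(records))
train_rand, test_rand = shuffled[:9], shuffled[9:]
```

Under time-split selection every test compound postdates every training compound, which is what makes the resulting R2 behave more like that of true prospective prediction.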

  2. Replication data for: "Split Decisions: Household Finance When a Policy...

    • dataverse.iza.org
    • dataverse.harvard.edu
    Updated Jul 11, 2024
    Cite
    Michael A. Clemens; Erwin R. Tiongson (2024). Replication data for: "Split Decisions: Household Finance When a Policy Discontinuity Allocates Overseas Work" [Dataset]. http://doi.org/10.7910/DVN/2DO8QP
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Research Data Center of IZA (IDSC)
    Authors
    Michael A. Clemens; Erwin R. Tiongson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Clemens, Michael A., and Erwin R. Tiongson (2017). "Split Decisions: Household Finance When a Policy Discontinuity Allocates Overseas Work." Review of Economics and Statistics 99:3, 531-543.

  3. Data from: Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice:...

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Mason, Georgia; Walker, Michael (2023). Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice: Validating a split-plot design that promotes refinement and reduction [Dataset]. https://search.dataone.org/view/sha256%3A2b1ace7be31b90c0a2cf6859c8ec9dc108595d64d1ead30a0bfe0477100a52a8
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Mason, Georgia; Walker, Michael
    Time period covered
    May 1, 2013 - Aug 1, 2013
    Description

    Validating a novel housing method for inbred mice: mixed-strain housing. The aim was to see whether this housing method affected strain-typical mouse phenotypes, whether variance in the data was affected, and how statistical power was increased through this split-plot design.

  4. Data from: Regression with Empirical Variable Selection: Description of a...

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
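As a rough illustration of the idea (not the authors' implementation), one can score every predictor over all subsets and then build the nested model series in order of decreasing support. The scoring rule below, summing R2 over every subset containing a variable, is a simplified stand-in for REVS's empirical-support calculation, and all data are synthetic.

```python
import itertools
import numpy as np

def r_squared(X, y):
    # Least-squares fit with an intercept; returns the coefficient of determination.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=60)

# "Empirical support" stand-in: sum R^2 over every subset containing each variable.
p = X.shape[1]
support = np.zeros(p)
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):
        r2 = r_squared(X[:, list(subset)], y)
        for j in subset:
            support[j] += r2

# Nested model series: add variables in order of decreasing support, giving
# only p candidate models to compare post hoc.
order = list(np.argsort(-support))
models = [order[: i + 1] for i in range(p)]
```

The key property, matching the description above, is that the number of candidate models equals the number of predictors and each model nests the previous one.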

  5. Val split & vocab file

    • kaggle.com
    zip
    Updated Jul 6, 2024
    Cite
    Devi Hemamalini R (2024). Val split & vocab file [Dataset]. https://www.kaggle.com/datasets/devihemamalinir/val-split-and-vocab-file
    Explore at:
    Available download formats: zip (1603266139 bytes)
    Dataset updated
    Jul 6, 2024
    Authors
    Devi Hemamalini R
    Description

    Dataset

    This dataset was created by Devi Hemamalini R

    Contents

  6. haoranxu_ALMA-13B-R-details

    • huggingface.co
    Updated Jul 30, 2025
    + more versions
    Cite
    Open LLM Leaderboard (2025). haoranxu_ALMA-13B-R-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/haoranxu_ALMA-13B-R-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of haoranxu/ALMA-13B-R

    Dataset automatically created during the evaluation run of model haoranxu/ALMA-13B-R. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/haoranxu_ALMA-13B-R-details.

  7. CohereForAI_c4ai-command-r-plus-08-2024-details

    • huggingface.co
    Updated Jul 30, 2025
    + more versions
    Cite
    Open LLM Leaderboard (2025). CohereForAI_c4ai-command-r-plus-08-2024-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/CohereForAI_c4ai-command-r-plus-08-2024-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of CohereForAI/c4ai-command-r-plus-08-2024

    Dataset automatically created during the evaluation run of model CohereForAI/c4ai-command-r-plus-08-2024. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/CohereForAI_c4ai-command-r-plus-08-2024-details.

  8. Data from: FFT-split-operator code for solving the Dirac equation in 2+1...

    • elsevier.digitalcommonsdata.com
    Updated Jun 1, 2008
    Cite
    Guido R. Mocken (2008). FFT-split-operator code for solving the Dirac equation in 2+1 dimensions [Dataset]. http://doi.org/10.17632/43v3vvkwwf.1
    Explore at:
    Dataset updated
    Jun 1, 2008
    Authors
    Guido R. Mocken
    License

    https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/

    Description

    Abstract The main part of the code presented in this work represents an implementation of the split-operator method [J.A. Fleck, J.R. Morris, M.D. Feit, Appl. Phys. 10 (1976) 129-160; R. Heather, Comput. Phys. Comm. 63 (1991) 446] for calculating the time evolution of Dirac wave functions. It allows one to study the dynamics of electronic Dirac wave packets under the influence of any number of laser pulses and their interaction with any number of charged ion potentials. The initial wave function can be eith...

    Title of program: Dirac++ or (abbreviated) d++ Catalogue Id: AEAS_v1_0

    Nature of problem The relativistic time evolution of wave functions according to the Dirac equation is a challenging numerical task. Especially for an electron in the presence of high intensity laser beams and/or highly charged ions, this type of problem is of considerable interest to atomic physicists.

    Versions of this program held in the CPC repository in Mendeley Data AEAS_v1_0; Dirac++ or (abbreviated) d++; 10.1016/j.cpc.2008.01.042

    This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)
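The split-operator idea itself is easy to demonstrate on a toy problem. The sketch below applies Strang splitting with FFTs to a one-dimensional Schrödinger wave packet in a harmonic potential — a deliberately simplified stand-in for the Dirac++ code, with made-up grid parameters.

```python
import numpy as np

# Hypothetical grid and initial Gaussian wave packet (toy parameters).
N, L = 256, 40.0
x = np.linspace(-L / 2, L / 2, N, endpoint=False)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(N, d=dx)

psi = np.exp(-x**2) * np.exp(1j * 1.5 * x)
psi /= np.sqrt(np.sum(np.abs(psi)**2) * dx)

V = 0.5 * x**2   # harmonic potential, standing in for the laser/ion potentials
dt = 0.01

def split_step(psi, steps):
    # Strang splitting: half potential kick, full kinetic step in k-space, half kick.
    half_kick = np.exp(-0.5j * V * dt)
    drift = np.exp(-0.5j * (k**2) * dt)
    for _ in range(steps):
        psi = half_kick * psi
        psi = np.fft.ifft(drift * np.fft.fft(psi))
        psi = half_kick * psi
    return psi

psi_t = split_step(psi, 200)
norm = np.sum(np.abs(psi_t)**2) * dx
```

Since every factor applied is unitary, the wave-function norm is conserved to machine precision — the standard sanity check for a split-operator propagator.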

  9. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    Available download formats: zip (629100 bytes)
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: to predict the prices of houses in the City of Melbourne.
    • Approach: decision tree and random forest.
    • Data cleaning:
    • The 'Date' column is read in as a character vector and is converted to a date vector using the 'lubridate' library.
    • A new column, 'age', is created, since the age of a house can be a factor in its price: the year is extracted from the 'Date' column and the 'YearBuilt' column is subtracted from it.
    • 11,566 records with missing values are removed.
    • Columns that are not significant are dropped: 'X', 'Suburb', 'Address' (zipcode is kept, as it serves the purpose of suburb and address), 'Type', 'Method', 'SellerG', 'Date', 'Car', 'YearBuilt', 'CouncilArea', and 'RegionName'.
    • The data are split into 'train' and 'test' sets in an 80/20 ratio using the sample function.
    • The libraries 'rpart', 'rpart.plot', 'rattle', and 'RColorBrewer' are loaded.
    • A decision tree is fitted with the rpart function, with 'Price' as the dependent variable.
    • The average price across the 5,464 houses is $1,084,349.
    • Where the building area is less than 200.5, the average price for 4,582 houses is $931,445; where the building area is less than 200.5 and the age of the building is less than 67.5 years, the average price for 3,385 houses is $799,299.60.
    • The highest average price, $4,801,538 across 13 houses, occurs where the distance is less than 5.35 and the building area is greater than 280.5.
    • The caret package is used to tune the complexity parameter; the optimum found is 0.01, with an RMSE of 445,197.9.
    • The 'Metrics' library gives an RMSE of $392,107, a MAPE of 0.297 (roughly 70% accuracy), and an MAE of $272,015.40.
    • 'Postcode', longitude, and building area are the most important variables in the decision tree.
    • test$Price holds the actual prices and test$predicted the predicted prices for six particular houses.
    • A random forest with default parameters is then fitted on the training data.
    • Building area, age of the house, and distance are the most important variables affecting the price of a house in the random forest.
    • With the default parameters, the RMSE is $250,426.20, the MAPE is 0.147 (roughly 85% accuracy), and the MAE is $151,657.70.
    • The error flattens between 100 and 200 trees, with almost no further reduction thereafter, so ntree = 200 would suffice.
    • Tuning the model shows that mtry = 3 has the lowest out-of-bag error.
    • With the caret package and 5-fold cross-validation, the RMSE is $252,216.10, the MAPE is 0.146 (roughly 85% accuracy), and the MAE is $151,669.40.
    • We can conclude that the random forest gives more accurate results than the decision tree.
    • In the random forest, the default ntree = 500 gives a lower RMSE and MAPE than ntree = 200, so the default parameters are kept.
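The cleaning-and-splitting steps above come from an R workflow; a minimal Python/pandas sketch of the same preprocessing, using hypothetical column values rather than the actual Melbourne data, looks like this:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Melbourne housing data (hypothetical values).
df = pd.DataFrame({
    "Date": ["03/09/2016", "04/02/2017", "10/12/2017", "13/08/2016"] * 25,
    "YearBuilt": [1970, 2000, np.nan, 1995] * 25,
    "Price": np.linspace(4e5, 2e6, 100),
})

# Parse the character date column and derive the age of each house.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df["Age"] = df["Date"].dt.year - df["YearBuilt"]

# Drop records with missing values, then split 80/20 by random sampling.
clean = df.dropna().reset_index(drop=True)
train = clean.sample(frac=0.8, random_state=1)
test = clean.drop(train.index)
```

As in the R version, the split is a plain random sample; the derived age column survives into both partitions.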
  10. The Pearson correlation coefficients (r) of diversity measures based on...

    • datasetcatalog.nlm.nih.gov
    Updated Dec 4, 2024
    + more versions
    Cite
    Abhari, Niloufar; Tupper, Paul; Mooers, Arne; Colijn, Caroline (2024). The Pearson correlation coefficients (r) of diversity measures based on heterozygosity and split system diversity applied on subsets of Atlantic salmon populations with size k = 2, 3, and 4. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001353332
    Explore at:
    Dataset updated
    Dec 4, 2024
    Authors
    Abhari, Niloufar; Tupper, Paul; Mooers, Arne; Colijn, Caroline
    Description

    The Pearson correlation coefficients (r) of diversity measures based on heterozygosity and split system diversity applied on subsets of Atlantic salmon populations with size k = 2, 3, and 4.
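For reference, the r reported here is the standard product-moment correlation; a dependency-free sketch (with illustrative values only, not the salmon data):

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson product-moment correlation coefficient, stdlib only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Perfectly linearly related measures give r = 1 (or -1 for a decreasing relationship), which is the yardstick against which the heterozygosity and split-system-diversity measures are compared.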

  11. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 13, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations of this dataset include:
    - All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest-point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
    - Temperature data were not extracted from satellite images with more than 90% cloud cover.
    - Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing these limitations:
    - Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
    - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
    - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    - Handling waterbodies split between multiple tiles: these waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    Files in this release:
    - "year_byscene=XXXX.zip" - temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody, by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
    - "year=XXXX.zip" - summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    - "example_script_for_using_parquet.R" - code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    - "nhd_HUC04s_ingrid.csv" - a cross-walk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
    - "site_id_tile_hv_crosswalk.csv" - a cross-walk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid; it also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
    - "lst_grid.png" - a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
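The cloud-fraction filter and the hive-style partition paths described above can be handled with a few lines of plain Python; the function names below are illustrative, not part of the data release.

```python
def percent_cloud(wb_dswe9_pixels, wb_dswe1_pixels):
    # Cloud fraction over a waterbody, as defined in the release notes:
    # cloud pixels / (cloud pixels + water pixels).
    return wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)

def parse_partition_path(path):
    # Pull the key=value partition levels out of a nested .parquet file path,
    # e.g. "year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet".
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts
```

A caller could, for example, keep only rows where percent_cloud(...) is below 0.1, and use the parsed tile_hv key to group sections of waterbodies split across tiles.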

  12. Data from: Long-term spatial memory, across large spatial scales, in...

    • data.niaid.nih.gov
    Updated May 30, 2023
    Cite
    Priscila A Moura; Fletcher J Young; Monica Monllor; Marcio Z Cardoso; Stephen H Montgomery (2023). Long-term spatial memory, across large spatial scales, in Heliconius butterflies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7985235
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset provided by
    Departamento de Ecologia, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil
    Departamento de Ecologia, Universidade Federal do Rio Grande do Norte, Natal, RN, Brazil
    School of Biological Sciences, University of Bristol, Bristol, UK
    Authors
    Priscila A Moura; Fletcher J Young; Monica Monllor; Marcio Z Cardoso; Stephen H Montgomery
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data accompanying "Long-term spatial memory, across large spatial scales, in Heliconius butterflies", Current Biology 2023:

    exp1.csv. Behavioural data from experiment 1.

    exp2.csv. Behavioural data from experiment 2.

    exp3.csv. Behavioural data from experiment 3.

    Exp1&2.csv. Behavioural data comparing experiment 1 and 2.

    Exp1byDay.csv. Behavioural data for experiment 1 split by day.

    Exp2byDay.csv. Behavioural data for experiment 2 split by day.

    Exp3byDay.csv. Behavioural data for experiment 3 split by day.

    exp1.R. R code for experiment 1 analysis.

    exp2.R. R code for experiment 2 analysis.

    exp3.R. R code for experiment 3 analysis.

    exp1vsExp2.R. R code for comparing experiment 1 and 2.

  13. Helpful Life Tips from Reddit Dataset (13K Tips)

    • kaggle.com
    zip
    Updated Oct 1, 2023
    Cite
    asaniczka (2023). Helpful Life Tips from Reddit Dataset (13K Tips) [Dataset]. https://www.kaggle.com/datasets/asaniczka/helpful-life-tips-from-reddit-dataset-13k-tips
    Explore at:
    Available download formats: zip (4186644 bytes)
    Dataset updated
    Oct 1, 2023
    Authors
    asaniczka
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Discover truly valuable life tips shared by real humans.

    About the Dataset:

    Reddit is a treasure trove of genuine life experiences from millions of people. Subreddits like r/lifeProTips and r/YouShouldKnow are well-known for containing some of the best and most practical tips that anyone can apply to their life.

    This dataset is a cleaned version of the split reddit dump by u/Watchful1.

    Each row in the dataset contains a helpful life tip.

    Interesting Task Ideas:

    1. Develop a web app that presents users with an interesting tip each day.
    2. Explore the data to determine the most popular types of tips.
    3. Build a recommendation system that suggests relevant tips based on specific life situations or topics.
    4. Develop AI-powered models that generate useful life tips using the examples in the dataset.
    5. Analyze popular life topics and their corresponding tips to uncover patterns and common themes.

    If you find this dataset valuable, don't forget to hit the upvote button! 😊💝

    Check out my other datasets

    Gender Wage Gap in the USA

    USA Hispanic-White Wage Gap Dataset

    USA Unemployment Rates by Demographics & Race

    USA Wage Comparison for College vs. High School

    Employment-to-Population Ratio for USA

  14. Dataset and R code: Genetic diversity of lion populations in Kenya:...

    • search.dataone.org
    • datadryad.org
    Updated Jul 28, 2025
    Cite
    Mumbi Chege (2025). Dataset and R code: Genetic diversity of lion populations in Kenya: evaluating past management practices and recommendations for future conservation actions by Chege M et al. [Dataset]. http://doi.org/10.5061/dryad.s4mw6m9d8
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Mumbi Chege
    Description

    The decline of lions (Panthera leo) in Kenya has raised conservation concerns about their overall population health and long-term survival. This study aimed to assess the genetic structure, differentiation, and diversity of lion populations in the country, while considering the influence of past management practices. Using a lion-specific Single Nucleotide Polymorphism (SNP) panel, we genotyped 171 individuals from 12 populations representative of areas with permanent lion presence. Our results revealed a distinct genetic pattern with pronounced population structure, confirmed a north-south split, and found no indication of inbreeding in any of the tested populations. Differentiation seems to be primarily driven by geographical barriers, human presence, and climatic factors, but management practices may have also affected the observed patterns. Notably, the Tsavo population displayed evidence of admixture, perhaps attributable to its geographic location as a suture zone, vast size, or to p...

    This dataset was obtained from 12 Kenyan lion populations. After DNA extraction, SNP genotyping was performed using an allele-specific KASP technique. The attached datasets include the .txt and .str versions of the autosomal SNPs to aid in reproducing the results.

    # Dataset and R code associated with the publication entitled "Genetic diversity of lion populations in Kenya: evaluating past management practices and recommendations for future conservation actions" by Chege M et al.

    https://doi.org/10.5061/dryad.s4mw6m9d8

    We provide the following description of the dataset and scripts for analysis carried out in R. We have split the data and scripts for ease of reference, i.e.:

    1.) Script 1: titled 'Calc_He_Ho_Ar_Fis'. For calculating the genetic diversity indices, i.e. allelic richness (AR), private alleles (AP), inbreeding coefficients (FIS), and expected (HE) and observed (HO) heterozygosity. This script uses:

    • the 'data_HoHeAr.txt' dataset. This dataset has information on individual samples, including their geographical area (population) of origin and the corresponding 335 autosomal single nucleotide polymorphism (SNP) reads.

    • 'shompole2.txt': this bears the dataset from the Shompol...

  15. Video game pricing analytics dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivi Deveshwar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The review dataset for 3 video games - Call of Duty : Black Ops 3, Persona 5 Royal and Counter Strike: Global Offensive was taken through a web scrape of SteamDB [https://steamdb.info/] which is a large repository for game related data such as release dates, reviews, prices, and more. In the initial scrape, each individual game has two files - customer reviews (Count: 100 reviews) and price time series data.

    To obtain data on the reviews of the selected video games, we performed web scraping using R software. The customer reviews dataset contains the date that each review was posted and the review text, while the price dataset contains the date that the price was changed and the price on that date. To clean and prepare the data, we first sectioned the data in Excel. After scraping, our csv file fits each review in one row with the date; we split the data, separating date and review into their own columns. Scraping the prices already separated price and date, so after the separation we just made sure that every file had matching column names.

    Afterwards, we use R to finish the cleaning. Each game has a separate file for prices and reviews, so each price series is converted into a continuous time series by extending the previously available price forward to each date. The price dataset is then combined with its respective review dataset in R on the common date column using a left join. The resulting dataset for each game contains four columns: game name, date, review, and price. From there, we allow the user to select the game they would like to view.
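The forward-fill-and-join step was done in R; an equivalent pandas sketch, with made-up dates and prices, looks like this:

```python
import pandas as pd

# Hypothetical stand-ins for one game's scraped files.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-04"]),
    "price": [59.99, 49.99],
})
reviews = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-05"]),
    "review": ["great", "good"],
})

# Extend each price forward until the next change, producing a daily series.
daily = (prices.set_index("date")
               .reindex(pd.date_range("2023-01-01", "2023-01-06", freq="D"))
               .ffill()
               .rename_axis("date")
               .reset_index())

# Left join the reviews onto the continuous price series on the common date column.
merged = reviews.merge(daily, on="date", how="left")
```

Each review row now carries the price that was in effect on the day it was posted, mirroring the four-column per-game table described above.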

  16. riiid-group-dataset

    • kaggle.com
    zip
    Updated Dec 29, 2020
    Cite
    tomoo inubushi (2020). riiid-group-dataset [Dataset]. https://kaggle.com/tomooinubushi/riiidgroupdataset
    Explore at:
    zip (2093413644 bytes). Available download formats
    Dataset updated
    Dec 29, 2020
    Authors
    tomoo inubushi
    Description

    import pandas as pd, numpy as np, seaborn as sns
    from sklearn.model_selection import GroupShuffleSplit
    import joblib

    train = pd.read_csv('/home/petmed/inu/kaggle/riid/train.csv',
                        dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32',
                               'content_id': 'int16', 'content_type_id': 'int8',
                               'task_container_id': 'int16', 'user_answer': 'int8',
                               'answered_correctly': 'int8',
                               'prior_question_elapsed_time': 'float32',
                               'prior_question_had_explanation': 'boolean'})
    train = train[train.content_type_id == False]  # keep question rows, drop lectures
    train = train.sort_values(['timestamp'], ascending=True)
    train.reset_index(drop=True, inplace=True)
    train['timestamp'] = (train['timestamp'] / 1000).astype('int32')  # ms -> s
    train['prior_question_elapsed_time'] = train['prior_question_elapsed_time'].fillna(25439.41)
    train['prior_question_had_explanation'] = train['prior_question_had_explanation'].fillna(False).astype('int8')

    # Bundle each user's history into one tuple of per-column arrays.
    cols = ['user_id', 'timestamp', 'content_id', 'answered_correctly',
            'prior_question_elapsed_time', 'prior_question_had_explanation']

    def group_tuple(r):
        return (r['timestamp'].values, r['content_id'].values, r['answered_correctly'].values,
                r['prior_question_elapsed_time'].values, r['prior_question_had_explanation'].values)

    train_group = train[cols].groupby('user_id').apply(group_tuple)
    joblib.dump(train_group, "/home/petmed/inu/kaggle/riid/train_group.pkl.zip")

    # 10% subsample, split by user so each user's rows stay together.
    reduced_train_size = 0.1
    train_idx, test_idx = next(GroupShuffleSplit(n_splits=1, train_size=reduced_train_size,
                                                 random_state=42).split(train, groups=train.user_id))
    train_sub = train.iloc[train_idx]
    train_group = train_sub[cols].groupby('user_id').apply(group_tuple)
    joblib.dump(train_group, "/home/petmed/inu/kaggle/riid/train_group01.pkl.zip")

    # 50% subsample, same procedure.
    reduced_train_size = 0.5
    train_idx, test_idx = next(GroupShuffleSplit(n_splits=1, train_size=reduced_train_size,
                                                 random_state=42).split(train, groups=train.user_id))
    train_sub = train.iloc[train_idx]
    train_group = train_sub[cols].groupby('user_id').apply(group_tuple)
    joblib.dump(train_group, "/home/petmed/inu/kaggle/riid/train_group05.pkl.zip")
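The GroupShuffleSplit call above partitions rows by user_id so that no user straddles the train/test boundary. The same guarantee can be sketched with the standard library alone (toy data with hypothetical user ids, not part of the dataset):

```python
import random

def group_shuffle_split(groups, train_size, seed=42):
    """Split row indices so every group lands wholly in train or test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = round(train_size * len(unique))
    train_groups = set(unique[:n_train])
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    test_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, test_idx

user_ids = [1, 1, 2, 2, 3, 3, 4, 4]  # hypothetical user_id column
train_idx, test_idx = group_shuffle_split(user_ids, train_size=0.5)

# No user appears on both sides of the split.
train_users = {user_ids[i] for i in train_idx}
test_users = {user_ids[i] for i in test_idx}
assert not train_users & test_users
```

Shuffling the unique group labels (rather than the rows) is what keeps each user's full history intact on one side, which matters here because the downstream model consumes per-user sequences.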

  17. d

    Data from: imageseg: An R package for deep learning-based image segmentation...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 6, 2022
    Cite
    Jürgen Niedballa; Jan Axtner; Timm Döbert; Andrew Tilker; An Nguyen; Seth Wong; Christian Fiderer; Marco Heurich; Andreas Wilting (2022). imageseg: An R package for deep learning-based image segmentation [Dataset]. http://doi.org/10.5061/dryad.x0k6djhnj
    Explore at:
    zip. Available download formats
    Dataset updated
    Aug 6, 2022
    Dataset provided by
    Dryad
    Authors
    Jürgen Niedballa; Jan Axtner; Timm Döbert; Andrew Tilker; An Nguyen; Seth Wong; Christian Fiderer; Marco Heurich; Andreas Wilting
    Time period covered
    Jul 19, 2022
    Description
    1. Convolutional neural networks (CNNs) and deep learning are powerful and robust tools for ecological applications, and are particularly well suited to image data. Image segmentation (the classification of all pixels in an image) is one such application and can, for example, be used to assess forest structural metrics. While CNN-based image segmentation methods for such applications have been suggested, widespread adoption in ecological research has been slow, likely due to the technical difficulty of implementing CNNs and the lack of toolboxes for ecologists.
    2. Here, we present R package imageseg which implements a CNN-based workflow for general-purpose image segmentation using the U-Net and U-Net++ architectures in R. The workflow covers data (pre)processing, model training, and predictions. We illustrate the utility of the package with image recognition models for two forest structural metrics: tree canopy density and understory vegetation density. We trained the models using large and dive...
  18. a

    Data from: 15 9 3

    • chatham-county-planning-subdivisions-and-rezonings-chathamncgis.hub.arcgis.com
    Updated Apr 16, 2024
    + more versions
    Cite
    Chatham County GIS Portal (2024). 15 9 3 [Dataset]. https://chatham-county-planning-subdivisions-and-rezonings-chathamncgis.hub.arcgis.com/datasets/15-9-3
    Explore at:
    Dataset updated
    Apr 16, 2024
    Dataset authored and provided by
    Chatham County GIS Portal
    Description

    Attachment regarding a request by Strata Solar for a Conditional Use Permit on Parcel No. 12233, located off US 64 W, Hickory Mountain Township, for a solar farm on approximately 42 acres. The parcel is split between R-1 zoning and unzoned land. The R-1 portion, approximately 23.3 acres, is the part subject to this CUP request.

  19. Data from: A split sex ratio in solitary and social nests of a facultatively...

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    Updated May 31, 2022
    Cite
    Adam R. Smith; Karen M. Kapheim; Callum J. Kingwell; William T. Wcislo; Adam R. Smith; Karen M. Kapheim; Callum J. Kingwell; William T. Wcislo (2022). Data from: A split sex ratio in solitary and social nests of a facultatively social bee [Dataset]. http://doi.org/10.5061/dryad.62dt334
    Explore at:
    Dataset updated
    May 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Adam R. Smith; Karen M. Kapheim; Callum J. Kingwell; William T. Wcislo; Adam R. Smith; Karen M. Kapheim; Callum J. Kingwell; William T. Wcislo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A classic prediction of kin selection theory is that a mixed population of social and solitary nests of haplodiploid insects should exhibit a split sex ratio among offspring: female-biased in social nests, male-biased in solitary nests. Here we provide the first evidence of a solitary-social split sex ratio, using the sweat bee Megalopta genalis (Halictidae). Data from 2502 offspring collected from naturally occurring nests across six years, spanning the range of the M. genalis reproductive season, show that despite significant yearly and seasonal variation, the offspring sex ratio of social nests is consistently more female-biased than that of solitary nests. This suggests that split sex ratios may facilitate the evolutionary origins of cooperation based on reproductive altruism via kin selection.

  20. Runtime of implementations on Pfam seed and full.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Samantha Petti; Sean R. Eddy (2023). Runtime of implementations on Pfam seed and full. [Dataset]. http://doi.org/10.1371/journal.pcbi.1009492.t001
    Explore at:
    xls. Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Samantha Petti; Sean R. Eddy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The runtime benchmarks were obtained by running each algorithm on the seed and full multi-MSAs Pfam-A.seed and Pfam-A.full on 2 cores with 8 GB RAM for the seed alignments and on 3 cores with 12 GB RAM for the full alignments. We did not compute the maximum runtime of the Blue algorithm; the algorithm failed to terminate within 6 days for 34 families.
