21 datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
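    To make the two selection schemes concrete, here is a minimal generic R sketch of time-split versus random selection (illustrative only, not the paper's code; it assumes a data frame d with an assay-date column date):

    # Time-split selection: train on older compounds, test on the newest ones,
    # mimicking prospective prediction of compounds made later.
    d <- d[order(d$date), ]
    cut <- floor(0.75 * nrow(d))
    train_time <- d[1:cut, ]
    test_time <- d[(cut + 1):nrow(d), ]

    # Random selection, for comparison (tends to give an optimistic R2):
    idx <- sample(nrow(d), cut)
    train_rand <- d[idx, ]
    test_rand <- d[-idx, ]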

  2. S171

    • zenodo.org
    tar
    Updated Oct 6, 2020
    Cite
    Martin Zurowietz (2020). S171 [Dataset]. http://doi.org/10.5281/zenodo.3603809
    Explore at:
    Available download formats: tar
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Zurowietz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A fully annotated subset of the SO242/2_171-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:

    • anemone
    • coral
    • crustacean
    • ipnops fish
    • litter
    • ophiuroid
    • other fauna
    • sea cucumber
    • sponge
    • stalked crinoid

    For a definition of the classes see [1].

    This dataset contains the following files:

    • annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
    • annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
    • images/: Directory that contains all the original image files.
    • dataset.json: JSON file that contains information about the dataset.
      • name: The name of the dataset.
      • images_dir: Name of the directory that contains the original image files.
      • metadata_file: Path to the CSV file that contains image metadata.
      • test_annotations_file: Path to the CSV file that contains the test annotations.
      • train_annotations_file: Path to the CSV file that contains the train annotations.
      • annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
      • crop_dimension: Edge length of an annotation or style patch in pixels.
    • metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.
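    A minimal R sketch for loading the archive via the dataset.json fields listed above (assumes the archive has been extracted to the working directory; uses the jsonlite package):

    library(jsonlite)

    info <- fromJSON("dataset.json")
    train <- read.csv(info$train_annotations_file)  # BIIGLE CSV annotation report, train split
    test <- read.csv(info$test_annotations_file)    # BIIGLE CSV annotation report, test split
    meta <- read.csv(info$metadata_file)            # per-image camera-to-seafloor distance
    cat("Dataset:", info$name, "- crop dimension:", info$crop_dimension, "px\n")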
  3. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Explore at:
    Available download formats: zip (121472 bytes)
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree, and Random Forest for this purpose.

    Steps involved:

    • Read the csv file.
    • Data cleaning: the variables Country and Status had character data types and had to be converted to factors. 2563 missing values were encountered, with the Population variable having the most (652). Rows with missing values were dropped before running the analysis.
    • Run linear regression: before running it, 3 variables (Country, Year, and Status) were dropped as they were not found to have much effect on the dependent variable, Life Expectancy. This left 19 variables (1 dependent and 18 independent).
    • We run the linear regression. Multiple R-squared is 83%, which means that the independent variables explain 83% of the variance in the dependent variable.
    • OUTLIER DETECTION: we check for outliers using the IQR and find 54. These outliers are removed and the regression is run again; multiple R-squared increases from 83% to 86%.
    • MULTICOLLINEARITY: we check for multicollinearity using the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with an absolute VIF above 5 should be removed. We find 6 such variables: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths are strongly collinear, so we drop Infant.deaths (which has the higher VIF).
    • When we run the linear regression again, the VIF of Under.five.deaths drops from 211.46 to 2.74, while the other VIF values barely change. Variable thinness1.19 is then dropped and the regression run once more; the absolute VIF of thinness5.9 falls from 7.61 to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
    • SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA: on the train data, multiple R-squared is 86% and the p-value is below alpha, i.e. the model is statistically significant. We use the train-data model to predict the test data and compute error metrics via library(Metrics).
    • In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.037. This indicates a prediction accuracy of about 96.3% (1 − MAPE).
    • MAE (Mean Absolute Error) is 2.55. This indicates that, on average, the predicted values deviate by approximately 2.55 years from the actual values.
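    A condensed R sketch of the workflow just described (a sketch under assumed column names from the Kaggle csv; not the author's exact script):

    library(car)      # vif()
    library(Metrics)  # rmse(), mape(), mae()

    d <- na.omit(read.csv("Life Expectancy Data.csv"))  # assumed filename
    d <- subset(d, select = -c(Country, Year, Status))

    set.seed(123)
    idx <- sample(nrow(d), 0.8 * nrow(d))
    train <- d[idx, ]
    test <- d[-idx, ]

    fit <- lm(Life.expectancy ~ ., data = train)
    summary(fit)$r.squared             # multiple R-squared
    sort(vif(fit), decreasing = TRUE)  # drop predictors with VIF > 5, then refit

    pred <- predict(fit, newdata = test)
    rmse(test$Life.expectancy, pred)
    mape(test$Life.expectancy, pred)
    mae(test$Life.expectancy, pred)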

    We use DECISION TREE MODEL for the analysis.

    • Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.
    • We use 5-fold cross-validation with CP (complexity parameter) set to 0.01.
    • In the Decision Tree model, RMSE (Root Mean Squared Error) is 3.06. This indicates that, on average, the predicted values have an error of 3.06 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035. This indicates a prediction accuracy of about 96.5% (1 − MAPE).
    • MAE (Mean Absolute Error) is 2.35. This indicates that, on average, the predicted values deviate by approximately 2.35 years from the actual values.

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest).
    • We use varImpPlot to find which variables are most and least significant. Income composition is the most important, followed by adult mortality; the least relevant independent variable is Population.
    • Predict life expectancy with the random forest model.
    • In Random Forest, RMSE (Root Mean Squared Error) is 1.73. This indicates that, on average, the predicted values have an error of 1.73 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.017. This indicates a prediction accuracy of about 98.3% (1 − MAPE).
    • MAE (Mean Absolute Error) is 1.14. This indicates that, on average, the predicted values deviate by approximately 1.14 years from the actual values.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.

  4. Data from: Spaceborne GNSS-R for Sea Ice Classification Using Machine Learning Classifiers

    • resodate.org
    Updated Dec 13, 2021
    Cite
    Yongchao Zhu; Tingye Tao; Jiangyang Li; Kegen Yu; Lei Wang; Xiaochuan Qu; Shuiping Li; Maximilian Semmling; Jens Wickert (2021). Spaceborne GNSS-R for Sea Ice Classification Using Machine Learning Classifiers [Dataset]. http://doi.org/10.14279/depositonce-12822
    Explore at:
    Dataset updated
    Dec 13, 2021
    Dataset provided by
    DepositOnce
    Technische Universität Berlin
    Authors
    Yongchao Zhu; Tingye Tao; Jiangyang Li; Kegen Yu; Lei Wang; Xiaochuan Qu; Shuiping Li; Maximilian Semmling; Jens Wickert
    Description

    The knowledge of Arctic sea ice coverage is of particular importance in studies of climate change. This study develops a new sea ice classification approach based on machine learning (ML) classifiers, analyzing spaceborne GNSS-R features derived from TechDemoSat-1 (TDS-1) data collected over open water (OW), first-year ice (FYI), and multi-year ice (MYI). A total of eight features extracted from GNSS-R observables collected over five months are used to classify OW, FYI, and MYI with the ML classifiers random forest (RF) and support vector machine (SVM) in a two-step strategy. First, a randomly selected 30% of the samples of the whole dataset is used as a training set to build classifiers that discriminate OW from sea ice. Performance is evaluated on the remaining 70% of the samples, validated against the sea ice type from the Special Sensor Microwave Imager Sounder (SSMIS) data provided by the Ocean and Sea Ice Satellite Application Facility (OSISAF). The overall accuracies of the RF and SVM classifiers are 98.83% and 98.60%, respectively, for distinguishing OW from sea ice. Then, the sea ice samples (FYI and MYI) are randomly split into training and test datasets. The features of the training set are used to train the FYI-MYI classifiers, which achieve overall accuracies of 84.82% (RF) and 71.71% (SVM). Finally, the features of each month are used in turn as training and testing sets to cross-validate the performance of the proposed classifier. The results indicate the strong sensitivity of GNSS signals to sea ice types and the great potential of ML classifiers for GNSS-R applications.
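    A schematic R version of the two-step strategy described above (data frame obs and columns type and the eight features are placeholders, not the authors' code):

    library(randomForest)
    set.seed(1)

    # Step 1: open water vs. sea ice, 30% train / 70% test.
    obs$ice <- factor(ifelse(obs$type == "OW", "OW", "ice"))
    idx <- sample(nrow(obs), 0.3 * nrow(obs))
    rf1 <- randomForest(ice ~ . - type, data = obs[idx, ])
    mean(predict(rf1, obs[-idx, ]) == obs$ice[-idx])  # overall accuracy on held-out 70%

    # Step 2: FYI vs. MYI on the sea-ice samples only.
    ice_only <- subset(obs, type != "OW", select = -ice)
    ice_only$type <- factor(ice_only$type)
    idx2 <- sample(nrow(ice_only), 0.3 * nrow(ice_only))
    rf2 <- randomForest(type ~ ., data = ice_only[idx2, ])
    mean(predict(rf2, ice_only[-idx2, ]) == ice_only$type[-idx2])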

  5. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at:
    Available download formats: zip (121674455 bytes)
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    # https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
    #################################

    ### Variables for downloaded files
    data.dir <- ' '
    train.file <- paste0(data.dir, 'training.csv')
    test.file <- paste0(data.dir, 'test.csv')
    #################################

    ### Load csv -- creates a data.frame, where each column can have a different type.
    d.train <- read.csv(train.file, stringsAsFactors = F)
    d.test <- read.csv(test.file, stringsAsFactors = F)

    ### In training.csv we have 7049 rows, each with 31 columns.
    ### The first 30 columns are keypoint locations, which R correctly identified as numbers.
    ### The last one is a string representation of the image.

    ### To look at samples of the data, uncomment this line:
    # head(d.train)

    ### Save the Image column as another variable, and remove it from the data frames.
    ### Assigning NULL to a column removes it from the dataframe.
    im.train <- d.train$Image
    d.train$Image <- NULL  # removes 'Image' from the dataframe
    im.test <- d.test$Image
    d.test$Image <- NULL   # removes 'Image' from the dataframe

    #################################
    # Each image is represented as a series of numbers stored in one string.
    # Convert these strings to integers:
    #   strsplit splits the string,
    #   unlist simplifies its output to a vector of strings,
    #   as.integer converts it to a vector of integers.
    as.integer(unlist(strsplit(im.train[1], " ")))
    as.integer(unlist(strsplit(im.test[1], " ")))

    ### Install and activate the appropriate libraries.
    ### The original tutorial targets Linux and OS X with a parallel backend;
    ### on Windows, replace all instances of %dopar% with %do%.
    install.packages('foreach')
    library("foreach", lib.loc="~/R/win-library/3.3")

    im.train <- foreach(im = im.train, .combine = rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    im.test <- foreach(im = im.test, .combine = rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    # The foreach loop evaluates the inner command for each element of im.train and
    # combines the results with rbind (combine by rows). Note that %do% evaluates
    # sequentially; the tutorial's %dopar% runs in parallel once a backend is registered.
    # im.train is now a matrix with 7049 rows (one per image) and 9216 columns (one per pixel).

    ### Save all four variables in a data.Rd file; they can be reloaded at any time:
    save(d.train, im.train, d.test, im.test, file='data.Rd')
    load('data.Rd')

    # Each image is a vector of 96*96 pixels (96*96 = 9216).
    # Convert these 9216 integers into a 96x96 matrix:
    im <- matrix(data = rev(im.train[1,]), nrow = 96, ncol = 96)
    # im.train[1,] returns the first row of im.train, i.e. the first training image.
    # rev reverses the resulting vector to match the interpretation of R's image function
    # (which expects the origin to be in the lower left corner).

    # To visualize the image we use R's image function:
    image(1:96, 1:96, im, col = gray((0:255)/255))

    # Let's color the coordinates for the eyes and nose:
    points(96 - d.train$nose_tip_x[1], 96 - d.train$nose_tip_y[1], col = "red")
    points(96 - d.train$left_eye_center_x[1], 96 - d.train$left_eye_center_y[1], col = "blue")
    points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")

    # Another good check is to see how variable our data is. For example, where are
    # the nose centers in the 7049 images? (this takes a while to run):
    for (i in 1:nrow(d.train)) {
      points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
    }

    # There are quite a few outliers -- they could be labeling errors. Looking at one
    # extreme example shows no labeling error, but not all faces are centered:
    idx <- which.max(d.train$nose_tip_x)
    im <- matrix(data = rev(im.train[idx,]), nrow = 96, ncol = 96)
    image(1:96, 1:96, im, col = gray((0:255)/255))
    points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")

    # One of the simplest things to try: compute the mean coordinates of each keypoint
    # in the training set and use those as the prediction for all images.
    colMeans(d.train, na.rm = T)

    # To build a submission file we apply these mean coordinates to the test instances:
    p <- matrix(data = colMeans(d.train, na.rm = T), nrow = nrow(d.test), ncol = ncol(d.train), byrow = T)
    colnames(p) <- names(d.train)
    predictions <- data.frame(ImageId = 1:nrow(d.test), p)
    head(predictions)

    # The expected submission format has one keypoint per row, which we can get with
    # the help of the reshape2 library:
    install.packages('reshape2')
    library(reshape2)

  6. Data for binary classification experiments

    • researchdata.tuwien.at
    zip
    Updated Feb 24, 2025
    Cite
    Markus Kattenbeck; Antonia Golab; Negar Alinaghi; Ioannis Giannopoulos (2025). Data for binary classification experiments [Dataset]. http://doi.org/10.48436/zjkky-pgs18
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Geoinformation, TU Wien
    Authors
    Markus Kattenbeck; Antonia Golab; Negar Alinaghi; Ioannis Giannopoulos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Research context

    This zip archive contains all the data and scripts necessary to reproduce the results of the following paper, co-authored by Markus Kattenbeck, Ioannis Giannopoulos, Negar Alinaghi, Antonia Golab, and Daniel R. Montello:

    Predicting spatial familiarity by exploiting head and eye movements during pedestrian navigation in the real world

    This paper will be published in Springer Nature Scientific Reports.

    File overview

    The structure of the archive is the following:

    • Folder "01_data" contains all the data files needed and a readme file describing the structure of each of these data files. These data files are:
      • lsp.csv [contains demographic data about participants]
      • matched_gaze_imu.csv [contains the segmented behavioral data, i.e. both gaze features and imu features]
      • matched_gaze_imu_feature_description.pdf [contains a description of the features contained in matched_gaze_imu.csv]
      • walking_dates.csv [contains an overview of the dates on which participants walked the familiar and unfamiliar routes]
      • users_polygons.csv [contains one or more polygons per participant within which they are familiar]
      • polygons_markers.csv [contains, per polygon, the locations of POIs with which participants reported being familiar]
      • user_routes.csv [contains the route each participant provided between a randomly selected pair of the POIs they gave for a given polygon]
    • Folder "02_scripts" contains the data analysis scripts; they are organized in two subfolders:
      • 01_ml_scripts: these are the scripts for the XGBoost classification; they are organized as two python files in which further instructions for use are given.
        • 80_20_code.py is the python file which runs the ML experiments using an 80/20 train/test split
        • L5O4T_code.py is the python file which runs the ML experiments leaving the full data of five participants per condition out as unseen test data (both split strategies are sketched after this file list)
        • requirements.txt states the used Python package versions
      • 02_r_scripts:
        • cleaned_script.Rmd: an R notebook which can easily be opened in RStudio and provides the analysis scripts for the descriptive statistics presented in the paper.
        • package_versions.txt states the used R package versions
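    Both split strategies can be sketched generically in R (the data frame df and columns user and condition are placeholders; the repository's actual experiments are the two Python scripts above):

    set.seed(7)

    # 80/20 row-wise split:
    idx <- sample(nrow(df), 0.8 * nrow(df))
    train <- df[idx, ]
    test <- df[-idx, ]

    # Leave the full data of five participants per condition out as unseen test data:
    held <- unlist(lapply(split(df$user, df$condition),
                          function(u) sample(unique(u), 5)))
    test_lpo <- df[df$user %in% held, ]
    train_lpo <- df[!df$user %in% held, ]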

    Licenses

    The code is licensed under MIT, the data is licensed under CC-BY.

  7. research_papers_short

    • huggingface.co
    Updated May 23, 2024
    Cite
    Sathish Kumar R (2024). research_papers_short [Dataset]. https://huggingface.co/datasets/pt-sk/research_papers_short
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2024
    Authors
    Sathish Kumar R
    Description

    Dataset Card

    This is a dataset containing ML ArXiv papers. The dataset is a version of the original one from CShorten, which is part of the ArXiv papers dataset from Kaggle. Three steps were applied to process the source data:

    • removal of useless columns;
    • a train-test split;
    • removal of ' ' and trimming of spaces at the sides of the text.

  8. R script for training and testing C5.0 models (Marcos criteria)

    • figshare.com
    txt
    Updated Oct 17, 2025
    Cite
    Fernando De la Garza-Salazar; Brian Egenriether (2025). R script for training and testing C5.0 models (Marcos criteria). [Dataset]. http://doi.org/10.1371/journal.pone.0334829.s011
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Fernando De la Garza-Salazar; Brian Egenriether
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R script for training and testing C5.0 models (Marcos criteria).

  9. Glaucoma Dataset: EyePACS-AIROGS-light-V2

    • kaggle.com
    zip
    Updated Mar 9, 2024
    Cite
    Riley Kiefer (2024). Glaucoma Dataset: EyePACS-AIROGS-light-V2 [Dataset]. https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2/code
    Explore at:
    Available download formats: zip (549533071 bytes)
    Dataset updated
    Mar 9, 2024
    Authors
    Riley Kiefer
    Description

    News: Now with a 10.0 Kaggle usability score: supplemental metadata.csv file added to dataset.

    Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. The dataset is split into training, validation, and test folders containing 4000 (~84%), 385 (~8%), and 385 (~8%) fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to make benchmarking glaucoma classification models in Kaggle easy. Please make a contribution in the code tab; I have created a template to make it even easier!
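    For example, after downloading and unzipping, the advertised layout can be verified with a few lines of R (the extraction folder name is an assumption; the split and class folder names follow the description above):

    root <- "eyepacs-airogs-light-v2"  # assumed extraction folder
    for (s in c("train", "validation", "test")) {
      for (cls in c("RG", "NRG")) {
        n <- length(list.files(file.path(root, s, cls)))
        cat(s, "/", cls, ":", n, "images\n")
      }
    }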

    Please cite the dataset and at least the first of my related works if you found this dataset useful!

    • Riley Kiefer. "EyePACS-AIROGS-light-V2". Kaggle, 2024, doi: 10.34740/KAGGLE/DSV/7802508.
    • Riley Kiefer. "EyePACS-AIROGS-light-V1". Kaggle, 2023, doi: 10.34740/kaggle/ds/3222646.
    • Riley Kiefer. "Standardized Multi-Channel Dataset for Glaucoma, v19 (SMDG-19)". Kaggle, 2023, doi: 10.34740/kaggle/ds/2329670
    • Steen, J., Kiefer, R., Ardali, M., Abid, M. & Amjadian, E. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications. Invest. Ophthalmol. Vis. Sci. 64, 384–384 (2023).
    • Amjadian, E., Ardali, M. R., Kiefer, R., Abid, M. & Steen, J. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection. Invest. Ophthalmol. Vis. Sci. 64, 392–392 (2023).
    • R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
    • Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
    • R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.
    • E. Amjadian, R. Kiefer, J. Steen, M. Abid, M. Ardali, "A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection". American Academy of Optometry. 2022.

    Improvements from v1:

    • Images are standardized according to the CROP methodology (remove the black background before resizing), following an ablation study of the image standardization methods applied to dataset v1 [3]. This method keeps more of the actual fundus foreground in the resultant image.
    • Increased the image resize dimensions from 256x256 pixels to 512x512 pixels. Reason: provides greater model input flexibility, detail, and size, and better supports the ONH-cropping models.
    • Added 3000 images from the Rotterdam EyePACS AIROGS dev set. Reason: more data samples can improve model generalizability.
    • Readjusted the train/val/test split. Reason: the validation and test split sizes were different.
    • Improved sampling from the source dataset. Reason: v1 NRG samples were not randomly selected.

    Drawbacks of Rotterdam EyePACS AIROGS: The largest drawback of the original dataset is its accessibility: it requires a long download and a large amount of storage, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). It also contains raw fundus images in their original dimensions; these often include a large amount of black background, and the dimensions are too large for machine-learning inputs. The proposed dataset addresses these concerns through image sampling and image standardization, which balance the dataset and reduce its size, respectively.

    Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].

    About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...

  10. Table1_Function-Wise Dual-Omics analysis for radiation pneumonitis prediction in lung cancer patients

    • frontiersin.figshare.com
    bin
    Updated Jun 16, 2023
    Cite
    Bing Li; Ge Ren; Wei Guo; Jiang Zhang; Sai-Kit Lam; Xiaoli Zheng; Xinzhi Teng; Yunhan Wang; Yang Yang; Qinfu Dan; Lingguang Meng; Zongrui Ma; Chen Cheng; Hongyan Tao; Hongchang Lei; Jing Cai; Hong Ge (2023). Table1_Function-Wise Dual-Omics analysis for radiation pneumonitis prediction in lung cancer patients.DOCX [Dataset]. http://doi.org/10.3389/fphar.2022.971849.s001
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Bing Li; Ge Ren; Wei Guo; Jiang Zhang; Sai-Kit Lam; Xiaoli Zheng; Xinzhi Teng; Yunhan Wang; Yang Yang; Qinfu Dan; Lingguang Meng; Zongrui Ma; Chen Cheng; Hongyan Tao; Hongchang Lei; Jing Cai; Hong Ge
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: This study investigates the impact of lung function on radiation pneumonitis prediction using a dual-omics analysis method.

    Methods: We retrospectively collected data from 126 stage III lung cancer patients treated with chemo-radiotherapy using intensity-modulated radiotherapy (IMRT), including pre-treatment planning CT images, radiotherapy dose distributions, and contours of organs and structures. Lung perfusion functional images were generated using a previously developed deep learning method. The whole lung (WL) volume was divided into function-wise lung (FWL) regions based on the lung perfusion functional images. A total of 5,474 radiomics features and 213 dose features (including dosiomics features and dose-volume histogram factors) were extracted from the FWL and WL regions, respectively. The radiomics features (R), dose features (D), and combined dual-omics features (RD) were used for the analysis in each lung region of WL and FWL, labeled as WL-R, WL-D, WL-RD, FWL-R, FWL-D, and FWL-RD. Feature selection was carried out using ANOVA, followed by a statistical F-test and a Pearson correlation test. Thirty train-test splits were used to evaluate the predictability of each group. The overall average area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, and f1-score were calculated to assess the performance of each group.

    Results: The FWL-RD group achieved a significantly higher average AUC than the WL-RD group in the training (FWL-RD: 0.927 ± 0.031, WL-RD: 0.849 ± 0.064) and testing cohorts (FWL-RD: 0.885 ± 0.028, WL-RD: 0.762 ± 0.053, p < 0.001). When using radiomics features only, the FWL-R group yielded a better classification result than the model trained with WL-R features in the training (FWL-R: 0.919 ± 0.036, WL-R: 0.820 ± 0.052) and testing cohorts (FWL-R: 0.862 ± 0.028, WL-R: 0.750 ± 0.057, p < 0.001). The FWL-D group obtained an average AUC of 0.782 ± 0.032, a better classification performance than the WL-D feature-based model (0.740 ± 0.028) in the training cohort, while no significant difference was observed in the testing cohort (FWL-D: 0.725 ± 0.064, WL-D: 0.710 ± 0.068, p = 0.54).

    Conclusion: Dual-omics features from different lung functional regions can improve the prediction of radiation pneumonitis for lung cancer patients under IMRT treatment. This function-wise dual-omics analysis method holds great promise for improving the prediction of radiation pneumonitis in lung cancer patients.

  11. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    Available download formats: zip (629100 bytes)
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: to predict the prices of houses in the City of Melbourne.
    • Approach: using Decision Tree and Random Forest.
    • Data cleaning:
    • The Date column is read as a character vector and is converted to a date vector using the 'lubridate' library.
    • We create a new column called age, since the age of a house can be a factor in its price: we extract the year from the 'Date' column and subtract the 'Year Built' column from it.
    • We remove 11566 records which have missing values.
    • We drop columns which are not significant, such as 'X', 'suburb', 'address' (we keep zipcode, as it serves the purpose of suburb and address), 'type', 'method', 'SellerG', 'date', 'Car', 'year built', 'Council Area', and 'Region Name'.
    • We split the data into 'train' and 'test' in an 80/20 ratio using the sample function.
    • Run the libraries 'rpart', 'rpart.plot', 'rattle', and 'RColorBrewer'.
    • Run a decision tree using the rpart function, with 'Price' as the dependent variable.
    • The average price for 5464 houses is $1,084,349.
    • Where the building area is less than 200.5, the average price for 4582 houses is $931,445. Where the building area is less than 200.5 and the age of the building is less than 67.5 years, the average price for 3385 houses is $799,299.6.
    • The highest average price, $4,801,538 across 13 houses, occurs where the distance is lower than 5.35 and the building area is greater than 280.5.
    • We use the caret package to tune the parameters; the optimal complexity parameter found is 0.01, with RMSE 445,197.9.
    • We use library(Metrics) to find the RMSE ($392,107), MAPE (0.297, i.e. roughly 70% accuracy by the 1 − MAPE convention), and MAE ($272,015.4).
    • The variables 'postcode', longitude, and building area are the most important.
    • test$Price gives the actual price and test$predicted the predicted price for six example houses.
    • We run random forest with its default parameters on the train data.
    • Variable importance indicates that 'Building Area', 'Age of the house', and 'Distance' are the variables that most affect the price of a house.
    • Based on the default parameters, RMSE is $250,426.2, MAPE is 0.147 (roughly 85.3% accuracy), and MAE is $151,657.7.
    • The error flattens out between 100 and 200 trees and shows almost no reduction thereafter, so we could choose ntree = 200.
    • We tune the model and find that mtry = 3 has the lowest out-of-bag error.
    • We use the caret package with 5-fold cross-validation.
    • RMSE is $252,216.10, MAPE is 0.146 (roughly 85.4% accuracy), and MAE is $151,669.4.
    • We can conclude that Random Forest gives more accurate results than Decision Tree.
    • In Random Forest, the default parameters (ntree = 500) give lower RMSE and MAPE than ntree = 200, so we proceed with those.
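    A compact R sketch of the decision-tree and random-forest steps above (the filename and column names are assumptions based on the Melbourne housing csv; not the author's exact script):

    library(rpart)
    library(randomForest)
    library(Metrics)

    d <- na.omit(read.csv("melbourne_housing.csv"))  # assumed filename

    set.seed(42)
    idx <- sample(nrow(d), 0.8 * nrow(d))
    train <- d[idx, ]
    test <- d[-idx, ]

    tree <- rpart(Price ~ ., data = train, cp = 0.01)
    rf <- randomForest(Price ~ ., data = train, ntree = 500, mtry = 3)

    rmse(test$Price, predict(tree, test)); mae(test$Price, predict(tree, test))
    rmse(test$Price, predict(rf, test));   mae(test$Price, predict(rf, test))
    varImpPlot(rf)  # which variables matter most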
  12. sickr-sts

    • huggingface.co
    Updated Oct 31, 2025
    Cite
    Yiheng Su (2025). sickr-sts [Dataset]. https://huggingface.co/datasets/Samsoup/sickr-sts
    Explore at:
    Dataset updated
    Oct 31, 2025
    Authors
    Yiheng Su
    Description

    Samsoup/sickr-sts

    This dataset is derived from mteb/sickr-sts (SICK-R style semantic textual similarity), which in MTEB is provided as a single split. This script shuffles that split deterministically and produces train / validation / test = 70% / 20% / 10%.

    Fields

    sentence1 — first sentence
    sentence2 — second sentence
    score — similarity / relatedness score (float32)

    Processing

    Input: single split from mteb/sickr-sts
    Shuffle with a fixed seed
    70/20/10 partition
    Keep only… See the full description on the dataset page: https://huggingface.co/datasets/Samsoup/sickr-sts.
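    The deterministic shuffle-and-partition step can be sketched generically in R (the dataset's own processing script is Python and is not reproduced here; d is a placeholder data frame):

    set.seed(42)  # fixed seed -> reproducible shuffle
    n <- nrow(d)
    perm <- sample(n)
    train <- d[perm[1:floor(0.7 * n)], ]
    val <- d[perm[(floor(0.7 * n) + 1):floor(0.9 * n)], ]
    test <- d[perm[(floor(0.9 * n) + 1):n], ]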

  13. Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-III

    • figshare.com
    xlsx
    Updated Jul 18, 2025
    Cite
    Ammar J. Abdlmutalib; Korhan Ayranci; Umair Bin Waheed; Hamad D. Alhajri; James A. MacEachern; Mohammed N. Al-Khabbaz (2025). Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-III. It confirms that all classes are represented, supporting fair performance evaluation despite observed precision drops. [Dataset]. http://doi.org/10.1371/journal.pone.0327738.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ammar J. Abdlmutalib; Korhan Ayranci; Umair Bin Waheed; Hamad D. Alhajri; James A. MacEachern; Mohammed N. Al-Khabbaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-III. It confirms that all classes are represented, supporting fair performance evaluation despite observed precision drops.

  14. Mathematics Dataset

    • github.com
    • opendatalab.com
    Updated Apr 3, 2019
    Cite
    DeepMind (2019). Mathematics Dataset [Dataset]. https://github.com/Wikidepia/mathematics_dataset_id
    Explore at:
    Dataset updated
    Apr 3, 2019
    Dataset provided by
    DeepMind (http://deepmind.com/)
    Description

    This dataset consists of mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This is designed to test the mathematical learning and algebraic reasoning skills of learning models.

    ## Example questions

     Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
     Answer: 4
     
     Question: Calculate -841880142.544 + 411127.
     Answer: -841469015.544
     
     Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
     Answer: 54*a - 30
    

    It contains 2 million (question, answer) pairs per module, with questions limited to 160 characters in length, and answers to 30 characters in length. Note the training data for each question type is split into "train-easy", "train-medium", and "train-hard". This allows training models via a curriculum. The data can also be mixed together uniformly from these training datasets to obtain the results reported in the paper. Categories:

    • algebra (linear equations, polynomial roots, sequences)
    • arithmetic (pairwise operations and mixed expressions, surds)
    • calculus (differentiation)
    • comparison (closest numbers, pairwise comparisons, sorting)
    • measurement (conversion, working with time)
    • numbers (base conversion, remainders, common divisors and multiples, primality, place value, rounding numbers)
    • polynomials (addition, simplification, composition, evaluating, expansion)
    • probability (sampling without replacement)
  15. Mask R-CNN Pedestrian Tracklets

    • kaggle.com
    zip
    Updated May 30, 2021
    Cite
    Petr Pulc (2021). Mask R-CNN Pedestrian Tracklets [Dataset]. https://www.kaggle.com/petrpulc/mask-rcnn-people-tracklets
    Explore at:
    Available download formats: zip (6018060706 bytes)
    Dataset updated
    May 30, 2021
    Authors
    Petr Pulc
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Why?

    Object tracking, or more precisely the re-identification of objects in video streams, relies more and more on deep convolutional and residual networks, and these require a lot of good training data. Moreover, we want to show that including the object mask in the alpha channel may offer additional benefits in object re-identification.

    What?

    The dataset was constructed by crunching image sequences from the Multiple Object Tracking Challenge 2016/7 dataset (the two editions differ only in provided detections and ground truth, neither of which is used here). As a bonus, I have taken a random high-resolution YouTube video of people walking around (youtu.be/NEfxRHeb-70) and extracted five tracklets from it. Mask R-CNN provides a proposal of the object mask, which is stored in the alpha channel.

    Files are organised similarly to the MARS dataset, one of the most prevalent datasets in object re-identification learning. A couple of notes:
    • Images are four-channel PNG (RGBA) with aspect ratio 1:2, the object centred in the bounding box and padded with zeros.
    • As opposed to MARS, each tracklet is considered a new sequence. This may be suboptimal, as the same person can appear in multiple tracklets.
    • The train/test split is approx. 50:50; IDs do not overlap.
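    Reading one such image in R and separating the mask from the colour channels is straightforward (the file path below is a placeholder, not an actual filename from the archive):

    library(png)

    img <- readPNG("tracklet_0001/frame_0001.png")  # H x W x 4 array, values in [0, 1]
    rgb <- img[, , 1:3]     # colour channels
    mask <- img[, , 4] > 0  # alpha channel: TRUE where Mask R-CNN proposed the object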

  16. OrbNet Denali Training Data

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    application/x-gzip
    Updated Jun 4, 2023
    Cite
    Anders S. Christensen; Sai Krishna Sirumalla; Zhuoran Qiao; Michael B. O'Connor; Daniel G. A. Smith; Feizhi Ding; Peter J. Bygrave; Animashree Anandkumar; Matthew Welborn; Frederick R. Manby; Thomas F. Miller III (2023). OrbNet Denali Training Data [Dataset]. http://doi.org/10.6084/m9.figshare.14883867.v2
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anders S. Christensen; Sai Krishna Sirumalla; Zhuoran Qiao; Michael B. O'Connor; Daniel G. A. Smith; Feizhi Ding; Peter J. Bygrave; Animashree Anandkumar; Matthew Welborn; Frederick R. Manby; Thomas F. Miller III
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OrbNet Denali Training Data

    This repository contains the data for the paper "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy". The data set consists of geometries of molecules and the corresponding energy labels calculated at the DFT and semi-empirical levels.

    Citation

    Anders S. Christensen (1,a), Sai Krishna Sirumalla (1,a), Zhuoran Qiao (2), Michael B. O'Connor (1), Daniel G. A. Smith (1), Feizhi Ding (1), Peter J. Bygrave (1), Animashree Anandkumar (3,4), Matthew Welborn (1), Frederick R. Manby (1), and Thomas F. Miller III (1,2), "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy" (2021). https://arxiv.org/abs/2107.00299

    a) Indicates equal contribution
    1) Entos, Inc., Los Angeles, CA 90027, USA
    2) Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
    3) Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA 91125, USA
    4) NVIDIA, Santa Clara, CA 95051, USA

    Contents

    The following files are included:

    Filename | Description | MD5 checksum
    denali_labels.tar.gz | .csv file with energy labels and other metadata | bc9b612f75373d1d191ce7493eebfd62
    denali_xyz_files.tar.gz | Archive with .xyz geometry files | edd35e95a018836d5f174a3431a751df

    Geometry data

    The geometries are stored in XYZ+ format, which is compatible with the standard .xyz format but additionally has the charge and multiplicity annotated in the comment (2nd) line. The coordinates are in units of Ångström. For example, a water molecule with a charge of 0 and a spin multiplicity of 1 (i.e. singlet) can be specified in this format as:

    3
    0 1
    O -1.08201 1.07900 -0.02472
    H -0.09268 1.08664 0.01745
    H -1.37137 1.24781 0.90715
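    A small R reader for this XYZ+ layout (a sketch; it assumes the comment line carries the charge followed by the multiplicity, as in the example above):

    read_xyzplus <- function(path) {
      lines <- readLines(path)
      natoms <- as.integer(lines[1])
      meta <- as.integer(strsplit(trimws(lines[2]), "\\s+")[[1]])
      atoms <- do.call(rbind, strsplit(trimws(lines[3:(2 + natoms)]), "\\s+"))
      list(charge = meta[1],
           multiplicity = meta[2],
           elements = atoms[, 1],
           coords = matrix(as.numeric(atoms[, 2:4]), ncol = 3))  # Angstrom
    }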

    The directory structure of the geometry data contained within denali_xyz_files.tar.gz is as follows:

    xyz_files/
    ├── mol_id1/
    │   ├── sample_id0.xyz
    │   ├── sample_id1.xyz
    │   ├── sample_id2.xyz
    │   ├── sample_id3.xyz
    │   └── sample_id4.xyz
    ├── mol_id2/
    │   ├── sample_id0.xyz
    │   ├── sample_id1.xyz
    │   ├── sample_id2.xyz
    │   └── sample_id3.xyz
    ├── ... etc

    Each mol_id uniquely identifies a molecule, with the various conformer geometries for that molecule stored in the corresponding folder. Those geometries are in turn identified by a unique sample_id. Grouping the geometries by mol_id is used in the OrbNet loss function; see Eqn. 3 in the paper. Note that not all molecules have multiple geometries.

    Training labels

    The training labels (i.e. the wB97X-D3/def2-TZVP and GFN1-xTB energies) and the training and test/validation splits are provided in the file denali_labels.csv, in units of Hartree. All molecules are singlet states. The .csv file contains the following columns:

    Column | Description
    sample_id | A unique hash generated from the QM input; also corresponds to the .xyz filename of that geometry
    subset | The data source for that geometry; refer to the paper for a detailed description of the various subsets
    mol_id | Identifier for the parent molecule
    test_set | True if the geometry is part of the test/validation set of neutral molecules
    test_set_plus | True if the geometry is part of the test/validation set of charged molecules
    prelim_1 | True if the geometry is part of the 10% OrbNet Denali training set
    training_set_plus | True if the geometry is part of the full OrbNet Denali training set
    charge | The charge of the molecule
    dft_energy | wB97X-D3/def2-TZVP energy calculated with Qcore 0.8.17, in Hartree
    xtb1_energy | GFN1-xTB energy calculated with Qcore 0.8.17, in Hartree

    The .csv file can be loaded in python, for example using Pandas.

  17. RRegrs study for Growth Yield

    • portalcientifico.sergas.gal
    • figshare.com
    Updated 2016
    Cite
    Munteanu, Cristian Robert (2016). RRegrs study for Growth Yield [Dataset]. https://portalcientifico.sergas.gal/documentos/668fc448b9e7c03b01bd8a9e
    Explore at:
    Dataset updated
    2016
    Authors
    Munteanu, Cristian Robert
    Description

    RRegrs study for Growth Yield for the original and corrected/filtered datasets: input training and test files, R scripts to split the datasets, and a plot for outlier removal.

  18. Fish dataset

    • kaggle.com
    zip
    Updated May 6, 2025
    Cite
    AbdElRahman16 (2025). Fish dataset [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/11111111111111111111
    Explore at:
    Available download formats: zip (20477 bytes)
    Dataset updated
    May 6, 2025
    Authors
    AbdElRahman16
    Description

    🔍 Dataset Overview:

    🐟 Species: Name of the fish species (e.g., Anabas testudineus)

    📏 Length: Length of the fish (in centimeters)

    ⚖️ Weight: Weight of the fish (in grams)

    🧮 W/L Ratio: Weight-to-length ratio of the fish

    🧠 Steps to Build the Prediction Model:

    📋 Data Preprocessing:

    1 - Handle Missing Values: Check for and handle any missing values appropriately using methods like:

    Imputation (mean/median for numeric data)

    Row or column removal (if data is too sparse)

    2 - Convert Data Types: Ensure numerical columns (Length, Weight, W/L Ratio) are in the correct numeric format.

    3 - Handle Categorical Variables: Convert the Species column into numerical format using:

    One-Hot Encoding

    Label Encoding

    🎯 Feature Selection:

    1 - Correlation Analysis: Use correlation heatmaps or statistical tests to identify the features most related to the target variable (e.g., Weight).

    2 - Feature Importance: Use tree-based models (like Random Forest) to determine which features are most predictive.

    🔍 Model Selection:

    1 - Algorithm Choice: Choose suitable machine learning algorithms such as:

    Linear Regression

    Decision Tree Regressor

    Random Forest Regressor

    Gradient Boosting Regressor

    2 - Model Comparison: Evaluate each model using metrics like:

    Mean Absolute Error (MAE)

    Mean Squared Error (MSE)

    R-squared (R²)

    🚀 Model Training and Evaluation:

    1 - Train the Model: Split the dataset into training and testing sets (e.g., an 80/20 split) and train the selected model(s) on the training set (a quick sketch follows below).

    2 - Evaluate the Model: Use the test set to assess model performance and fine-tune as necessary using grid search or cross-validation.

    This dataset and workflow are useful for exploring biometric relationships in fish and building regression models to predict weight based on length or species. Great for marine biology, aquaculture analytics, and educational projects.
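    A hedged R sketch of this workflow: an 80/20 split and a random-forest regressor predicting Weight (the filename is an assumption; column names follow the overview above):

    library(randomForest)

    fish <- read.csv("fish.csv")             # assumed filename
    fish$Species <- as.factor(fish$Species)  # encode the categorical species column

    set.seed(1)
    idx <- sample(nrow(fish), 0.8 * nrow(fish))
    train <- fish[idx, ]
    test <- fish[-idx, ]

    rf <- randomForest(Weight ~ Species + Length, data = train)
    pred <- predict(rf, test)
    c(MAE = mean(abs(pred - test$Weight)),
      MSE = mean((pred - test$Weight)^2),
      R2 = 1 - sum((pred - test$Weight)^2) / sum((test$Weight - mean(test$Weight))^2))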

    🐠 Happy modeling! 👍 Please upvote if you found this helpful!

    https://www.kaggle.com/code/abdelrahman16/fish-clustering-diverse-techniques

  19. EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 2, 2023
    Cite
    Rohan V. Koodli; Benjamin Keep; Katherine R. Coppess; Fernando Portela; Rhiju Das (2023). EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets. [Dataset]. http://doi.org/10.1371/journal.pcbi.1007059.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rohan V. Koodli; Benjamin Keep; Katherine R. Coppess; Fernando Portela; Rhiju Das
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets.

  20. Detailed breakdown of overfitting comparison of CARRoT output and the other models

    • plos.figshare.com
    txt
    Updated Oct 12, 2023
    Cite
    Alina Bazarova; Marko Raseta (2023). Detailed breakdown of overfitting comparison of CARRoT output and the other models. [Dataset]. http://doi.org/10.1371/journal.pone.0292597.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alina Bazarova; Marko Raseta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overfitting in terms of absolute/relative error, accuracy/AUROC, and accuracy only (for continuous, binary, and multinomial outcomes respectively), computed on both training and test sets, for different prediction methods on 43 datasets available in R, using the default 90%/10% training/validation split. The methods used are: CARRoT with EPV = 10; a model based on significant predictors only; a lasso-based model; and CARRoT with EPV = 10 plus an additional R2 constraint. (CSV)
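    The overfitting measure itself reduces to the train-test performance gap under a 90%/10% split, which can be sketched generically in R (df and the binary 0/1 outcome y are placeholders; this is not the CARRoT package's own interface):

    set.seed(1)
    idx <- sample(nrow(df), 0.9 * nrow(df))
    fit <- glm(y ~ ., data = df[idx, ], family = binomial)
    acc <- function(part) mean((predict(fit, part, type = "response") > 0.5) == part$y)
    acc(df[idx, ]) - acc(df[-idx, ])  # accuracy drop from training to validation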
