76 datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
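    As a rough illustration of the difference between time-split and random selection, the sketch below holds out the most recent compounds as the test set. It assumes each compound record carries a date (the abstract does not say which date is used); the record layout, the `test_fraction` parameter and the function names are hypothetical.

```python
import random
from datetime import date


def time_split(records, test_fraction=0.25):
    """Hold out the most recent records as the test set (time-split selection)."""
    ordered = sorted(records, key=lambda r: r["date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def random_split(records, test_fraction=0.25, seed=0):
    """Hold out a random subset as the test set (random selection)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


# Made-up compound records; only the date ordering matters here.
compounds = [
    {"id": f"cmpd-{i}", "date": date(2010 + i // 4, 1 + i % 4, 1)}
    for i in range(12)
]

train, test = time_split(compounds)
rand_train, rand_test = random_split(compounds)
```

    With a time split, every test compound is newer than every training compound, which mimics predicting compounds that did not exist when the model was built. A random split gives no such guarantee.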

  2. Replication data for: "Split Decisions: Household Finance When a Policy...

    • dataverse.iza.org
    • dataverse.harvard.edu
    Updated Jul 11, 2024
    Cite
    Michael A. Clemens; Erwin R. Tiongson (2024). Replication data for: "Split Decisions: Household Finance When a Policy Discontinuity Allocates Overseas Work" [Dataset]. http://doi.org/10.7910/DVN/2DO8QP
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Research Data Center of IZA (IDSC)
    Authors
    Michael A. Clemens; Erwin R. Tiongson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Clemens, Michael A., and Tiongson, Erwin R., (2017) "Split Decisions: Household Finance When a Policy Discontinuity Allocates Overseas Work." Review of Economics and Statistics 99:3, 531-543.

  3. Data from: Regression with Empirical Variable Selection: Description of a...

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R²), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R² values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
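    The procedure described above can be sketched roughly as follows. The abstract does not say how "empirical support" is quantified, so this sketch assumes it is the sum of Akaike weights over all subsets that contain a variable; that choice, the Gaussian AIC formula, the function names and the synthetic data are all assumptions for illustration, not the authors' implementation.

```python
import itertools

import numpy as np


def ols_aic(X, y):
    """AIC of an ordinary least-squares fit with intercept (Gaussian errors)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k


def variable_support(X, y):
    """Assumed support measure: summed Akaike weight of the subsets containing each variable."""
    p = X.shape[1]
    subsets = [s for r in range(1, p + 1)
               for s in itertools.combinations(range(p), r)]
    aics = np.array([ols_aic(X[:, list(s)], y) for s in subsets])
    weights = np.exp(-0.5 * (aics - aics.min()))
    weights /= weights.sum()
    support = np.zeros(p)
    for s, w in zip(subsets, weights):
        support[list(s)] += w
    return support


def revs_models(X, y):
    """Nested models: best-supported variable, then the best two, and so on."""
    order = [int(j) for j in np.argsort(-variable_support(X, y))]
    return [(order[:m], ols_aic(X[:, order[:m]], y))
            for m in range(1, len(order) + 1)]


# Synthetic data: only variables 0 and 2 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

models = revs_models(X, y)  # list of (variable indices, AIC), one per model size
```

    The result is one candidate model per predictor count, which matches the "n = the number of predictor variables" property mentioned above. All-subsets regression grows as 2^p, so this sketch is only practical for a small number of predictors.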

  4. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 13, 2025
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations with this dataset include:

    • All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    • Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = “yes” column of site_id_tile_hv_crosswalk.csv).
    • Temperature data were not extracted from satellite images with more than 90% cloud cover.
    • Temperature data represents skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing limitations with this dataset:

    • Identifying and removing unrealistic temperature estimates:
      • Calculate total percentage of cloud pixels over a given waterbody as: percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
      • Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
      • Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    • Handling waterbodies split between multiple tiles:
      • These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = “yes”). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    The data release contains the following files:

    • "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files with the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
    • "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data is used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data is not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    • "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    • "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD Tile grid.
    • "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD Tile grid. This file also includes a column (multiple_tiles) to identify site_id's that fall within multiple Landsat ARD Tile grids.
    • "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal – vertical ID.
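    The filtering and tile-combination suggestions above can be sketched in plain Python, shown here on a few made-up rows rather than the real .parquet files. The columns wb_dswe1_pixels, wb_dswe9_pixels and dp_dswe come from the description; the temperature column name `wb_temp_mean`, the site IDs, the thresholds' defaults and the reading of "spatially weighting" as a pixel-weighted mean are assumptions.

```python
def keep_row(row, max_cloud_fraction=0.1, min_water_pixels=10):
    """Apply the three suggested filters to one waterbody/scene row."""
    water = row["wb_dswe1_pixels"]
    cloud = row["wb_dswe9_pixels"]
    # Drop rows with too few water pixels or a deepest point not classed as water.
    if water < min_water_pixels or row["dp_dswe"] != 1:
        return False
    # percent_cloud_pixels as defined in the description (a fraction in [0, 1]).
    percent_cloud_pixels = cloud / (cloud + water)
    return percent_cloud_pixels <= max_cloud_fraction


def combine_sections(rows, value_key="wb_temp_mean"):
    """Pixel-weighted mean across sections of one waterbody split over tiles."""
    total_pixels = sum(r["wb_dswe1_pixels"] for r in rows)
    return sum(r[value_key] * r["wb_dswe1_pixels"] for r in rows) / total_pixels


# Made-up rows; "wb_temp_mean" is a hypothetical column name.
rows = [
    {"site_id": "A", "wb_dswe1_pixels": 300, "wb_dswe9_pixels": 0, "dp_dswe": 1, "wb_temp_mean": 20.0},
    {"site_id": "A", "wb_dswe1_pixels": 100, "wb_dswe9_pixels": 5, "dp_dswe": 1, "wb_temp_mean": 24.0},
    {"site_id": "B", "wb_dswe1_pixels": 8, "wb_dswe9_pixels": 0, "dp_dswe": 1, "wb_temp_mean": 18.0},
    {"site_id": "C", "wb_dswe1_pixels": 50, "wb_dswe9_pixels": 50, "dp_dswe": 1, "wb_temp_mean": 35.0},
]

kept = [r for r in rows if keep_row(r)]
combined_a = combine_sections([r for r in kept if r["site_id"] == "A"])
```

    As the description cautions, sections of a split waterbody may have data on different dates, so in practice the rows passed to `combine_sections` would also need to share a scene date.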

  5. Glaucoma Dataset: EyePACS-AIROGS-light-V2

    • kaggle.com
    zip
    Updated Mar 9, 2024
    Cite
    Riley Kiefer (2024). Glaucoma Dataset: EyePACS-AIROGS-light-V2 [Dataset]. https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2/code
    Available download formats: zip (549533071 bytes)
    Dataset updated
    Mar 9, 2024
    Authors
    Riley Kiefer
    Description

    News: Now with a 10.0 Kaggle usability score: supplemental metadata.csv file added to dataset.

    Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. This dataset is split into training, validation, and test folders which contain 4000 (~84%), 385 (~8%), and 385 (~8%) fundus images in each class respectively. Each of these folders has a subfolder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to easily benchmark your glaucoma classification models in Kaggle. Please make a contribution in the code tab; I have created a template to make it even easier!

    Please cite the dataset and at least the first of my related works if you found this dataset useful!

    • Riley Kiefer. "EyePACS-AIROGS-light-V2". Kaggle, 2024, doi: 10.34740/KAGGLE/DSV/7802508.
    • Riley Kiefer. "EyePACS-AIROGS-light-V1". Kaggle, 2023, doi: 10.34740/kaggle/ds/3222646.
    • Riley Kiefer. "Standardized Multi-Channel Dataset for Glaucoma, v19 (SMDG-19)". Kaggle, 2023, doi: 10.34740/kaggle/ds/2329670
    • Steen, J., Kiefer, R., Ardali, M., Abid, M. & Amjadian, E. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications. Invest. Ophthalmol. Vis. Sci. 64, 384–384 (2023).
    • Amjadian, E., Ardali, M. R., Kiefer, R., Abid, M. & Steen, J. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection. Invest. Ophthalmol. Vis. Sci. 64, 392–392 (2023).
    • R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
    • Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
    • R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.
    • E. Amjadian, R. Kiefer, J. Steen, M. Abid, M. Ardali, "A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection". American Academy of Optometry. 2022.

    Improvements from v1:

    • According to an ablation study on the image standardization methods applied to dataset v1 [3], images are standardized according to the CROP methodology (remove black background before resizing). This method yields more of the actual fundus foreground in the resultant image.
    • Increased the image resize dimensions from 256x256 pixels to 512x512 pixels. Reason: provides greater model input flexibility, detail, and size. This also better supports the ONH-cropping models.
    • Added 3000 images from the Rotterdam EyePACS AIROGS dev set. Reason: more data samples can improve model generalizability.
    • Readjusted train/val/test split. Reason: the validation and test split sizes were different.
    • Improved sampling from source dataset. Reason: v1 NRG samples were not randomly selected.

    Drawbacks of Rotterdam EyePACS AIROGS: One of the largest drawbacks of the original dataset is its accessibility. The dataset requires a long download and a large amount of storage space, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). The dataset also contains raw fundus images in their original dimensions; these original images often contain a large amount of black background and the dimensions are too large for machine learning inputs. The proposed dataset addresses these concerns by image sampling and image standardization, which balance the dataset and reduce its size respectively.

    Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].

    About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...

  6. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Available download formats: zip (121472 bytes)
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The objective of this analysis was to understand which predictors contribute to life expectancy around the world. I used Linear Regression, Decision Tree and Random Forest for this purpose.

    Steps involved:

    • Read the csv file.
    • Data cleaning:
      • The variables Country and Status had character data types and were converted to factors.
      • 2563 missing values were found, with the Population variable having the most (652).
      • Rows with missing values were dropped before running the analysis.
    • Run linear regression:
      • Before running the regression, 3 variables (Country, Year and Status) were dropped because they had little effect on the dependent variable, Life Expectancy. This leaves 19 variables (1 dependent and 18 independent).
      • We run the linear regression. Multiple R squared is 83%, which means that the independent variables explain 83% of the variance in the dependent variable.
      • OUTLIER DETECTION. We check for outliers using the IQR and find 54 outliers. These are removed before we run the regression again. Multiple R squared increases from 83% to 86%.
      • MULTICOLLINEARITY. We check for multicollinearity using the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with a VIF above 5 should be removed. We find 6 variables with a VIF higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19 and thinness5.9. Infant deaths and Under Five deaths are strongly collinear, so we drop Infant deaths (which has the higher VIF).
      • When we run the linear regression model again, the VIF of Under.Five.Deaths goes down from 211.46 to 2.74, while the VIF values of the other variables change very little. Variable thinness1.19 is now dropped and we run the regression once more.
      • The VIF of thinness5.9, which was 7.61, has now dropped to 1.95. GDP and Population still have a VIF above 5, but I decided against dropping them as I consider them important independent variables.
      • SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We fit the model on the train data and get a multiple R squared of 86% and a p-value below alpha, which indicates that the model is statistically significant. We use the model fitted on the train data to predict the test data and compute the RMSE and MAPE, using library(Metrics).
      • In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
      • MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037).
      • MAE (Mean Absolute Error) is 2.55. This indicates that on an average, the predicted values deviate by approximately 2.83 years from the actual values.
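    For reference, the three error metrics quoted above can be written as plain functions. This is a minimal Python sketch (the original analysis used R), with MAPE returned as a fraction to match the 0.037-style values in the text; the sample numbers are made up and are not taken from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (actual values must be non-zero)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up life expectancy values in years.
actual = [70.0, 65.0, 80.0, 50.0]
predicted = [72.0, 64.0, 77.0, 55.0]

scores = {
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
    "mape": mape(actual, predicted),
}
```

    RMSE squares the errors before averaging, so it is never smaller than MAE on the same predictions and it penalizes large misses more heavily.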

    We use DECISION TREE MODEL for the analysis.

    • Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.
    • We use 5 fold cross validation method with CP (complexity parameter) being 0.01.
    • In Decision Tree, RMSE (Root Mean Squared Error) is 3.06. This indicates that, on average, the predicted values have an error of 3.06 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).
    • MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest)
    • We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.
    • Predict Life expectancy through random forest model.
    • In Random Forest, RMSE (Root Mean Squared Error) is 1.73. This indicates that, on average, the predicted values have an error of 1.73 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).
    • MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.

  7. Data from: A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image...

    • zenodo.org
    • data-staging.niaid.nih.gov
    zip
    Updated Feb 1, 2024
    Cite
    Roser Viñals; Roser Viñals; Jean-Philippe Thiran; Jean-Philippe Thiran (2024). A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (2/6) [Dataset]. http://doi.org/10.5281/zenodo.10591473
    Available download formats: zip
    Dataset updated
    Feb 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Roser Viñals; Roser Viñals; Jean-Philippe Thiran; Jean-Philippe Thiran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.

    Due to data size restriction, the dataset has been divided into six subdatasets, each one published into a separate entry in Zenodo. This repository contains subdataset 2.

    Structure

    In Vivo Data

    • Number of Acquisitions: 20,000

    • Volunteers: Nine volunteers

    • File Structure: Each volunteer's data is compressed in a separate zip file.

      • Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
    • Regions :

      • Abdomen: 6599 acquisitions
      • Neck: 3294 acquisitions
      • Breast: 3291 acquisitions
      • Lower limbs: 2616 acquisitions
      • Upper limbs: 2110 acquisitions
      • Back: 2090 acquisitions
    • File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.

    In Vitro Data

    • Number of Acquisitions: 32 from CIRS model 054G phantom
    • File Structure: The in vitro data is compressed in the cirs-phantom.zip file.
    • File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.

    CSV Files

    Two CSV files are provided:

    • invivo_dataset.csv :

      • Contains a list of all in vivo acquisitions.
      • Columns: id, path, volunteer id, body region.
    • invitro_dataset.csv :

      • Contains a list of all in vitro acquisitions.
      • Columns: id, path

    Zenodo dataset splits and files

    The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 2nd split.

    File name                     Size       Zenodo subdataset number
    invivo_dataset.csv            995.9 kB   1
    invitro_dataset.csv           1.1 kB     1
    cirs-phantom.zip              418.2 MB   1
    volunteer-1-lowerLimbs.zip    29.7 GB    1
    volunteer-1-carotids.zip      8.8 GB     1
    volunteer-1-back.zip          7.1 GB     1
    volunteer-1-abdomen.zip       34.0 GB    2
    volunteer-1-breast.zip        15.7 GB    2
    volunteer-1-upperLimbs.zip    25.0 GB    3
    volunteer-2.zip               26.5 GB    4
    volunteer-3.zip               20.3 GB    3
    volunteer-4.zip               24.1 GB    5
    volunteer-5.zip               6.5 GB     5
    volunteer-6.zip               11.5 GB    5
    volunteer-7.zip               11.1 GB    6
    volunteer-8.zip               21.2 GB    6
    volunteer-9.zip               23.2 GB    4

    Normalized RF Images

    • Beamforming:

      • Depth from 1 mm to 55 mm

      • Width spanning the probe aperture

      • Grid: 𝜆/8 × 𝜆/8

      • Resulting images shape: 1483 × 1189

      • Two beamformed RF images from each acquisition:

        • Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)
        • Target image: coherently compounded image from 87 PW acquisitions steered at different angles (acquisition-xxxx-87PWs)
    • Normalization:

      • The two RF images have been normalized
    • To display the images:

      • Perform the envelope detection (to obtain the IQ images)
      • Log-compress (to obtain the B-mode images)
    • File Format: Saved in npy format, loadable using Python and numpy.load(file).
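    The display steps listed above (envelope detection, then log compression) could look roughly like the sketch below. It uses an FFT-based analytic signal in NumPy and a synthetic RF pulse instead of a real file. The choice of axis 0 as the depth axis, the 60 dB dynamic range and the function names are assumptions; the dataset description does not specify them.

```python
import numpy as np


def analytic_signal(rf, axis=0):
    """FFT-based analytic signal along `axis` (assumed here to be the depth axis)."""
    n = rf.shape[axis]
    spectrum = np.fft.fft(rf, axis=axis)
    # Keep DC (and Nyquist), double positive frequencies, zero negative ones.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    shape = [1] * rf.ndim
    shape[axis] = n
    return np.fft.ifft(spectrum * h.reshape(shape), axis=axis)


def to_bmode(rf, dynamic_range_db=60.0, axis=0):
    """Envelope detection followed by log compression, in dB relative to the maximum."""
    envelope = np.abs(analytic_signal(rf, axis=axis))
    envelope = envelope / envelope.max()
    bmode_db = 20.0 * np.log10(np.maximum(envelope, 1e-12))
    return np.clip(bmode_db, -dynamic_range_db, 0.0)


# Synthetic RF data: a Gaussian-windowed pulse centred at depth sample 256.
# A real image would instead be loaded with numpy.load(file).
depth = np.arange(512)
pulse = np.exp(-((depth - 256) / 40.0) ** 2) * np.cos(2 * np.pi * 0.2 * depth)
rf = pulse[:, None] * np.ones((1, 4))

bmode = to_bmode(rf)
```

    The resulting array holds values between -60 dB and 0 dB and can be passed to any image display routine with a grayscale colour map.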

    Training and Validation Split in the paper

    For the volunteer-based split used in the paper:

    • Training set: volunteers 1, 2, 3, 6, 7, 9
    • Validation set: volunteer 4
    • Test set: volunteers 5, 8
    • Images analyzed in the paper
      • Carotid acquisition (from volunteer 5): acquisition_12397
      • Back acquisition (from volunteer 8): acquisition_19764
      • In vitro acquisition: invitro-00030
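    The volunteer-based split above can be reproduced by mapping each acquisition's volunteer to a subset. The sketch below assumes rows shaped like invivo_dataset.csv with keys named `id` and `volunteer id`; the exact column spellings are an assumption, and the two `example_*` rows are made up. The two acquisitions named in the description are used as the test-set rows.

```python
# Volunteer-to-subset mapping as stated in the dataset description.
SPLIT_BY_VOLUNTEER = {
    1: "train", 2: "train", 3: "train", 6: "train", 7: "train", 9: "train",
    4: "val",
    5: "test", 8: "test",
}


def assign_split(rows, volunteer_key="volunteer id"):
    """Group acquisition ids by train/val/test according to their volunteer."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        subset = SPLIT_BY_VOLUNTEER[int(row[volunteer_key])]
        splits[subset].append(row["id"])
    return splits


rows = [
    {"id": "example_train", "volunteer id": "1"},      # made-up row
    {"id": "example_val", "volunteer id": "4"},        # made-up row
    {"id": "acquisition_12397", "volunteer id": "5"},  # from the description
    {"id": "acquisition_19764", "volunteer id": "8"},  # from the description
]

splits = assign_split(rows)
```

    Splitting by volunteer rather than by acquisition keeps all images of a given person in a single subset, so the test set contains only unseen volunteers.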

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Please cite the original paper when using this dataset :

    Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256

    Contact

    For inquiries or issues related to this dataset, please contact:

    • Name: Roser Viñals
    • Email: roser.vinalsterres@epfl.ch
  8. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image...

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Feb 2, 2024
    Cite
    Viñals, Roser; Thiran, Jean-Philippe (2024). A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (5/6) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10591705
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    École Polytechnique Fédérale de Lausanne
    Authors
    Viñals, Roser; Thiran, Jean-Philippe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.

    Due to data size restriction, the dataset has been divided into six subdatasets, each one published into a separate entry in Zenodo. This repository contains subdataset 5.

    Structure

    In Vivo Data

    • Number of Acquisitions: 20,000
    • Volunteers: Nine volunteers
    • File Structure: Each volunteer's data is compressed in a separate zip file.
      • Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
    • Regions:
      • Abdomen: 6599 acquisitions
      • Neck: 3294 acquisitions
      • Breast: 3291 acquisitions
      • Lower limbs: 2616 acquisitions
      • Upper limbs: 2110 acquisitions
      • Back: 2090 acquisitions
    • File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.

    In Vitro Data

    Number of Acquisitions: 32 from CIRS model 054G phantom

    File Structure: The in vitro data is compressed in the cirs-phantom.zip file.

    File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.

    CSV Files

    Two CSV files are provided:

    • invivo_dataset.csv:
      • Contains a list of all in vivo acquisitions.
      • Columns: id, path, volunteer id, body region.
    • invitro_dataset.csv:
      • Contains a list of all in vitro acquisitions.
      • Columns: id, path

    Zenodo dataset splits and files

    The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 5th split.

    File name                     Size       Zenodo subdataset number
    invivo_dataset.csv            995.9 kB   1
    invitro_dataset.csv           1.1 kB     1
    cirs-phantom.zip              418.2 MB   1
    volunteer-1-lowerLimbs.zip    29.7 GB    1
    volunteer-1-carotids.zip      8.8 GB     1
    volunteer-1-back.zip          7.1 GB     1
    volunteer-1-abdomen.zip       34.0 GB    2
    volunteer-1-breast.zip        15.7 GB    2
    volunteer-1-upperLimbs.zip    25.0 GB    3
    volunteer-2.zip               26.5 GB    4
    volunteer-3.zip               20.3 GB    3
    volunteer-4.zip               24.1 GB    5
    volunteer-5.zip               6.5 GB     5
    volunteer-6.zip               11.5 GB    5
    volunteer-7.zip               11.1 GB    6
    volunteer-8.zip               21.2 GB    6
    volunteer-9.zip               23.2 GB    4

    Normalized RF Images

    Beamforming:

    Depth from 1 mm to 55 mm

    Width spanning the probe aperture

    Grid: 𝜆/8 × 𝜆/8

    Resulting images shape: 1483 × 1189

    Two beamformed RF images from each acquisition:

    Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)

    Target image: coherently compounded image from 87 PWs acquisitions steered at different angles (acquisition-xxxx-87PWs)

    Normalization:

    The two RF images have been normalized

    To display the images:

    Perform the envelope detection (to obtain the IQ images)

    Log-compress (to obtain the B-mode images)

    File Format: Saved in npy format, loadable using Python and numpy.load(file).
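    The display steps listed above (envelope detection, then log compression) can be sketched as follows. This is a minimal illustration, not the authors' processing code: it assumes the depth axis is axis 0, uses an FFT-based Hilbert transform as one common way to obtain the analytic (IQ) signal, and picks a 60 dB dynamic range arbitrarily. The function names and the synthetic input are hypothetical; a real image would be read with numpy.load(file) instead.

```python
import numpy as np

def analytic_signal(rf, axis=0):
    """Analytic signal of a real array along `axis` (FFT-based Hilbert transform)."""
    n = rf.shape[axis]
    spectrum = np.fft.fft(rf, axis=axis)
    # Keep DC (and Nyquist for even n), double positive frequencies, zero negative ones.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    shape = [1] * rf.ndim
    shape[axis] = n
    return np.fft.ifft(spectrum * weights.reshape(shape), axis=axis)

def rf_to_bmode(rf, dynamic_range_db=60.0, axis=0):
    """Envelope detection followed by log compression (values in dB, at most 0)."""
    iq = analytic_signal(rf, axis=axis)           # complex IQ-like image
    envelope = np.abs(iq)                         # envelope detection
    envelope = envelope / max(envelope.max(), 1e-12)
    bmode = 20.0 * np.log10(np.maximum(envelope, 1e-12))
    return np.clip(bmode, -dynamic_range_db, 0.0)

# Synthetic stand-in for an RF image (assumption: rows are depth samples).
# With the real data one would instead call: rf = np.load(file)
depth = np.arange(256)
rf = np.sin(2 * np.pi * 0.125 * depth)[:, None] * np.ones((1, 4))
bmode = rf_to_bmode(rf)
```

    The resulting array can be passed to any image viewer with a grayscale colormap; the dynamic range is a display choice rather than a property of the dataset.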

    Training and Validation Split in the paper

    For the volunteer-based split used in the paper:

    Training set: volunteers 1, 2, 3, 6, 7, 9

    Validation set: volunteer 4

    Test set: volunteers 5, 8

    Images analyzed in the paper

    Carotid acquisition (from volunteer 5): acquisition_12397

    Back acquisition (from volunteer 8): acquisition_19764

    In vitro acquisition: invitro-00030

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Please cite the original paper when using this dataset :

    Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256

    Contact

    For inquiries or issues related to this dataset, please contact:

    Name: Roser Viñals

    Email: roser.vinalsterres@epfl.ch

  9. Data from: Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice:...

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Mason, Georgia; Walker, Michael (2023). Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice: Validating a split-plot design that promotes refinement and reduction [Dataset]. https://search.dataone.org/view/sha256%3A2b1ace7be31b90c0a2cf6859c8ec9dc108595d64d1ead30a0bfe0477100a52a8
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Mason, Georgia; Walker, Michael
    Time period covered
    May 1, 2013 - Aug 1, 2013
    Description

    This dataset validates a novel housing method for inbred mice: mixed-strain housing. The study tested whether this housing method affected strain-typical mouse phenotypes, whether variance in the data was affected, and how statistical power was increased through this split-plot design.

  10. Data from: FFT-split-operator code for solving the Dirac equation in 2+1...

    • elsevier.digitalcommonsdata.com
    Updated Jun 1, 2008
    Guido R. Mocken (2008). FFT-split-operator code for solving the Dirac equation in 2+1 dimensions [Dataset]. http://doi.org/10.17632/43v3vvkwwf.1
    Explore at:
    Dataset updated
    Jun 1, 2008
    Authors
    Guido R. Mocken
    License

    https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/

    Description

    Abstract: The main part of the code presented in this work represents an implementation of the split-operator method [J.A. Fleck, J.R. Morris, M.D. Feit, Appl. Phys. 10 (1976) 129-160; R. Heather, Comput. Phys. Comm. 63 (1991) 446] for calculating the time-evolution of Dirac wave functions. It allows the study of the dynamics of electronic Dirac wave packets under the influence of any number of laser pulses and their interaction with any number of charged ion potentials. The initial wave function can be eith...

    Title of program: Dirac++ or (abbreviated) d++ Catalogue Id: AEAS_v1_0

    Nature of problem: The relativistic time evolution of wave functions according to the Dirac equation is a challenging numerical task. Especially for an electron in the presence of high intensity laser beams and/or highly charged ions, this type of problem is of considerable interest to atomic physicists.

    Versions of this program held in the CPC repository in Mendeley Data AEAS_v1_0; Dirac++ or (abbreviated) d++; 10.1016/j.cpc.2008.01.042

    This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)
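
    To give a feel for the split-operator idea named in the abstract above, here is a small sketch. It is not the Dirac++ code and does not solve the Dirac equation: as a simpler stand-in it propagates a one-dimensional Schrödinger wave packet (units with hbar = m = 1) in a harmonic potential, alternating a half step of the potential in position space with a full kinetic step in Fourier space. The grid size, time step, and potential are arbitrary illustrative choices.

```python
import numpy as np

def split_operator_step(psi, potential, k, dt):
    """One symmetric (Strang) split step: V/2 in x-space, T in k-space, V/2 in x-space."""
    half_potential = np.exp(-0.5j * potential * dt)
    kinetic = np.exp(-0.5j * k**2 * dt)            # exp(-i k^2 dt / 2) for hbar = m = 1
    psi = half_potential * psi
    psi = np.fft.ifft(kinetic * np.fft.fft(psi))   # FFT diagonalizes the kinetic term
    return half_potential * psi

# Spatial grid and matching wavenumbers (illustrative values).
n, length = 512, 40.0
x = np.linspace(-length / 2, length / 2, n, endpoint=False)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(n, d=dx)

potential = 0.5 * x**2                              # harmonic oscillator
psi = np.exp(-(x - 1.0) ** 2 / 2).astype(complex)   # displaced Gaussian packet
psi /= np.sqrt(np.sum(np.abs(psi) ** 2) * dx)       # normalize

dt, n_steps = 0.01, 1000
for _ in range(n_steps):
    psi = split_operator_step(psi, potential, k, dt)

norm = np.sum(np.abs(psi) ** 2) * dx                # should stay close to 1
mean_x = np.sum(x * np.abs(psi) ** 2) * dx          # packet centre after t = 10
```

    Each factor is a pure phase, so every step is unitary and the norm is conserved up to rounding. The program described above applies the same splitting to Dirac wave functions in 2+1 dimensions, where the wave function has several components.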

  11. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image...

    • data-staging.niaid.nih.gov
    Updated Feb 2, 2024
    Viñals, Roser; Thiran, Jean-Philippe (2024). A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (3/6) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10591693
    Explore at:
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    École Polytechnique Fédérale de Lausanne
    Authors
    Viñals, Roser; Thiran, Jean-Philippe
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.

    Due to data size restriction, the dataset has been divided into six subdatasets, each one published into a separate entry in Zenodo. This repository contains subdataset 3.

    Structure

    In Vivo Data

    Number of Acquisitions: 20,000

    Volunteers: Nine volunteers

    File Structure: Each volunteer's data is compressed in a separate zip file.

    Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.

    Regions :

    Abdomen: 6599 acquisitions

    Neck: 3294 acquisitions

    Breast: 3291 acquisitions

    Lower limbs: 2616 acquisitions

    Upper limbs: 2110 acquisitions

    Back: 2090 acquisitions

    File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.

    In Vitro Data

    Number of Acquisitions: 32 from CIRS model 054G phantom

    File Structure: The in vitro data is compressed in the cirs-phantom.zip file.

    File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.

    CSV Files

    Two CSV files are provided:

    invivo_dataset.csv :

    Contains a list of all in vivo acquisitions.

    Columns: id, path, volunteer id, body region.

    invitro_dataset.csv :

    Contains a list of all in vitro acquisitions.

    Columns: id, path

    Zenodo dataset splits and files

    The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 3rd split.

    File name                     Size       Zenodo subdataset number
    invivo_dataset.csv            995.9 kB   1
    invitro_dataset.csv           1.1 kB     1
    cirs-phantom.zip              418.2 MB   1
    volunteer-1-lowerLimbs.zip    29.7 GB    1
    volunteer-1-carotids.zip      8.8 GB     1
    volunteer-1-back.zip          7.1 GB     1
    volunteer-1-abdomen.zip       34.0 GB    2
    volunteer-1-breast.zip        15.7 GB    2
    volunteer-1-upperLimbs.zip    25.0 GB    3
    volunteer-2.zip               26.5 GB    4
    volunteer-3.zip               20.3 GB    3
    volunteer-4.zip               24.1 GB    5
    volunteer-5.zip               6.5 GB     5
    volunteer-6.zip               11.5 GB    5
    volunteer-7.zip               11.1 GB    6
    volunteer-8.zip               21.2 GB    6
    volunteer-9.zip               23.2 GB    4

    Normalized RF Images

    Beamforming:

    Depth from 1 mm to 55 mm

    Width spanning the probe aperture

    Grid: 𝜆/8 × 𝜆/8

    Resulting images shape: 1483 × 1189

    Two beamformed RF images from each acquisition:

    Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)

    Target image: coherently compounded image from 87 PWs acquisitions steered at different angles (acquisition-xxxx-87PWs)

    Normalization:

    The two RF images have been normalized

    To display the images:

    Perform the envelope detection (to obtain the IQ images)

    Log-compress (to obtain the B-mode images)

    File Format: Saved in npy format, loadable using Python and numpy.load(file).

    Training and Validation Split in the paper

    For the volunteer-based split used in the paper:

    Training set: volunteers 1, 2, 3, 6, 7, 9

    Validation set: volunteer 4

    Test set: volunteers 5, 8
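
    The volunteer-based split above can be reproduced from the in vivo CSV listing with a few lines of standard-library Python. This is a sketch under stated assumptions: the exact header spelling ("volunteer id") is guessed from the column list given earlier, and the sample rows, ids and paths below are invented for illustration rather than taken from invivo_dataset.csv.

```python
import csv
import io

# Volunteer assignments as listed in the dataset description.
TRAIN = {1, 2, 3, 6, 7, 9}
VALIDATION = {4}
TEST = {5, 8}

def split_by_volunteer(rows, volunteer_column="volunteer id"):
    """Group CSV rows into train/validation/test according to the volunteer id."""
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        volunteer = int(row[volunteer_column])
        if volunteer in TRAIN:
            splits["train"].append(row)
        elif volunteer in VALIDATION:
            splits["validation"].append(row)
        elif volunteer in TEST:
            splits["test"].append(row)
        else:
            raise ValueError(f"unexpected volunteer id: {volunteer}")
    return splits

# Invented sample rows; with the real file, open invivo_dataset.csv instead.
sample_csv = """id,path,volunteer id,body region
example_a,example/a.npy,1,abdomen
example_b,example/b.npy,4,back
example_c,example/c.npy,5,carotids
example_d,example/d.npy,8,back
example_e,example/e.npy,9,breast
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
splits = split_by_volunteer(rows)
```

    Splitting by volunteer rather than by acquisition keeps all images of one person in a single subset, which is what the listing above describes.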

    Images analyzed in the paper

    Carotid acquisition (from volunteer 5): acquisition_12397

    Back acquisition (from volunteer 8): acquisition_19764

    In vitro acquisition: invitro-00030

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Please cite the original paper when using this dataset :

    Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256

    Contact

    For inquiries or issues related to this dataset, please contact:

    Name: Roser Viñals

    Email: roser.vinalsterres@epfl.ch

  12. Val split & vocab file

    • kaggle.com
    zip
    Updated Jul 6, 2024
    Devi Hemamalini R (2024). Val split & vocab file [Dataset]. https://www.kaggle.com/datasets/devihemamalinir/val-split-and-vocab-file
    Explore at:
    Available download formats: zip (1603266139 bytes)
    Dataset updated
    Jul 6, 2024
    Authors
    Devi Hemamalini R
    Description

    Dataset

    This dataset was created by Devi Hemamalini R

    Contents

  13. Data from: Long-term spatial memory, across large spatial scales, in...

    • data.niaid.nih.gov
    Updated May 30, 2023
    Priscila A Moura; Fletcher J Young; Monica Monllor; Marcio Z Cardoso; Stephen H Montgomery (2023). Long-term spatial memory, across large spatial scales, in Heliconius butterflies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7985235
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset provided by
    Departamento de Ecologia, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil
    Departamento de Ecologia, Universidade Federal do Rio Grande do Norte, Natal, RN, Brazil
    School of Biological Sciences, University of Bristol, Bristol, UK
    Authors
    Priscila A Moura; Fletcher J Young; Monica Monllor; Marcio Z Cardoso; Stephen H Montgomery
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data accompanying "Long-term spatial memory, across large spatial scales, in Heliconius butterflies", Current Biology 2023:

    exp1.csv. Behavioural data from experiment 1.

    exp2.csv. Behavioural data from experiment 2.

    exp3.csv. Behavioural data from experiment 3.

    Exp1&2.csv. Behavioural data comparing experiment 1 and 2.

    Exp1byDay.csv. Behavioural data for experiment 1 split by day.

    Exp2byDay.csv. Behavioural data for experiment 2 split by day.

    Exp3byDay.csv. Behavioural data for experiment 3 split by day.

    exp1.R. R code for experiment 1 analysis.

    exp2.R. R code for experiment 2 analysis.

    exp3.R. R code for experiment 3 analysis.

    exp1vsExp2.R. R code for comparing experiment 1 and 2.

  14. Multi Dataset Phishing

    • kaggle.com
    zip
    Updated Oct 31, 2025
    Yasir Hussein Shakir (2025). Multi Dataset Phishing [Dataset]. https://www.kaggle.com/datasets/yasserhessein/multi-dataset-phishing
    Explore at:
    Available download formats: zip (441159 bytes)
    Dataset updated
    Oct 31, 2025
    Authors
    Yasir Hussein Shakir
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    1- The Zieni Dataset (2024): This is a recent, balanced dataset comprising 10,000 websites, with 5,000 phishing and 5,000 legitimate samples. The phishing URLs were sourced from PhishTank and Tranco, while legitimate URLs came from Alexa. Each of the 10,000 instances is characterized by 74 features, with 70 being numerical and 4 binary. These features comprehensively describe various components of a URL, including the domain, path, filename, and parameters.

    2- The UCI Phishing Websites Dataset: This dataset contains 11,055 website instances, each labeled as either phishing (1) or legitimate (-1). It provides 30 diverse features that capture address bar characteristics, domain-based attributes, and other HTML and JavaScript elements (e.g., prefix-suffix, google_index, iframe, https_token). The data was aggregated from several reputable sources, including the PhishTank and MillerSmiles archives.

    3- The Mendeley Phishing Dataset: This dataset includes 10,000 webpages, evenly split between phishing and legitimate categories. It describes each sample using 48 features. The data was collected in two periods: from January to May 2015 and from May to June 2017.

    References [1] R. Zieni, “Zieni dataset for Phishing detection,” vol. 1, 2024. doi: 10.17632/8MCZ8JSGNB.1. [2] R. Mohammad et al., “An assessment of features related to phishing websites using an automated technique,” in International Conference for Internet Technology and Secured Transactions, 2012. [3] C. L. Tan, “Phishing Dataset for Machine Learning: Feature Evaluation,” vol. 1, 2018. doi: 10.17632/H3CGNJ8HFT.1.

  15. madelon

    • openml.org
    Updated May 22, 2015
    (2015). madelon [Dataset]. https://www.openml.org/d/1485
    Explore at:
    Dataset updated
    May 22, 2015
    Description

    Author: Isabelle Guyon
    Source: UCI
    Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

    Abstract:

    MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

    Source:

    Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 90708 isabelle '@' clopinet.com

    Data Set Information:

    MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). A number of distractor features called 'probes', having no predictive power, were also added. The order of the features and patterns was randomized.

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets. Target values are provided only for the first two sets (not for the test set), so this dataset version contains all the examples from the training and validation partitions.

    There is no attribute information provided to avoid biasing the feature selection process.
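
    The construction described above can be imitated in a few lines of NumPy. This is a hedged sketch of a MADELON-like generator, not the original challenge code: the number of points per cluster, the cluster spread, the number of probes, and the Gaussian noise model are all assumptions made for illustration.

```python
import itertools
import numpy as np

def make_madelon_like(n_per_cluster=10, n_probes=30, cluster_std=0.3, seed=0):
    """Generate a small dataset following the recipe in the description (illustrative)."""
    rng = np.random.default_rng(seed)
    # 2**5 = 32 cluster centres on the vertices of a five-dimensional hypercube.
    vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=5)))
    vertex_labels = rng.choice([-1, 1], size=len(vertices))   # random +1 / -1 per cluster
    centers = np.repeat(vertices, n_per_cluster, axis=0)
    y = np.repeat(vertex_labels, n_per_cluster)
    # 5 informative features: cluster centre plus noise.
    informative = centers + cluster_std * rng.standard_normal(centers.shape)
    # 15 redundant features: random linear combinations of the informative ones.
    redundant = informative @ rng.standard_normal((5, 15))
    # Probes: pure noise with no predictive power.
    probes = rng.standard_normal((len(y), n_probes))
    X = np.hstack([informative, redundant, probes])
    X = X[:, rng.permutation(X.shape[1])]      # randomize feature order
    order = rng.permutation(len(y))            # randomize pattern order
    return X[order], y[order]

X, y = make_madelon_like()
```

    Because labels are assigned per cluster at random, no single feature separates the classes, which is the multivariate, non-linear difficulty the abstract refers to.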

    Relevant Papers:

    The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

    Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.

  16. Video game pricing analytics dataset

    • kaggle.com
    Updated Sep 1, 2023
    Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
    Explore at:
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Shivi Deveshwar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The review dataset for three video games (Call of Duty: Black Ops 3, Persona 5 Royal and Counter Strike: Global Offensive) was obtained by web scraping SteamDB [https://steamdb.info/], a large repository for game-related data such as release dates, reviews, prices, and more. In the initial scrape, each individual game has two files: customer reviews (count: 100 reviews) and price time series data.

    To obtain data on the reviews of the selected video games, we performed web scraping using R. The customer reviews dataset contains the date that the review was posted and the review text, while the price dataset contains the date that the price was changed and the price on that date. To clean and prepare the data, we first sectioned the data in Excel. After scraping, our CSV file held each review in one row together with its date. We split the data, separating date and review into separate columns. The price scrape already separated price and date, so after the separation we only made sure that every file had similar column names.

    Afterwards, we used R to finish the cleaning. Each game has a separate file for prices and reviews, so each price file is converted into a continuous time series by carrying the previously available price forward to each date. Then the price dataset is combined with its respective review dataset in R on the common date column using a left join. The resulting dataset for each game contains four columns: game name, date, reviews and price. From there, we allow the user to select the game they would like to view.
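
    The two cleaning steps described above (carrying the last known price forward to every date, then left-joining reviews onto the price series by date) can be sketched as follows. The authors worked in R; this is an illustrative Python equivalent using only the standard library, with invented prices, reviews, and a placeholder game name.

```python
from datetime import date, timedelta

def forward_fill_prices(price_changes, start, end):
    """Turn {change_date: price} into a daily series by repeating the last known price."""
    filled = {}
    current = None                      # stays None until the first price change
    day = start
    while day <= end:
        if day in price_changes:
            current = price_changes[day]
        filled[day] = current
        day += timedelta(days=1)
    return filled

def left_join_reviews(game, daily_prices, reviews):
    """Left join: every price date is kept; dates without a review get review=None."""
    by_date = {}
    for review_date, text in reviews:
        by_date.setdefault(review_date, []).append(text)
    rows = []
    for day, price in daily_prices.items():
        for text in by_date.get(day, [None]):
            rows.append({"game": game, "date": day, "review": text, "price": price})
    return rows

# Invented example data.
price_changes = {date(2023, 1, 1): 59.99, date(2023, 1, 4): 29.99}
reviews = [
    (date(2023, 1, 2), "Great"),
    (date(2023, 1, 4), "Fun"),
    (date(2023, 1, 4), "Buggy"),
]
daily = forward_fill_prices(price_changes, date(2023, 1, 1), date(2023, 1, 5))
table = left_join_reviews("Example Game", daily, reviews)
```

    Each resulting row carries the four columns named above (game name, date, review, price); a date with several reviews produces several rows, and a date with none keeps its price with an empty review.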

  17. Dataset and R code: Genetic diversity of lion populations in Kenya:...

    • search.dataone.org
    • datadryad.org
    Updated Jul 28, 2025
    Mumbi Chege (2025). Dataset and R code: Genetic diversity of lion populations in Kenya: evaluating past management practices and recommendations for future conservation actions by Chege M et.al [Dataset]. http://doi.org/10.5061/dryad.s4mw6m9d8
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Mumbi Chege
    Description

    The decline of lions (Panthera leo) in Kenya has raised conservation concerns about their overall population health and long-term survival. This study aimed to assess the genetic structure, differentiation, and diversity of lion populations in the country, while considering the influence of past management practices. Using a lion-specific Single Nucleotide Polymorphism (SNP) panel, we genotyped 171 individuals from 12 populations representative of areas with permanent lion presence. Our results revealed a distinct genetic pattern with pronounced population structure, confirmed a north-south split, and found no indication of inbreeding in any of the tested populations. Differentiation seems to be primarily driven by geographical barriers, human presence, and climatic factors, but management practices may have also affected the observed patterns. Notably, the Tsavo population displayed evidence of admixture, perhaps attributable to its geographic location as a suture zone, vast size, or to p...

    This dataset was obtained from 12 Kenyan lion populations. After DNA extraction, SNP genotyping was performed using an allele-specific KASP technique. The attached datasets include the .txt and .str versions of the autosomal SNPs to aid in reproducing the results.

    Dataset and R code associated with the publication entitled "Genetic diversity of lion populations in Kenya: evaluating past management practices and recommendations for future conservation actions" by Chege M et al.

    https://doi.org/10.5061/dryad.s4mw6m9d8

    We provide the following description of the dataset and scripts for the analysis carried out in R. We have split the data and scripts for ease of reference, i.e.:

    1.) Script 1, titled ‘Calc_He_Ho_Ar_Fis’, for calculating the genetic diversity indices, i.e. allelic richness (AR), private alleles (AP), inbreeding coefficients (FIS), expected (HE) and observed heterozygosity (HO). This script uses:

    • “data_HoHeAr.txt” dataset. This dataset has information on individual samples, including their geographical area (population) of origin and the corresponding 335 autosomal single nucleotide polymorphism (SNP) reads.

    • ‘shompole2.txt’: this bears the dataset from the Shompol...

  18. UC San Diego Parkinson's Disease rsEEG Dataset

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Sourav Basak Shuvo (2025). UC San Diego Parkinson's Disease rsEEG Dataset [Dataset]. https://www.kaggle.com/datasets/souravbasakshuvo/uc-san-diego-parkinsons-disease-resting-state-eeg/discussion
    Explore at:
    Available download formats: zip (424025470 bytes)
    Dataset updated
    Jun 19, 2025
    Authors
    Sourav Basak Shuvo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    San Diego
    Description

    Welcome to the resting state EEG dataset collected at the University of San Diego and curated by Alex Rockhill at the University of Oregon.

    Please email arockhil@uoregon.edu before submitting a manuscript to be published in a peer-reviewed journal using this data; we wish to ensure that the data is analyzed and interpreted with scientific integrity so as not to mislead the public about findings that may have clinical relevance. The purpose of this is to be responsible stewards of the data without an "available upon reasonable request" clause that we feel doesn't fully represent the open-source, reproducible ethos. The data is freely available to download, so we cannot stop your publication if we don't support your methods and interpretation of findings; however, in being good data stewards, we would like to offer suggestions in the pre-publication stage so as to reduce conflict in published scientific literature. As far as credit, there is precedent for receiving a mention in the acknowledgements section for reading and providing feedback on the paper or, for more involved consulting, being included as an author may be warranted. The purpose of asking for this is not to inflate our number of authorships; we take ethical considerations of the best way to handle intellectual property in the form of manuscripts very seriously, and, again, sharing is at the discretion of the author although we strongly recommend it. Please be ethical and considerate in your use of this data and all open-source data and be sure to credit authors by citing them.

    An example of an analysis that we could consider problematic, and would strongly advise correcting before submission to a publication, would be using machine learning to classify Parkinson's patients from healthy controls using this dataset. This is because there are far too few patients for proper statistics. Parkinson's disease presents heterogeneously across patients, and, with a proper test-training split, there would be fewer than 8 patients in the testing set. Statistics on 8 or fewer patients for such a complicated disease would be inaccurate due to having too small a sample size. Furthermore, if multiple machine learning algorithms were to be tested, a third split would be required to choose the best method, further lowering the number of patients in the testing set. We strongly advise against using any such approach because it would mislead patients and people who are interested in knowing if they have Parkinson's disease.

    Note that UPDRS rating scales were collected by laboratory personnel who had completed online training and not by a board-certified neurologist. Results should be interpreted accordingly; in particular, analyses based largely on these ratings should be treated with an appropriate amount of uncertainty.

    In addition to contacting the aforementioned email, please cite the following papers:

    Nicko Jackson, Scott R. Cole, Bradley Voytek, Nicole C. Swann. Characteristics of Waveform Shape in Parkinson's Disease Detected with Scalp Electroencephalography. eNeuro 20 May 2019, 6 (3) ENEURO.0151-19.2019; DOI: 10.1523/ENEURO.0151-19.2019.

    Swann NC, de Hemptinne C, Aron AR, Ostrem JL, Knight RT, Starr PA. Elevated synchrony in Parkinson disease detected with electroencephalography. Ann Neurol. 2015 Nov;78(5):742-50. doi: 10.1002/ana.24507. Epub 2015 Sep 2. PMID: 26290353; PMCID: PMC4623949.

    George JS, Strunk J, Mak-McCully R, Houser M, Poizner H, Aron AR. Dopaminergic therapy in Parkinson's disease decreases cortical beta band coherence in the resting state and increases cortical beta band power during executive control. Neuroimage Clin. 2013 Aug 8;3:261-70. doi: 10.1016/j.nicl.2013.07.013. PMID: 24273711; PMCID: PMC3814961.

    Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896).

    Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8.

    Note: see this discussion on the structure of the json files that is sufficient but not optimal and will hopefully be changed in future versions of BIDS: https://neurostars.org/t/behavior-metadata-without-tsv-event-data-related-to-a-neuroimaging-data/6768/25.

  19. Vikhrmodels_Vikhr-Llama3.1-8B-Instruct-R-21-09-24-details

    • huggingface.co
    Updated Jul 30, 2025
    Open LLM Leaderboard (2025). Vikhrmodels_Vikhr-Llama3.1-8B-Instruct-R-21-09-24-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/Vikhrmodels_Vikhr-Llama3.1-8B-Instruct-R-21-09-24-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of Vikhrmodels/Vikhr-Llama3.1-8B-Instruct-R-21-09-24

    Dataset automatically created during the evaluation run of model Vikhrmodels/Vikhr-Llama3.1-8B-Instruct-R-21-09-24. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Vikhrmodels_Vikhr-Llama3.1-8B-Instruct-R-21-09-24-details.

  20. 2 yearly cubic smoothing spline of CH4 data from the WAIS-Divide ice core,...

    • doi.pangaea.de
    html, tsv
    Updated Jun 7, 2017
    Rachael H Rhodes; Joseph R McConnell; Thomas Blunier; Edward J Brook; Daniele Romanini (2017). 2 yearly cubic smoothing spline of CH4 data from the WAIS-Divide ice core, Antarctica [Dataset]. http://doi.org/10.1594/PANGAEA.875981
    Explore at:
    Available download formats: html, tsv
    Dataset updated
    Jun 7, 2017
    Dataset provided by
    PANGAEA
    Authors
    Rachael H Rhodes; Joseph R McConnell; Thomas Blunier; Edward J Brook; Daniele Romanini
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Dec 19, 2005
    Area covered
    Variables measured
    Gas age, Methane, DEPTH, ice/snow
    Description

    These data are made available as a comprehensive archive of WAIS-Divide methane measurements. In the majority of cases the 2-yearly spline fit, available to download from www.usap-dc.org (search for award # 600361), will be the most suitable for your application. The 2 yearly cubic smoothing spline fills gaps in the data set and reduces data set size, whilst also reducing noise (that could be noise due to the wider analytical system e.g., pressure fluctuations, or archival noise).

