100+ datasets found
  1. tiny-textbooks

    • huggingface.co
    Updated Jan 26, 2024
    Cite
    Nam Pham (2024). tiny-textbooks [Dataset]. http://doi.org/10.57967/hf/1126
    Explore at:
    Dataset updated
    Jan 26, 2024
    Authors
    Nam Pham
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Textbook-like Dataset: A High-Quality Resource for Small Language Models

    The idea is inspired by the Textbooks Are All You Need II: phi-1.5 technical report. The source texts in this dataset were gathered by carefully selecting the best of the falcon-refinedweb and minipile datasets, to ensure diversity and quality while remaining tiny in size. The dataset was synthesized over roughly 500 hours on 4x 3090 Ti cards using the fine-tuned Nous-Hermes-Llama2-13b model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.

  2. Small Ml Dataset

    • universe.roboflow.com
    zip
    Updated Jun 8, 2024
    Cite
    Machine Learning (2024). Small Ml Dataset [Dataset]. https://universe.roboflow.com/machine-learning-opc17/small-dataset-ml/model/6
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    Machine Learning
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Post Bounding Boxes
    Description

    Small Dataset Ml

    ## Overview
    
    Small Dataset Ml is a dataset for object detection tasks - it contains Post annotations for 571 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. Open-ert-small-dataset

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Superb Emphasis (2025). Open-ert-small-dataset [Dataset]. https://huggingface.co/datasets/SuperbEmphasis/Open-ert-small-dataset
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Superb Emphasis
    Description

    This is a subset of: https://huggingface.co/datasets/openerotica/long-roleplay-v0.1. I am using Mistral's new Devstral model to take each entire conversation in JSON format and rate it. I chose Devstral because the Mistral models are very consistent and well rounded, and I hoped Devstral could understand the JSON format a bit better. I ask the model to rate each RP on many different factors, including grammar, prose, and length (and a few others I will keep to myself :D). I then… See the full description on the dataset page: https://huggingface.co/datasets/SuperbEmphasis/Open-ert-small-dataset.

  4. Object Small Dataset

    • universe.roboflow.com
    zip
    Updated Mar 14, 2025
    + more versions
    Cite
    + (2025). Object Small Dataset [Dataset]. https://universe.roboflow.com/-gajxq/object-small/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 14, 2025
    Dataset authored and provided by
    +
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Object Small Bounding Boxes
    Description

    Object Small

    ## Overview
    
    Object Small is a dataset for object detection tasks - it contains Object Small annotations for 4,165 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. wikipedia-small-3000-embedded

    • huggingface.co
    Updated Apr 6, 2024
    Cite
    Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 6, 2024
    Authors
    Hafedh Hichri
    License

    GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/

    Description

    This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

        from datasets import load_dataset, Dataset
        from sentence_transformers import SentenceTransformer
        from tqdm import tqdm

        model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

        # load dataset in streaming mode (no download and it's fast)
        dataset = load_dataset(
            "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
        )

        # select 3000 samples
        data = Dataset.from_dict({})
        for i, entry in …

    See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
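The truncated snippet above apparently collects the first 3000 streamed entries. A rough, dependency-free sketch of that selection pattern, using `itertools.islice` over a plain iterator (the record fields and sizes below are illustrative stand-ins, not the dataset's actual schema):

```python
from itertools import islice

def take_first_n(stream, n):
    """Collect the first n records from a (possibly endless) streaming iterator."""
    return list(islice(stream, n))

# stand-in for a streamed dataset of article records
stream = ({"id": i, "text": f"article {i}"} for i in range(1_000_000))
sample = take_first_n(stream, 3000)
print(len(sample))  # 3000
```

This touches only the first 3000 records, so the rest of the stream is never materialized, which is the point of streaming mode.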

  6. Performance of ML models on test data.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    Cite
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Performance of ML models on test data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). 
Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
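The bootstrapped confidence intervals mentioned in the abstract can be illustrated with a generic percentile bootstrap over squared errors. This is only a sketch of the technique, not the study's code; the toy data, resample count, and 95% level are assumptions:

```python
import math
import random

def bootstrap_rmse_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for RMSE."""
    rng = random.Random(seed)
    sq_errs = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(sq_errs) for _ in sq_errs]  # sample with replacement
        stats.append(math.sqrt(sum(resample) / len(resample)))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# toy measured vs. predicted values (illustrative only)
y_true = [20.1, 18.4, 25.0, 30.2, 15.7, 22.3, 19.8, 27.5]
y_pred = [21.0, 17.9, 23.5, 31.0, 16.2, 21.1, 20.4, 26.0]
lower, upper = bootstrap_rmse_ci(y_true, y_pred)
print(lower <= upper)  # True
```

Reporting the 2.5th and 97.5th percentiles of the resampled RMSEs yields interval estimates of the form "RMSE 2.98 (CI: 2.16, 3.76)" quoted above.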

  7. Tank Small Dataset

    • universe.roboflow.com
    zip
    Updated Jun 13, 2025
    Cite
    test (2025). Tank Small Dataset [Dataset]. https://universe.roboflow.com/test-nbp8j/tank-small/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 13, 2025
    Dataset authored and provided by
    test
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Tank Small Bounding Boxes
    Description

    Tank Small

    ## Overview
    
    Tank Small is a dataset for object detection tasks - it contains Tank Small annotations for 3,153 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. Predictive performance of each complete model.

    • plos.figshare.com
    bin
    Updated Sep 21, 2023
    + more versions
    Cite
    Guanqi Lyu; Masaharu Nakayama (2023). Predictive performance of each complete model. [Dataset]. http://doi.org/10.1371/journal.pone.0291711.t002
    Explore at:
    Available download formats: bin
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Guanqi Lyu; Masaharu Nakayama
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The aim of this study was to develop early prediction models for respiratory failure risk in patients with severe pneumonia using four ensemble learning algorithms: LightGBM, XGBoost, CatBoost, and random forest, and to compare the predictive performance of each model. In this study, we used the eICU Collaborative Research Database (eICU-CRD) for sample extraction, built a respiratory failure risk prediction model for patients with severe pneumonia based on four ensemble learning algorithms, and developed compact models corresponding to the four complete models to improve clinical practicality. The average area under receiver operating curve (AUROC) of the models on the test sets after ten random divisions of the dataset and the average accuracy at the best threshold were used as the evaluation metrics of the model performance. Finally, feature importance and Shapley additive explanation values were introduced to improve the interpretability of the model. A total of 1676 patients with pneumonia were analyzed in this study, of whom 297 developed respiratory failure one hour after admission to the intensive care unit (ICU). Both complete and compact CatBoost models had the highest average AUROC (0.858 and 0.857, respectively). The average accuracies at the best threshold were 75.19% and 77.33%, respectively. According to the feature importance bars and summary plot of the predictor variables, activetx (indicates whether the patient received active treatment), standard deviation of prothrombin time-international normalized ratio, Glasgow Coma Scale verbal score, age, and minimum oxygen saturation and respiratory rate were important. Compared with other ensemble learning models, the complete and compact CatBoost models have significantly higher average area under the curve values on the 10 randomly divided test sets. 
Additionally, the standard deviation (SD) of the compact CatBoost model is relatively small (SD:0.050), indicating that the performance of the compact CatBoost model is stable among these four ensemble learning models. The machine learning predictive models built in this study will help in early prediction and intervention of respiratory failure risk in patients with pneumonia in the ICU.
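The "accuracy at the best threshold" metric used in this study can be sketched generically: sweep every candidate threshold over the predicted scores and keep the one that maximizes accuracy. The labels and scores below are toy values, not from the study:

```python
def best_threshold_accuracy(y_true, scores):
    """Sweep candidate thresholds; return (best_threshold, best_accuracy)."""
    best_thr, best_acc = None, 0.0
    for thr in sorted(set(scores)):
        preds = [1 if s >= thr else 0 for s in scores]
        acc = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

# toy binary labels and model scores
y = [0, 0, 1, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.9]
thr, acc = best_threshold_accuracy(y, s)
print(thr, acc)  # 0.35 0.8
```

Restricting candidates to the observed scores is sufficient, since accuracy only changes when the threshold crosses a score.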

  9. Model Zoo: A Dataset of Diverse Populations of Resnet-18 Models - Tiny...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Aug 28, 2022
    + more versions
    Cite
    Konstantin Schürholt; Diyar Taskiran; Boris Knyazev; Xavier Giró-i-Nieto; Damian Borth (2022). Model Zoo: A Dataset of Diverse Populations of Resnet-18 Models - Tiny ImageNet [Dataset]. http://doi.org/10.5281/zenodo.7023278
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 28, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Konstantin Schürholt; Diyar Taskiran; Boris Knyazev; Xavier Giró-i-Nieto; Damian Borth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    In recent years, neural networks have evolved from laboratory environments to the state of the art for many real-world problems. Our hypothesis is that neural network models (i.e., their weights and biases) evolve on unique, smooth trajectories in weight space during training. It follows that a population of such neural network models (referred to as a “model zoo”) would form topological structures in weight space. We think that the geometry, curvature, and smoothness of these structures contain information about the state of training and can reveal latent properties of individual models. With such zoos, one could (i) investigate novel approaches for model analysis, (ii) discover unknown learning dynamics, (iii) learn rich representations of such populations, or (iv) exploit the model zoos for generative modelling of neural network weights and biases. Unfortunately, the lack of standardized model zoos and available benchmarks significantly increases the friction for further research on populations of neural networks. With this work, we publish a novel dataset of model zoos containing systematically generated and diverse populations of neural network models for further research. In total, the proposed model zoo dataset is based on six image datasets, consists of 27 model zoos generated with varying hyperparameter combinations, and includes 50,360 unique neural network models resulting in over 2,585,360 collected model states. In addition to the model zoo data, we provide an in-depth analysis of the zoos and benchmarks for multiple downstream tasks as mentioned before.

    Dataset

    This dataset is part of a larger collection of model zoos and contains the zoo of 1000 ResNet18 models trained on Tiny Imagenet. All zoos with extensive information and code can be found at www.modelzoos.cc.

    The complete zoo is 2.6TB large. Due to the size, this repository contains the checkpoints of the first 115 models at their last epoch 60. For a link to the full dataset as well as more information on the zoos and code to access and use the zoos, please see www.modelzoos.cc.

  10. Current Small Fll Dataset

    • universe.roboflow.com
    zip
    Updated Nov 6, 2024
    Cite
    Corals Collab (2024). Current Small Fll Dataset [Dataset]. https://universe.roboflow.com/corals-collab/current-small-dataset-fll/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    Corals Collab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Corals Broccoli Cotton I11Z Bounding Boxes
    Description

    FLL training for the FIXIKI team. This is for our innovation project; hope you find it useful!

  11. Plate Small Dataset

    • universe.roboflow.com
    zip
    Updated May 31, 2025
    Cite
    Unibg (2025). Plate Small Dataset [Dataset]. https://universe.roboflow.com/unibg/plate-small/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2025
    Dataset authored and provided by
    Unibg
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Plates Bounding Boxes
    Description

    Plate Small

    ## Overview
    
    Plate Small is a dataset for object detection tasks - it contains Plates annotations for 300 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC0 1.0 Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
  12. Data from: Domain-specific neural networks improve automated bird sound...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 28, 2022
    Cite
    Patrik Lauha; Panu Somervuo; Petteri Lehikoinen; Lisa Geres; Tobias Richter; Sebastian Seibold; Otso Ovaskainen (2022). Domain-specific neural networks improve automated bird sound recognition already with small amount of local data [Dataset]. http://doi.org/10.5061/dryad.2bvq83btd
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Goethe University Frankfurt
    Technical University of Munich
    University of Helsinki
    University of Jyväskylä
    Authors
    Patrik Lauha; Panu Somervuo; Petteri Lehikoinen; Lisa Geres; Tobias Richter; Sebastian Seibold; Otso Ovaskainen
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    An automatic bird sound recognition system is a useful tool for collecting data on different bird species for ecological analysis. Together with autonomous recording units (ARUs), such a system makes it possible to collect bird observations on a scale that no human observer could ever match. Over the last decades, progress has been made in the field of automatic bird sound recognition, but recognizing bird species from untargeted soundscape recordings remains a challenge. In this article we demonstrate the workflow for building a global identification model and adjusting it to perform well on data from autonomous recorders in a specific region. We show how data augmentation and a combination of global and local data can be used to train a convolutional neural network to classify vocalizations of 101 bird species. We construct a model and train it with a global data set to obtain a base model. The base model is then fine-tuned with local data from Southern Finland in order to adapt it to the sound environment of a specific location, and tested with two data sets: one originating from the same Southern Finnish region and another originating from a different region in the German Alps. Our results suggest that fine-tuning with local data significantly improves network performance. Classification accuracy was improved for test recordings from the same area as the local training data (Southern Finland) but not for recordings from a different region (German Alps). Data augmentation enables training with a limited amount of training data, and even with few local data samples a significant improvement over the base model can be achieved. Our model outperforms the current state-of-the-art tool for automatic bird sound classification. Using local data to adjust the recognition model for the target domain leads to improvement over general, non-tailored solutions.
    The process introduced in this article can be applied to build a fine-tuned bird sound classification model for a specific environment. Methods: This repository contains the data and recognition models described in the paper Domain-specific neural networks improve automated bird sound recognition already with small amount of local data (Lauha et al., 2022).

  13. Data from: Challenges with Literature-Derived Data in Machine Learning for...

    • acs.figshare.com
    xlsx
    Updated Nov 20, 2024
    Cite
    Dong-Zhi Li; Xue-Qing Gong (2024). Challenges with Literature-Derived Data in Machine Learning for Yield Prediction: A Case Study on Pd-Catalyzed Carbonylation Reactions [Dataset]. http://doi.org/10.1021/acs.jpca.4c05489.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    ACS Publications
    Authors
    Dong-Zhi Li; Xue-Qing Gong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The application of machine learning (ML) to predict reaction yields has shown remarkable accuracy when based on high-throughput computational and experimental data. However, the accuracy significantly diminishes when leveraging literature-derived data, highlighting a gap in the predictive capability of the current ML models. This study, focusing on Pd-catalyzed carbonylation reactions, reveals that even with a data set of 2512 reactions, the best-performing model reaches only an R2 of 0.51. Further investigations show that the models’ effectiveness is predominantly confined to predictions within narrow subsets of data, closely related and from the same literature sources, rather than across the broader, heterogeneous data sets available in the literature. The reliance on data similarity, coupled with small sample sizes from the same sources, makes the model highly sensitive to inherent fluctuations typical of small data sets, adversely impacting stability, accuracy, and generalizability. The findings underscore the inherent limitations of current ML techniques in leveraging literature-derived data for predicting chemical reaction yields, highlighting the need for more sophisticated approaches to handle the complexity and diversity of chemical data.
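The source-dependence described above, where models predict well only within subsets drawn from the same literature source, is commonly probed with a grouped, leave-one-source-out evaluation: train on reactions from all but one publication and test on the held-out one. A minimal split helper as a generic sketch; the record fields (`source`, `yield`) are illustrative, not the study's actual schema:

```python
def leave_one_source_out(records, key="source"):
    """Yield (held_out, train, test) splits, holding out one literature source at a time."""
    sources = sorted({r[key] for r in records})
    for held_out in sources:
        train = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        yield held_out, train, test

# toy literature-derived records
records = [
    {"source": "paperA", "yield": 81},
    {"source": "paperA", "yield": 75},
    {"source": "paperB", "yield": 42},
    {"source": "paperC", "yield": 63},
]
splits = list(leave_one_source_out(records))
print(len(splits))  # 3
```

A large gap between random-split and leave-one-source-out scores is one way to quantify how much a model relies on within-source similarity.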

  14. Data from: Machine Learning-Assisted QSAR Models on Contaminant Reactivity...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    Cite
    Shifa Zhong; Yanping Zhang; Huichun Zhang (2023). Machine Learning-Assisted QSAR Models on Contaminant Reactivity Toward Four Oxidants: Combining Small Data Sets and Knowledge Transfer [Dataset]. http://doi.org/10.1021/acs.est.1c04883.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Shifa Zhong; Yanping Zhang; Huichun Zhang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    To develop predictive models for the reactivity of organic contaminants toward four oxidants (SO4•–, HClO, O3, and ClO2), all with small sample sizes, we proposed two approaches: combining small data sets and transferring knowledge between them. We first merged these data sets and developed a unified model using machine learning (ML), which showed better predictive performance than the individual models for HClO (RMSEtest: 2.1 to 2.04), O3 (2.06 to 1.94), ClO2 (1.77 to 1.49), and SO4•– (0.75 to 0.70) because the model “corrected” the wrongly learned effects of several atom groups. We further developed knowledge transfer models for three pairs of the data sets and observed different predictive performances: improved for O3 (RMSEtest: 2.06 to 2.01)/HClO (2.10 to 1.98), mixed for O3 (2.06 to 2.01)/ClO2 (1.77 to 1.95), and unchanged for ClO2 (1.77 to 1.77)/HClO (2.1 to 2.1). The effectiveness of the latter approach depended on whether there was consistent knowledge shared between the data sets and on the performance of the individual models. We also compared our approaches with multitask learning and image-based transfer learning and found that our approaches consistently improved the predictive performance for all data sets while the other two did not. This study demonstrated the effectiveness of combining small, similar data sets and transferring knowledge between them to improve ML model performance.

  15. STOOKE SMALL BODY SHAPE MODELS V1.0

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Aug 23, 2025
    + more versions
    Cite
    National Aeronautics and Space Administration (2025). STOOKE SMALL BODY SHAPE MODELS V1.0 [Dataset]. https://catalog.data.gov/dataset/stooke-small-body-shape-models-v1-0-99c21
    Explore at:
    Dataset updated
    Aug 23, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    Optical shape models of 10 planetary moons and asteroids, derived from spacecraft imaging by Philip Stooke.

  16. Model predictions of biological condition for small streams in the...

    • data.usgs.gov
    • +1more
    Updated Jul 13, 2022
    + more versions
    Cite
    Kelly Maloney; Kevin Krause (2022). Model predictions of biological condition for small streams in the Chesapeake Bay Watershed, USA [Dataset]. http://doi.org/10.5066/P9YKRPO1
    Explore at:
    Dataset updated
    Jul 13, 2022
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    Kelly Maloney; Kevin Krause
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    1999 - 2019
    Area covered
    Chesapeake Bay, United States
    Description

    This data release contains predictions of stream biological condition as defined by the Chesapeake basin-wide index of biotic integrity for stream macroinvertebrates (Chessie BIBI) using Random Forest models with landscape data for small streams (≤ 200 km2 in upstream drainage) across the Chesapeake Bay Watershed (CBW). Predictions were made at eight time periods (2001, 2004, 2006, 2008, 2011, 2013, 2016, and 2019) according to changes in landcover using the National Land Cover Database (NLCD). The Chessie BIBI data used were provided by the Interstate Commission on the Potomac River Basin. Uncertainty was calculated using model prediction intervals. For complete data descriptions and data interpretation see associated publication (Maloney et al., 2022).
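The prediction intervals used above for uncertainty can be sketched generically for an ensemble model such as a Random Forest: take the spread of the individual members' predictions and report an empirical quantile interval. This is a sketch of the general idea, not the study's method; the per-tree predictions and 90% level are illustrative assumptions:

```python
def ensemble_prediction_interval(per_model_preds, alpha=0.10):
    """Empirical (1 - alpha) interval from an ensemble's individual predictions."""
    preds = sorted(per_model_preds)
    n = len(preds)
    lower = preds[int(n * alpha / 2)]
    upper = preds[min(int(n * (1 - alpha / 2)), n - 1)]
    return lower, upper

# toy per-tree predictions of biological condition for one stream reach
tree_preds = [0.42, 0.47, 0.40, 0.55, 0.51, 0.46, 0.44, 0.49, 0.53, 0.45]
lower, upper = ensemble_prediction_interval(tree_preds)
print(lower, upper)  # 0.4 0.55
```

A wide interval flags reaches where the ensemble members disagree, i.e. where the prediction is least certain.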

  17. Fault and Severity Diagnosis using Deep Learning for Self-Organizing...

    • dataverse.lib.nycu.edu.tw
    bin, csv +4
    Updated Feb 6, 2025
    Cite
    NYCU Dataverse (2025). Fault and Severity Diagnosis using Deep Learning for Self-Organizing Networks with Imbalanced and Small Datasets [Dataset]. http://doi.org/10.57770/INXEBG
    Explore at:
    Available download formats: csv, tsv, txt, bin, and text/x-python (many individual files of varying sizes)
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    NYCU Dataverse
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    With the growing complexity of wireless networks, manual network management becomes infeasible. To address this, self-organizing networks (SONs) have been introduced to automate network management. Developing effective self-organizing approaches often depends on data-driven or learning-based methods, which require well-structured and balanced datasets. In practice, however, datasets are often imbalanced or very small. To address this issue from the fault-diagnosis aspect of SONs, this paper investigates learning-based fault and severity diagnosis under imbalanced and small datasets for wireless networks. We first propose a deep-learning-based diagnosis framework in which diagnosis is cast as a regression problem. Several approaches that address imbalance in regression problems, including re-weighting, distribution smoothing, and balanced MSE, are then examined for the diagnosis task. Subsequently, to handle settings where only a few diagnosis samples are available, model pre-training and meta-learning-based approaches are used to quickly adapt the pre-trained diagnosis model to the target scenarios. Extensive simulations based on realistic setups are conducted to evaluate the proposed approaches. Results show that they effectively diagnose faults and their severity and outperform baseline approaches under imbalanced and small datasets.
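The re-weighting idea mentioned in the description can be sketched as follows — a minimal NumPy illustration of inverse-frequency weighting for imbalanced regression, not the paper's implementation; the binning scheme and the toy severity labels are assumptions:

```python
import numpy as np

def inverse_frequency_weights(y, n_bins=10):
    """Per-sample weights inversely proportional to the frequency of each
    sample's target bin -- a common re-weighting scheme for imbalanced
    regression. Weights are normalized to average to 1."""
    bins = np.linspace(y.min(), y.max(), n_bins + 1)
    # np.digitize assigns y.max() to bin n_bins; clip it back into range
    idx = np.clip(np.digitize(y, bins) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins).astype(float)
    w = 1.0 / counts[idx]          # rare bins -> large weights
    return w * len(y) / w.sum()

def weighted_mse(y_true, y_pred, w):
    """MSE with per-sample weights, so rare targets count more."""
    return np.mean(w * (y_true - y_pred) ** 2)

rng = np.random.default_rng(0)
# Imbalanced severity labels: many mild faults, few severe ones
y = np.concatenate([rng.uniform(0.0, 0.3, 90), rng.uniform(0.7, 1.0, 10)])
w = inverse_frequency_weights(y, n_bins=5)
# Rare (severe) samples receive larger weights than common (mild) ones
print(w[:90].mean() < w[90:].mean())  # True
```

A loss such as `weighted_mse` would then replace the plain MSE when training the regression model, so the few severe-fault samples are not drowned out by the many mild ones.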

  18. f

    Source code in the R programming language, belonging with: Model based...

    • datasetcatalog.nlm.nih.gov
    • data.4tu.nl
    Updated Oct 28, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Steinbuch, L.; Brus, D. J.; Orton, T. G. (2019). Source code in the R programming language, belonging with: Model based geostatistics from a Bayesian perspective: Investigating area‐to‐point kriging with small datasets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000138397
    Explore at:
    Dataset updated
    Oct 28, 2019
    Authors
    Steinbuch, L.; Brus, D. J.; Orton, T. G.
    Description

    Area-to-point kriging (ATPK) is a geostatistical method for creating high-resolution maps from much lower-resolution data. These R scripts compare prediction uncertainty across different ATPK methods, using both simulations and a real-world case concerning crop yields in Burkina Faso.
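The area-to-point idea can be sketched in a few lines — a 1-D NumPy toy with simple kriging and an assumed exponential covariance, not the authors' R code; the block layout and areal values are made up for illustration:

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=10.0):
    """Exponential covariance between locations separated by distance h."""
    return sill * np.exp(-np.abs(h) / corr_range)

# 1-D toy: 4 blocks of 5 points each; only the block means are observed
pts = np.arange(20.0)                            # fine-resolution target points
blocks = [np.arange(i, i + 5.0) for i in range(0, 20, 5)]
block_means = np.array([0.2, 0.5, 1.4, 1.1])     # hypothetical areal data

# Point-to-point covariance matrix
C_pp = exp_cov(pts[:, None] - pts[None, :])
# Block-to-point covariances: average point covariances over each block
C_bp = np.array([C_pp[np.isin(pts, b)].mean(axis=0) for b in blocks])
# Block-to-block covariances: average over both blocks
C_bb = np.array([[C_pp[np.ix_(np.isin(pts, a), np.isin(pts, b))].mean()
                  for b in blocks] for a in blocks])

# Simple kriging with known mean: weights = C_bb^{-1} C_bp
mean = block_means.mean()
weights = np.linalg.solve(C_bb, C_bp)            # shape (4, 20)
pred = mean + weights.T @ (block_means - mean)   # point predictions, shape (20,)

# Coherence: block averages of the point predictions reproduce the areal data
print(np.allclose(pred.reshape(4, 5).mean(axis=1), block_means))  # True
```

The final check illustrates the coherence property of ATPK: because block covariances are averages of point covariances, the predicted points average back exactly to the observed block values.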

  19. f

    Data from: Averaging Strategy for Interpretable Machine Learning on Small...

    • acs.figshare.com
    bin
    Updated Aug 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hengjie Yu; Shiyu Tang; Sam Fong Yau Li; Fang Cheng (2023). Averaging Strategy for Interpretable Machine Learning on Small Datasets to Understand Element Uptake after Seed Nanotreatment [Dataset]. http://doi.org/10.1021/acs.est.3c01878.s002
    Explore at:
    binAvailable download formats
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    ACS Publications
    Authors
    Hengjie Yu; Shiyu Tang; Sam Fong Yau Li; Fang Cheng
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Understanding plant uptake and translocation of nanomaterials is crucial for ensuring the successful and sustainable applications of seed nanotreatment. Here, we collect a dataset with 280 instances from experiments for predicting the relative metal/metalloid concentration (RMC) in maize seedlings after seed priming by various metal and metalloid oxide nanoparticles. To obtain unbiased predictions and explanations on small datasets, we present an averaging strategy and add a dimension for interpretable machine learning. The findings in post-hoc interpretations of sophisticated LightGBM models demonstrate that solubility is highly correlated with model performance. Surface area, concentration, zeta potential, and hydrodynamic diameter of nanoparticles and seedling part and relative weight of plants are dominant factors affecting RMC, and their effects and interactions are explained. Furthermore, self-interpretable models using the RuleFit algorithm are established to successfully predict RMC only based on six important features identified by post-hoc explanations. We then develop a visualization tool called RuleGrid to depict feature effects and interactions in numerous generated rules. Consistent parameter-RMC relationships are obtained by different methods. This study offers a promising interpretable data-driven approach to expand the knowledge of nanoparticle fate in plants and may profoundly contribute to the safety-by-design of nanomaterials in agricultural and environmental applications.
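The averaging strategy described above — repeating model fits to stabilize explanations on a small dataset — can be sketched with a NumPy-only stand-in; the linear model, bootstrap scheme, and synthetic data are assumptions, not the paper's LightGBM pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical small dataset: 40 samples, 4 nanoparticle features;
# only the first two features actually drive the response (RMC stand-in)
X = rng.normal(size=(40, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=40)

def averaged_importance(X, y, n_repeats=200):
    """Fit a simple model on many bootstrap resamples and average the
    absolute coefficients. Averaging reduces the variance of importance
    estimates that a single fit on a small dataset would give."""
    imps = []
    for _ in range(n_repeats):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        imps.append(np.abs(coef))
    return np.mean(imps, axis=0)

imp = averaged_importance(X, y)
print(imp.argmax())  # 0 -- feature 0 dominates, feature 1 second
```

The same averaging idea applies unchanged when the base learner is a gradient-boosting model and the importances come from post-hoc explanations rather than coefficients.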

  20. h

    tiny-imagenet

    • huggingface.co
    • datasets.activeloop.ai
    Updated Aug 12, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hao Zheng (2022). tiny-imagenet [Dataset]. https://huggingface.co/datasets/zh-plus/tiny-imagenet
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 12, 2022
    Authors
    Hao Zheng
    License

    https://choosealicense.com/licenses/undefined/

    Description

    Dataset Card for tiny-imagenet

      Dataset Summary
    

    Tiny ImageNet contains 100,000 images spanning 200 classes, downsized to 64×64 color images. Each class has 500 training images, 50 validation images, and 50 test images.

      Languages
    

    The class labels in the dataset are in English.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    { 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.
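The instance schema above can be mimicked with a stand-in sample — a sketch only; the all-zero array is obviously not a real Tiny ImageNet image:

```python
import numpy as np
from PIL import Image

# A stand-in 64x64 RGB image matching the dataset's instance schema
arr = np.zeros((64, 64, 3), dtype=np.uint8)
example = {"image": Image.fromarray(arr), "label": 15}

print(example["image"].size, example["image"].mode)  # (64, 64) RGB
```

In practice the dataset itself can be loaded with the Hugging Face `datasets` library, e.g. `load_dataset("zh-plus/tiny-imagenet")`, which yields instances of exactly this shape.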
