100+ datasets found

tiny-textbooks
huggingface.co
Updated Jan 26, 2024
Cite
Nam Pham (2024). tiny-textbooks [Dataset]. http://doi.org/10.57967/hf/1126
Explore at:
Unique identifier
https://doi.org/10.57967/hf/1126
Dataset updated
Jan 26, 2024
Authors
Nam Pham
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Textbook-like Dataset: A High-Quality Resource for Small Language Models

The idea is inspired by the paper Textbooks Are All You Need II: phi-1.5 technical report. The source texts in this dataset were gathered by carefully selecting the best of the falcon-refinedweb and minipile datasets, to ensure diversity and quality while keeping the dataset tiny in size. The dataset was synthesized using 4x3090 Ti cards over a period of 500 hours, thanks to the Nous-Hermes-Llama2-13b finetuned model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.
Small Ml Dataset
universe.roboflow.com
zip
Updated Jun 8, 2024
Cite
Machine Learning (2024). Small Ml Dataset [Dataset]. https://universe.roboflow.com/machine-learning-opc17/small-dataset-ml/model/6
Explore at:
Available download formats: zip
Dataset updated
Jun 8, 2024
Dataset authored and provided by
Machine Learning
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Post Bounding Boxes
Description
Small Dataset Ml

## Overview
Small Dataset Ml is a dataset for object detection tasks. It contains Post annotations for 571 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Open-ert-small-dataset
huggingface.co
Updated Jun 1, 2025
Cite
Superb Emphasis (2025). Open-ert-small-dataset [Dataset]. https://huggingface.co/datasets/SuperbEmphasis/Open-ert-small-dataset
Explore at:
Dataset updated
Jun 1, 2025
Authors
Superb Emphasis
Description
This is a subset of: https://huggingface.co/datasets/openerotica/long-roleplay-v0.1
I am using Mistral's new Devstral model to take the entire conversation in JSON format and rate it. I chose Devstral because the Mistral models are very consistent and well rounded, and I was hoping Devstral could understand the JSON format a bit better. I ask the model to rate each RP based on many different factors, including grammar, prose, and length (and a few others I will keep to myself :D). I then… See the full description on the dataset page: https://huggingface.co/datasets/SuperbEmphasis/Open-ert-small-dataset.
Object Small Dataset
universe.roboflow.com
zip
Updated Mar 14, 2025
+ more versions
Cite
+ (2025). Object Small Dataset [Dataset]. https://universe.roboflow.com/-gajxq/object-small/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Mar 14, 2025
Dataset authored and provided by
+
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Object Small Bounding Boxes
Description
Object Small

## Overview
Object Small is a dataset for object detection tasks. It contains Object Small annotations for 4,165 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
wikipedia-small-3000-embedded
huggingface.co
Updated Apr 6, 2024
Cite
Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 6, 2024
Authors
Hafedh Hichri
License
https://choosealicense.com/licenses/gfdl/
Description
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# load dataset in streaming mode (no download and it's fast)
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# select 3000 samples
from tqdm import tqdm
data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
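The listing cuts the original loop off, so the author's actual selection code is not shown here. As a rough, library-free sketch of the general pattern (take the first N records from a stream and attach an embedding to each), something like the following could work. The names `take_and_embed`, `toy_embed`, and `fake_stream` are hypothetical stand-ins, not part of the dataset's code; a real run would pass a streamed dataset split and a sentence encoder instead.

```python
from __future__ import annotations

from itertools import islice
from typing import Callable, Iterable, Iterator


def take_and_embed(
    stream: Iterable[dict],
    n: int,
    embed: Callable[[str], list[float]],
    text_key: str = "text",
) -> Iterator[dict]:
    """Yield the first n records of a stream, each with an added 'embedding' field."""
    for entry in islice(stream, n):
        # Build a new dict so the caller's record is not mutated.
        yield {**entry, "embedding": embed(entry[text_key])}


def toy_embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real encoder: [character count, word count].
    return [float(len(text)), float(len(text.split()))]


def fake_stream() -> Iterator[dict]:
    # An endless generator stands in for a streamed dataset split.
    i = 0
    while True:
        yield {"id": i, "text": f"article number {i}"}
        i += 1


rows = list(take_and_embed(fake_stream(), 3, toy_embed))
```

Because `islice` stops after `n` items, the stream is never fully consumed, which is the point of selecting a small sample in streaming mode.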
Performance of ML models on test data.
plos.figshare.com
xls
Updated Oct 31, 2023
Cite
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Performance of ML models on test data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pgph.0002475.t005
Dataset updated
Oct 31, 2023
Dataset provided by
PLOS Global Public Health
Authors
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). 
Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
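The description reports RMSE values with bootstrapped confidence intervals but does not say which bootstrap variant was used. A minimal sketch of one common choice, the percentile bootstrap over (observed, predicted) pairs, is shown below. The function names and all numbers are illustrative assumptions, not the study's code or data.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error of paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample indices with replacement so each (true, predicted) pair stays together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up values for illustration only (not taken from the study).
y_true = [18.0, 25.0, 30.0, 22.0, 15.0, 28.0, 20.0, 24.0]
y_pred = [20.0, 23.0, 27.0, 22.0, 19.0, 30.0, 18.0, 25.0]
point = rmse(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, rmse)
```

With only a few dozen samples, as in the study, such intervals tend to be wide, which is consistent with the spread of the reported CIs.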
Tank Small Dataset
universe.roboflow.com
zip
Updated Jun 13, 2025
Cite
test (2025). Tank Small Dataset [Dataset]. https://universe.roboflow.com/test-nbp8j/tank-small/model/1
Explore at:
Available download formats: zip
Dataset updated
Jun 13, 2025
Dataset authored and provided by
test
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Tank Small Bounding Boxes
Description
Tank Small

## Overview
Tank Small is a dataset for object detection tasks. It contains Tank Small annotations for 3,153 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Predictive performance of each complete model.
plos.figshare.com
bin
Updated Sep 21, 2023
+ more versions
Cite
Guanqi Lyu; Masaharu Nakayama (2023). Predictive performance of each complete model. [Dataset]. http://doi.org/10.1371/journal.pone.0291711.t002
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0291711.t002
Dataset updated
Sep 21, 2023
Dataset provided by
PLOS ONE
Authors
Guanqi Lyu; Masaharu Nakayama
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The aim of this study was to develop early prediction models for respiratory failure risk in patients with severe pneumonia using four ensemble learning algorithms: LightGBM, XGBoost, CatBoost, and random forest, and to compare the predictive performance of each model. In this study, we used the eICU Collaborative Research Database (eICU-CRD) for sample extraction, built a respiratory failure risk prediction model for patients with severe pneumonia based on four ensemble learning algorithms, and developed compact models corresponding to the four complete models to improve clinical practicality. The average area under receiver operating curve (AUROC) of the models on the test sets after ten random divisions of the dataset and the average accuracy at the best threshold were used as the evaluation metrics of the model performance. Finally, feature importance and Shapley additive explanation values were introduced to improve the interpretability of the model. A total of 1676 patients with pneumonia were analyzed in this study, of whom 297 developed respiratory failure one hour after admission to the intensive care unit (ICU). Both complete and compact CatBoost models had the highest average AUROC (0.858 and 0.857, respectively). The average accuracies at the best threshold were 75.19% and 77.33%, respectively. According to the feature importance bars and summary plot of the predictor variables, activetx (indicates whether the patient received active treatment), standard deviation of prothrombin time-international normalized ratio, Glasgow Coma Scale verbal score, age, and minimum oxygen saturation and respiratory rate were important. Compared with other ensemble learning models, the complete and compact CatBoost models have significantly higher average area under the curve values on the 10 randomly divided test sets. 
Additionally, the standard deviation (SD) of the compact CatBoost model is relatively small (SD:0.050), indicating that the performance of the compact CatBoost model is stable among these four ensemble learning models. The machine learning predictive models built in this study will help in early prediction and intervention of respiratory failure risk in patients with pneumonia in the ICU.
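The evaluation described above averages AUROC over ten random divisions of the data and reports its standard deviation. As a small illustration of that bookkeeping, the sketch below computes AUROC through its rank interpretation (the probability that a positive case scores higher than a negative one) and then summarizes it across splits. The helper names and the two toy splits are assumptions made for the example; they are not the study's code or data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_and_sd(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var ** 0.5


# Two toy test splits as (labels, predicted scores); values are illustrative only.
splits = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]),
    ([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.1]),
]
per_split = [auroc(y, s) for y, s in splits]
mean_auc, sd_auc = mean_and_sd(per_split)
```

The pairwise formulation is quadratic in the number of samples, which is fine for a sketch; a library implementation would normally be used on real data.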
Model Zoo: A Dataset of Diverse Populations of Resnet-18 Models - Tiny...
zenodo.org
data.niaid.nih.gov
zip
Updated Aug 28, 2022
+ more versions
Cite
Konstantin Schürholt; Diyar Taskiran; Boris Knyazev; Xavier Giró-i-Nieto; Damian Borth (2022). Model Zoo: A Dataset of Diverse Populations of Resnet-18 Models - Tiny ImageNet [Dataset]. http://doi.org/10.5281/zenodo.7023278
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7023278
Dataset updated
Aug 28, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Konstantin Schürholt; Diyar Taskiran; Boris Knyazev; Xavier Giró-i-Nieto; Damian Borth
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

In recent years, neural networks have evolved from laboratory environments to the state of the art for many real-world problems. Our hypothesis is that neural network models (i.e., their weights and biases) evolve on unique, smooth trajectories in weight space during training. It follows that a population of such neural network models (referred to as a “model zoo”) would form topological structures in weight space. We think that the geometry, curvature and smoothness of these structures contain information about the state of training and can reveal latent properties of individual models. With such zoos, one could investigate novel approaches to (i) analyse models, (ii) discover unknown learning dynamics, (iii) learn rich representations of such populations, or (iv) exploit the model zoos for generative modelling of neural network weights and biases. Unfortunately, the lack of standardized model zoos and available benchmarks significantly increases the friction for further research on populations of neural networks. With this work, we publish a novel dataset of model zoos containing systematically generated and diverse populations of neural network models for further research. In total, the proposed model zoo dataset is based on six image datasets, consists of 27 model zoos generated with varying hyperparameter combinations, and includes 50’360 unique neural network models, resulting in over 2’585’360 collected model states. In addition to the model zoo data, we provide an in-depth analysis of the zoos and benchmarks for multiple downstream tasks as mentioned before.

Dataset

This dataset is part of a larger collection of model zoos and contains the zoo of 1000 ResNet18 models trained on Tiny Imagenet. All zoos with extensive information and code can be found at www.modelzoos.cc.

The complete zoo is 2.6 TB in size. Because of this, the repository contains only the checkpoints of the first 115 models at their final epoch (epoch 60). For a link to the full dataset as well as more information on the zoos and code to access and use the zoos, please see www.modelzoos.cc.
Current Small Fll Dataset
universe.roboflow.com
zip
Updated Nov 6, 2024
Cite
Corals Collab (2024). Current Small Fll Dataset [Dataset]. https://universe.roboflow.com/corals-collab/current-small-dataset-fll/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Nov 6, 2024
Dataset authored and provided by
Corals Collab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Corals Broccoli Cotton I11Z Bounding Boxes
Description
FLL training for FIXIKI team...this is for our innovation project, hope you find this useful!!!!!!
Plate Small Dataset
universe.roboflow.com
zip
Updated May 31, 2025
Cite
Unibg (2025). Plate Small Dataset [Dataset]. https://universe.roboflow.com/unibg/plate-small/model/1
Explore at:
Available download formats: zip
Dataset updated
May 31, 2025
Dataset authored and provided by
Unibg
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Plates Bounding Boxes
Description
Plate Small

## Overview
Plate Small is a dataset for object detection tasks. It contains Plates annotations for 300 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).
Data from: Domain-specific neural networks improve automated bird sound...
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 28, 2022
Cite
Patrik Lauha; Panu Somervuo; Petteri Lehikoinen; Lisa Geres; Tobias Richter; Sebastian Seibold; Otso Ovaskainen (2022). Domain-specific neural networks improve automated bird sound recognition already with small amount of local data [Dataset]. http://doi.org/10.5061/dryad.2bvq83btd
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2bvq83btd
Dataset updated
Sep 28, 2022
Dataset provided by
Goethe University Frankfurt
Technical University of Munich
University of Helsinki
University of Jyväskylä
Authors
Patrik Lauha; Panu Somervuo; Petteri Lehikoinen; Lisa Geres; Tobias Richter; Sebastian Seibold; Otso Ovaskainen
License
https://spdx.org/licenses/CC0-1.0.html
Description
An automatic bird sound recognition system is a useful tool for collecting data of different bird species for ecological analysis. Together with autonomous recording units (ARUs), such a system provides a possibility to collect bird observations on a scale that no human observer could ever match. During the last decades progress has been made in the field of automatic bird sound recognition, but recognizing bird species from untargeted soundscape recordings remains a challenge. In this article we demonstrate the workflow for building a global identification model and adjusting it to perform well on the data of autonomous recorders from a specific region. We show how data augmentation and a combination of global and local data can be used to train a convolutional neural network to classify vocalizations of 101 bird species. We construct a model and train it with a global data set to obtain a base model. The base model is then fine-tuned with local data from Southern Finland in order to adapt it to the sound environment of a specific location and tested with two data sets: one originating from the same Southern Finnish region and another originating from a different region in German Alps. Our results suggest that fine-tuning with local data significantly improves the network performance. Classification accuracy was improved for test recordings from the same area as the local training data (Southern Finland) but not for recordings from a different region (German Alps). Data augmentation enables training with a limited number of training data and even with few local data samples significant improvement over the base model can be achieved. Our model outperforms the current state-of-the-art tool for automatic bird sound classification. Using local data to adjust the recognition model for the target domain leads to improvement over general non-tailored solutions. 
The process introduced in this article can be applied to build a fine-tuned bird sound classification model for a specific environment.

Methods

This repository contains data and recognition models described in the paper "Domain-specific neural networks improve automated bird sound recognition already with small amount of local data" (Lauha et al., 2022).
Data from: Challenges with Literature-Derived Data in Machine Learning for...
acs.figshare.com
xlsx
Updated Nov 20, 2024
Cite
Dong-Zhi Li; Xue-Qing Gong (2024). Challenges with Literature-Derived Data in Machine Learning for Yield Prediction: A Case Study on Pd-Catalyzed Carbonylation Reactions [Dataset]. http://doi.org/10.1021/acs.jpca.4c05489.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jpca.4c05489.s002
Dataset updated
Nov 20, 2024
Dataset provided by
ACS Publications
Authors
Dong-Zhi Li; Xue-Qing Gong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The application of machine learning (ML) to predict reaction yields has shown remarkable accuracy when based on high-throughput computational and experimental data. However, the accuracy significantly diminishes when leveraging literature-derived data, highlighting a gap in the predictive capability of the current ML models. This study, focusing on Pd-catalyzed carbonylation reactions, reveals that even with a data set of 2512 reactions, the best-performing model reaches only an R² of 0.51. Further investigations show that the models’ effectiveness is predominantly confined to predictions within narrow subsets of data, closely related and from the same literature sources, rather than across the broader, heterogeneous data sets available in the literature. The reliance on data similarity, coupled with small sample sizes from the same sources, makes the model highly sensitive to inherent fluctuations typical of small data sets, adversely impacting stability, accuracy, and generalizability. The findings underscore the inherent limitations of current ML techniques in leveraging literature-derived data for predicting chemical reaction yields, highlighting the need for more sophisticated approaches to handle the complexity and diversity of chemical data.
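For readers less familiar with the metric quoted above, the coefficient of determination compares a model's squared error with that of always predicting the mean yield. The sketch below uses the standard definition with made-up yields; it is only an illustration and is not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # mean-predictor error
    return 1.0 - ss_res / ss_tot


# Hypothetical reaction yields in percent (illustrative values only).
observed = [10.0, 50.0, 90.0]
predicted = [30.0, 50.0, 70.0]
r2 = r_squared(observed, predicted)
```

Under this definition, a score near 0.5 means the model removes roughly half of the squared error left by the mean predictor.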
Data from: Machine Learning-Assisted QSAR Models on Contaminant Reactivity...
acs.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 1, 2023
Cite
Shifa Zhong; Yanping Zhang; Huichun Zhang (2023). Machine Learning-Assisted QSAR Models on Contaminant Reactivity Toward Four Oxidants: Combining Small Data Sets and Knowledge Transfer [Dataset]. http://doi.org/10.1021/acs.est.1c04883.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.est.1c04883.s001
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Shifa Zhong; Yanping Zhang; Huichun Zhang
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
To develop predictive models for the reactivity of organic contaminants toward four oxidants (SO4•–, HClO, O3, and ClO2), all with small sample sizes, we proposed two approaches: combining small data sets and transferring knowledge between them. We first merged these data sets and developed a unified model using machine learning (ML), which showed better predictive performance than the individual models for HClO (RMSEtest: 2.1 to 2.04), O3 (2.06 to 1.94), ClO2 (1.77 to 1.49), and SO4•– (0.75 to 0.70) because the model “corrected” the wrongly learned effects of several atom groups. We further developed knowledge transfer models for three pairs of the data sets and observed different predictive performances: improved for O3 (RMSEtest: 2.06 to 2.01)/HClO (2.10 to 1.98), mixed for O3 (2.06 to 2.01)/ClO2 (1.77 to 1.95), and unchanged for ClO2 (1.77 to 1.77)/HClO (2.1 to 2.1). The effectiveness of the latter approach depended on whether there was consistent knowledge shared between the data sets and on the performance of the individual models. We also compared our approaches with multitask learning and image-based transfer learning and found that our approaches consistently improved the predictive performance for all data sets while the other two did not. This study demonstrated the effectiveness of combining small, similar data sets and transferring knowledge between them to improve ML model performance.
STOOKE SMALL BODY SHAPE MODELS V1.0
catalog.data.gov
datasets.ai
+3 more
Updated Aug 23, 2025
+ more versions
Cite
National Aeronautics and Space Administration (2025). STOOKE SMALL BODY SHAPE MODELS V1.0 [Dataset]. https://catalog.data.gov/dataset/stooke-small-body-shape-models-v1-0-99c21
Explore at:
Dataset updated
Aug 23, 2025
Dataset provided by
NASA: http://nasa.gov/
Description
Optical shape models of 10 planetary moons and asteroids, derived from spacecraft imaging by Philip Stooke.
Model predictions of biological condition for small streams in the...
data.usgs.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
+1 more
Updated Jul 13, 2022
+ more versions
Cite
Kelly Maloney; Kevin Krause (2022). Model predictions of biological condition for small streams in the Chesapeake Bay Watershed, USA [Dataset]. http://doi.org/10.5066/P9YKRPO1
Explore at:
Unique identifier
https://doi.org/10.5066/P9YKRPO1
Dataset updated
Jul 13, 2022
Dataset provided by
United States Geological Survey: http://www.usgs.gov/
Authors
Kelly Maloney; Kevin Krause
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Time period covered
1999 - 2019
Area covered
Chesapeake Bay, United States
Description
This data release contains predictions of stream biological condition as defined by the Chesapeake basin-wide index of biotic integrity for stream macroinvertebrates (Chessie BIBI) using Random Forest models with landscape data for small streams (≤ 200 km² in upstream drainage) across the Chesapeake Bay Watershed (CBW). Predictions were made at eight time periods (2001, 2004, 2006, 2008, 2011, 2013, 2016, and 2019) according to changes in landcover using the National Land Cover Database (NLCD). The Chessie BIBI data used were provided by the Interstate Commission on the Potomac River Basin. Uncertainty was calculated using model prediction intervals. For complete data descriptions and data interpretation, see the associated publication (Maloney et al., 2022).
Fault and Severity Diagnosis using Deep Learning for Self-Organizing...
dataverse.lib.nycu.edu.tw
bin, csv +4
Updated Feb 6, 2025
Cite
NYCU Dataverse (2025). Fault and Severity Diagnosis using Deep Learning for Self-Organizing Networks with Imbalanced and Small Datasets [Dataset]. http://doi.org/10.57770/INXEBG
Explore at:
Available download formats: csv, tsv, txt, bin, text/x-python, text/plain (the original listing gives a byte size for each file, ranging from txt(982) to csv(8313655))
Unique identifier
https://doi.org/10.57770/INXEBG
Dataset updated
Feb 6, 2025
Dataset provided by
NYCU Dataverse
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
With the growing complexity of wireless networks, manual management of networks becomes infeasible. To address this, self-organizing networks (SONs) have been introduced, offering self-organizing approaches to network management. Developing effective self-organizing approaches often depends on data-driven or learning-based methods, which require well-structured and balanced datasets. However, in practical scenarios, datasets are often imbalanced or even very small. To address this issue from the fault diagnosis aspect of SONs, this paper investigates learning-based fault and severity diagnosis approaches under imbalanced and small datasets for wireless networks. We first propose a deep learning-based diagnosis framework, in which the diagnosis problem can be cast as a regression problem. Then, several approaches that can be used to resolve the imbalance issue for regression problems, including re-weighting, distribution smoothing, and balanced MSE, are examined for diagnosis. Subsequently, to handle the case where only a few data samples are available for diagnosis, model pre-training and meta-learning-based approaches are used, aiming to quickly adapt the pre-trained diagnosis model to the target scenarios. Extensive simulations based on realistic setups are conducted to evaluate our proposed approaches. Results show that our approaches can effectively diagnose the faults and their severity and outperform the baseline approaches under imbalanced and small datasets.
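Of the imbalance remedies named above, re-weighting is the simplest to illustrate. The sketch below shows one generic form of it: each sample is weighted by the inverse frequency of its target bin, and the weights enter a weighted MSE. The bin width, function names, and toy severity values are assumptions for the example; the paper's exact weighting scheme is not given in this description.

```python
from collections import Counter


def inverse_frequency_weights(targets, bin_width):
    """Weight each sample by 1 / (samples in its target bin), rescaled to mean 1."""
    bins = [int(t // bin_width) for t in targets]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error where each sample's squared error is scaled by its weight."""
    total = sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred))
    return total / sum(weights)


# Toy severity labels: four mild cases and one rare severe case (illustrative only).
severity = [0.1, 0.2, 0.1, 0.3, 0.9]
predicted = [0.1, 0.2, 0.1, 0.3, 0.4]  # the model is wrong only on the rare case
weights = inverse_frequency_weights(severity, bin_width=0.5)
plain = weighted_mse(severity, predicted, [1.0] * len(severity))
reweighted = weighted_mse(severity, predicted, weights)
```

In this toy case the re-weighted loss is larger than the plain one because the only error falls on the rare, heavily weighted sample, which is the behaviour re-weighting is meant to produce.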
Source code in the R programming language, belonging with: Model based...
datasetcatalog.nlm.nih.gov
data.4tu.nl
Updated Oct 28, 2019
Cite
Steinbuch, L.; Brus, D. J.; Orton, T. G. (2019). Source code in the R programming language, belonging with: Model based geostatistics from a Bayesian perspective: Investigating area‐to‐point kriging with small datasets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000138397
Explore at:
Dataset updated
Oct 28, 2019
Authors
Steinbuch, L.; Brus, D. J.; Orton, T. G.
Description
Area-to-point kriging (ATPK) is a geostatistical method for creating maps of high resolution using data of much lower resolution. These R-scripts compare prediction uncertainty using different ATPK methods, using simulations and a real world case concerning crop yields in Burkina Faso.
Data from: Averaging Strategy for Interpretable Machine Learning on Small...
acs.figshare.com
bin
Updated Aug 18, 2023
Cite
Hengjie Yu; Shiyu Tang; Sam Fong Yau Li; Fang Cheng (2023). Averaging Strategy for Interpretable Machine Learning on Small Datasets to Understand Element Uptake after Seed Nanotreatment [Dataset]. http://doi.org/10.1021/acs.est.3c01878.s002
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.est.3c01878.s002
Dataset updated
Aug 18, 2023
Dataset provided by
ACS Publications
Authors
Hengjie Yu; Shiyu Tang; Sam Fong Yau Li; Fang Cheng
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Understanding plant uptake and translocation of nanomaterials is crucial for ensuring the successful and sustainable applications of seed nanotreatment. Here, we collect a dataset with 280 instances from experiments for predicting the relative metal/metalloid concentration (RMC) in maize seedlings after seed priming by various metal and metalloid oxide nanoparticles. To obtain unbiased predictions and explanations on small datasets, we present an averaging strategy and add a dimension for interpretable machine learning. The findings in post-hoc interpretations of sophisticated LightGBM models demonstrate that solubility is highly correlated with model performance. Surface area, concentration, zeta potential, and hydrodynamic diameter of nanoparticles and seedling part and relative weight of plants are dominant factors affecting RMC, and their effects and interactions are explained. Furthermore, self-interpretable models using the RuleFit algorithm are established to successfully predict RMC only based on six important features identified by post-hoc explanations. We then develop a visualization tool called RuleGrid to depict feature effects and interactions in numerous generated rules. Consistent parameter-RMC relationships are obtained by different methods. This study offers a promising interpretable data-driven approach to expand the knowledge of nanoparticle fate in plants and may profoundly contribute to the safety-by-design of nanomaterials in agricultural and environmental applications.
tiny-imagenet
huggingface.co
datasets.activeloop.ai
Updated Aug 12, 2022
Cite
Hao Zheng (2022). tiny-imagenet [Dataset]. https://huggingface.co/datasets/zh-plus/tiny-imagenet
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2022
Authors
Hao Zheng
License
https://choosealicense.com/licenses/undefined/
Description
Dataset Card for tiny-imagenet

Dataset Summary

Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.

Languages

The class labels in the dataset are in English.

Dataset Structure Data Instances

{ 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.
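The per-class counts in the summary can be turned into split sizes with a little arithmetic. The sketch below does that and adds a basic sanity check for the `label` field. It assumes that labels are zero-based integer class indices, which the card excerpt does not state, and the helper names are hypothetical.

```python
NUM_CLASSES = 200
IMAGES_PER_CLASS = {"train": 500, "validation": 50, "test": 50}


def split_sizes(num_classes=NUM_CLASSES, per_class=IMAGES_PER_CLASS):
    """Total number of images in each split, given the per-class counts."""
    return {split: num_classes * count for split, count in per_class.items()}


def is_valid_label(label, num_classes=NUM_CLASSES):
    """Assumption: labels are integer class indices in [0, num_classes)."""
    return isinstance(label, int) and 0 <= label < num_classes


sizes = split_sizes()
print(sizes)  # {'train': 100000, 'validation': 10000, 'test': 10000}
```

The 100000 images mentioned in the summary match the training split alone, and a label such as 15 from the instance above passes the check.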

Cite

Nam Pham (2024). tiny-textbooks [Dataset]. http://doi.org/10.57967/hf/1126

tiny-textbooks

Tiny Textbooks

nampdn-ai/tiny-textbooks

Explore at:

17 scholarly articles cite this dataset (View in Google Scholar)
