Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Textbook-like Dataset: A High-Quality Resource for Small Language Models
The idea is simply inspired by the Textbooks Are All You Need II: phi-1.5 technical report paper. The source texts in this dataset have been gathered and carefully select the best of the falcon-refinedweb and minipile datasets to ensure the diversity, quality while tiny in size. The dataset was synthesized using 4x3090 Ti cards over a period of 500 hours, thanks to Nous-Hermes-Llama2-13b finetuned model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Small Dataset Ml is a dataset for object detection tasks - it contains Post annotations for 571 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
This is a subset of: https://huggingface.co/datasets/openerotica/long-roleplay-v0.1 I am using mistral's new DEVSTRAL model to take the entire conversation in JSON format and rate it. I chose DEVSTRAL due to the mistral models being very consistent and well rounded. The Devstral model I was hoping could understand the JSON format a bit better. I ask the mode to rate each RP based on many different factors including grammar, prose, length (And a few others I will keep to myself :D). I then… See the full description on the dataset page: https://huggingface.co/datasets/SuperbEmphasis/Open-ert-small-dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Object Small is a dataset for object detection tasks - it contains Object Small annotations for 4,165 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
https://choosealicense.com/licenses/gfdl/https://choosealicense.com/licenses/gfdl/
this is a subset of the wikimedia/wikipedia dataset code for creating this dataset : from datasets import load_dataset, Dataset from sentence_transformers import SentenceTransformer model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
dataset = load_dataset( "wikimedia/wikipedia", "20231101.en", split="train", streaming=True )
from tqdm importtqdm data = Dataset.from_dict({}) for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Tank Small is a dataset for object detection tasks - it contains Tank Small annotations for 3,153 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of this study was to develop early prediction models for respiratory failure risk in patients with severe pneumonia using four ensemble learning algorithms: LightGBM, XGBoost, CatBoost, and random forest, and to compare the predictive performance of each model. In this study, we used the eICU Collaborative Research Database (eICU-CRD) for sample extraction, built a respiratory failure risk prediction model for patients with severe pneumonia based on four ensemble learning algorithms, and developed compact models corresponding to the four complete models to improve clinical practicality. The average area under receiver operating curve (AUROC) of the models on the test sets after ten random divisions of the dataset and the average accuracy at the best threshold were used as the evaluation metrics of the model performance. Finally, feature importance and Shapley additive explanation values were introduced to improve the interpretability of the model. A total of 1676 patients with pneumonia were analyzed in this study, of whom 297 developed respiratory failure one hour after admission to the intensive care unit (ICU). Both complete and compact CatBoost models had the highest average AUROC (0.858 and 0.857, respectively). The average accuracies at the best threshold were 75.19% and 77.33%, respectively. According to the feature importance bars and summary plot of the predictor variables, activetx (indicates whether the patient received active treatment), standard deviation of prothrombin time-international normalized ratio, Glasgow Coma Scale verbal score, age, and minimum oxygen saturation and respiratory rate were important. Compared with other ensemble learning models, the complete and compact CatBoost models have significantly higher average area under the curve values on the 10 randomly divided test sets. Additionally, the standard deviation (SD) of the compact CatBoost model is relatively small (SD:0.050), indicating that the performance of the compact CatBoost model is stable among these four ensemble learning models. The machine learning predictive models built in this study will help in early prediction and intervention of respiratory failure risk in patients with pneumonia in the ICU.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
In the last years, neural networks have evolved from laboratory environments to the state-of-the-art for many real-world problems. Our hypothesis is that neural network models (i.e., their weights and biases) evolve on unique, smooth trajectories in weight space during training. Following, a population of such neural network models (refereed to as “model zoo”) would form topological structures in weight space. We think that the geometry, curvature and smoothness of these structures contain information about the state of training and can be reveal latent properties of individual models. With such zoos, one could investigate novel approaches for (i) model analysis, (ii) discover unknown learning dynamics, (iii) learn rich representations of such populations, or (iv) exploit the model zoos for generative modelling of neural network weights and biases. Unfortunately, the lack of standardized model zoos and available benchmarks significantly increases the friction for further research about populations of neural networks. With this work, we publish a novel dataset of model zoos containing systematically generated and diverse populations of neural network models for further research. In total the proposed model zoo dataset is based on six image datasets, consist of 27 model zoos with varying hyperparameter combinations are generated and includes 50’360 unique neural network models resulting in over 2’585’360 collected model states. Additionally, to the model zoo data we provide an in-depth analysis of the zoos and provide benchmarks for multiple downstream tasks as mentioned before.
Dataset
This dataset is part of a larger collection of model zoos and contains the zoo of 1000 ResNet18 models trained on Tiny Imagenet. All zoos with extensive information and code can be found at www.modelzoos.cc.
The complete zoo is 2.6TB large. Due to the size, this repository contains the checkpoints of the first 115 models at their last epoch 60. For a link to the full dataset as well as more information on the zoos and code to access and use the zoos, please see www.modelzoos.cc.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FLL training for FIXIKI team...this is for our innovation project, hope you find this useful!!!!!!
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Plate Small is a dataset for object detection tasks - it contains Plates annotations for 300 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
An automatic bird sound recognition system is a useful tool for collecting data of different bird species for ecological analysis. Together with autonomous recording units (ARUs), such a system provides a possibility to collect bird observations on a scale that no human observer could ever match. During the last decades progress has been made in the field of automatic bird sound recognition, but recognizing bird species from untargeted soundscape recordings remains a challenge. In this article we demonstrate the workflow for building a global identification model and adjusting it to perform well on the data of autonomous recorders from a specific region. We show how data augmentation and a combination of global and local data can be used to train a convolutional neural network to classify vocalizations of 101 bird species. We construct a model and train it with a global data set to obtain a base model. The base model is then fine-tuned with local data from Southern Finland in order to adapt it to the sound environment of a specific location and tested with two data sets: one originating from the same Southern Finnish region and another originating from a different region in German Alps. Our results suggest that fine-tuning with local data significantly improves the network performance. Classification accuracy was improved for test recordings from the same area as the local training data (Southern Finland) but not for recordings from a different region (German Alps). Data augmentation enables training with a limited number of training data and even with few local data samples significant improvement over the base model can be achieved. Our model outperforms the current state-of-the-art tool for automatic bird sound classification. Using local data to adjust the recognition model for the target domain leads to improvement over general non-tailored solutions. The process introduced in this article can be applied to build a fine-tuned bird sound classification model for a specific environment. Methods This repository contains data and recognition models described in paper Domain-specific neural networks improve automated bird sound recognition already with small amount of local data. (Lauha et al., 2022).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The application of machine learning (ML) to predict reaction yields has shown remarkable accuracy when based on high-throughput computational and experimental data. However, the accuracy significantly diminishes when leveraging literature-derived data, highlighting a gap in the predictive capability of the current ML models. This study, focusing on Pd-catalyzed carbonylation reactions, reveals that even with a data set of 2512 reactions, the best-performing model reaches only an R2 of 0.51. Further investigations show that the models’ effectiveness is predominantly confined to predictions within narrow subsets of data, closely related and from the same literature sources, rather than across the broader, heterogeneous data sets available in the literature. The reliance on data similarity, coupled with small sample sizes from the same sources, makes the model highly sensitive to inherent fluctuations typical of small data sets, adversely impacting stability, accuracy, and generalizability. The findings underscore the inherent limitations of current ML techniques in leveraging literature-derived data for predicting chemical reaction yields, highlighting the need for more sophisticated approaches to handle the complexity and diversity of chemical data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
To develop predictive models for the reactivity of organic contaminants toward four oxidantsSO4•–, HClO, O3, and ClO2all with small sample sizes, we proposed two approaches: combining small data sets and transferring knowledge between them. We first merged these data sets and developed a unified model using machine learning (ML), which showed better predictive performance than the individual models for HClO (RMSEtest: 2.1 to 2.04), O3 (2.06 to 1.94), ClO2 (1.77 to 1.49), and SO4•– (0.75 to 0.70) because the model “corrected” the wrongly learned effects of several atom groups. We further developed knowledge transfer models for three pairs of the data sets and observed different predictive performances: improved for O3 (RMSEtest: 2.06 to 2.01)/HClO (2.10 to 1.98), mixed for O3 (2.06 to 2.01)/ClO2 (1.77 to 1.95), and unchanged for ClO2 (1.77 to 1.77)/HClO (2.1 to 2.1). The effectiveness of the latter approach depended on whether there was consistent knowledge shared between the data sets and on the performance of the individual models. We also compared our approaches with multitask learning and image-based transfer learning and found that our approaches consistently improved the predictive performance for all data sets while the other two did not. This study demonstrated the effectiveness of combining small, similar data sets and transferring knowledge between them to improve ML model performance.
Optical shape models of 10 planetary moons and asteroids, derived from spacecraft imaging by Philip Stooke.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
This data release contains predictions of stream biological condition as defined by the Chesapeake basin-wide index of biotic integrity for stream macroinvertebrates (Chessie BIBI) using Random Forest models with landscape data for small streams (≤ 200 km2 in upstream drainage) across the Chesapeake Bay Watershed (CBW). Predictions were made at eight time periods (2001, 2004, 2006, 2008, 2011, 2013, 2016, and 2019) according to changes in landcover using the National Land Cover Database (NLCD). The Chessie BIBI data used were provided by the Interstate Commission on the Potomac River Basin. Uncertainty was calculated using model prediction intervals. For complete data descriptions and data interpretation see associated publication (Maloney et al., 2022).
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
With the growing complexity of wireless networks, manual management of networks becomes infeasible. To address this, self-organizing networks (SONs) have been introduced to provide solutions by offering self-organizing approaches to networks. Developing effective self-organizing approaches often depends on data-driven or learning-based methods, which require well-structured and balanced datasets. However, in practical scenarios, datasets are often imbalanced or even very small. To address this issue from the fault diagnosis aspect of SONs, this paper investigates the learning-based fault and severity diagnosis approaches under imbalanced and small datasets for wireless networks. We first propose a deep learning-based diagnosis framework, in which the diagnosis problem can be cast as a regression problem. Then, several approaches, including re-weighting, distribution smoothing, and balanced MSE, that can be used to resolve the imbalanced issue for regression problem are examined under the diagnosis purpose. Subsequently, to resolve the issue that the amount of data samples for diagnosis could be few, model pre-training and meta-learning-based approaches are used, aiming to quickly adapt the pre-trained diagnosis model to the targeting scenarios for diagnosis. Extensive simulation results based on realistic setups are conducted to evaluate our proposed approaches. Results show that our approaches can effectively diagnose the faults and their severity and outperform the baseline approaches under imbalanced and small datasets.
Area-to-point kriging (ATPK) is a geostatistical method for creating maps of high resolution using data of much lower resolution. These R-scripts compare prediction uncertainty using different ATPK methods, using simulations and a real world case concerning crop yields in Burkina Faso.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Understanding plant uptake and translocation of nanomaterials is crucial for ensuring the successful and sustainable applications of seed nanotreatment. Here, we collect a dataset with 280 instances from experiments for predicting the relative metal/metalloid concentration (RMC) in maize seedlings after seed priming by various metal and metalloid oxide nanoparticles. To obtain unbiased predictions and explanations on small datasets, we present an averaging strategy and add a dimension for interpretable machine learning. The findings in post-hoc interpretations of sophisticated LightGBM models demonstrate that solubility is highly correlated with model performance. Surface area, concentration, zeta potential, and hydrodynamic diameter of nanoparticles and seedling part and relative weight of plants are dominant factors affecting RMC, and their effects and interactions are explained. Furthermore, self-interpretable models using the RuleFit algorithm are established to successfully predict RMC only based on six important features identified by post-hoc explanations. We then develop a visualization tool called RuleGrid to depict feature effects and interactions in numerous generated rules. Consistent parameter-RMC relationships are obtained by different methods. This study offers a promising interpretable data-driven approach to expand the knowledge of nanoparticle fate in plants and may profoundly contribute to the safety-by-design of nanomaterials in agricultural and environmental applications.
https://choosealicense.com/licenses/undefined/https://choosealicense.com/licenses/undefined/
Dataset Card for tiny-imagenet
Dataset Summary
Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.
Languages
The class labels in the dataset are in English.
Dataset Structure
Data Instances
{ 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Textbook-like Dataset: A High-Quality Resource for Small Language Models
The idea is simply inspired by the Textbooks Are All You Need II: phi-1.5 technical report paper. The source texts in this dataset have been gathered and carefully select the best of the falcon-refinedweb and minipile datasets to ensure the diversity, quality while tiny in size. The dataset was synthesized using 4x3090 Ti cards over a period of 500 hours, thanks to Nous-Hermes-Llama2-13b finetuned model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.