Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Including the split of real and null reactions for training, validation and test
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
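A minimal sketch of the selection and split procedure described above (one reading of the cap rule; the column names, split proportions, and seed are illustrative assumptions, not the actual NABat pipeline):

```python
import pandas as pd

CAP = 1250  # cap on recordings per species/grid-cell combination (one reading of the text)

def select_files(files: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Randomly sample up to CAP files for each species/grid-cell combination."""
    return (
        files.groupby(["species", "grid_cell"], group_keys=False)
             .apply(lambda g: g.sample(n=min(len(g), CAP), random_state=seed))
    )

def random_split(files: pd.DataFrame, seed: int = 0):
    """Random train/validation/test split; the 80/10/10 proportions are an assumption."""
    shuffled = files.sample(frac=1.0, random_state=seed)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.8 * n)]
    val = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n) :]  # the holdout set excluded from this release
    return train, val, test
```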
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
I have just performed a train/test/validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
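The pre-split dataset can be loaded with the Hugging Face datasets library; a minimal sketch, assuming the splits use the conventional train/validation/test names:

```python
from datasets import load_dataset

ds = load_dataset("disham993/alpaca-train-validation-test-split")
print(ds)              # expected: a DatasetDict with train/validation/test splits
print(ds["train"][0])  # a single instruction/input/output record
```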
No license specified, https://academictorrents.com/nolicensespecified
Challenge 2 Image Sets. Training data is accompanied by interpolated steering values. Test data only has center image frames.
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed to provide large-scale datasets to TREC and to create a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

As in previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The uptake transporter OATP1B1 (SLCO1B1) is largely localized to the sinusoidal membrane of hepatocytes and is a known victim of unwanted drug–drug interactions. Computational models are useful for identifying potential substrates and/or inhibitors of clinically relevant transporters. Our goal was to generate OATP1B1 in vitro inhibition data for [3H] estrone-3-sulfate (E3S) transport in CHO cells and use it to build machine learning models, facilitating a comparison of eight different classification models (deep learning, AdaBoosted decision trees, Bernoulli naïve Bayes, k-nearest neighbors (knn), random forest, support vector classifier (SVC), logistic regression (lreg), and XGBoost (xgb)) using ECFP6 fingerprints to perform 5-fold, nested cross-validation. In addition, we compared models using 3D pharmacophores, simple chemical descriptors alone or plus ECFP6, as well as ECFP4 and ECFP8 fingerprints. Several machine learning algorithms (SVC, lreg, xgb, and knn) had excellent nested cross-validation statistics, particularly for accuracy, AUC, and specificity. An external test set containing 207 unique compounds not in the training set demonstrated that, at every threshold, SVC outperformed the other algorithms based on a rank-normalized score. A prospective validation test set was chosen using prediction scores from the SVC models with ECFP fingerprints and tested in vitro; 15 of 19 compounds (84% accuracy) predicted as active (≥20% inhibition) showed inhibition. Of these compounds, six (abamectin, asiaticoside, berbamine, doramectin, mobocertinib, and umbralisib) appear to be novel inhibitors of OATP1B1 not previously reported. These validated machine learning models can now be used to make predictions for drug–drug interactions for human OATP1B1 alongside other machine learning models for important drug transporters in our MegaTrans software.
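As a rough illustration (not the authors' code): ECFP6 fingerprints correspond to Morgan fingerprints of radius 3 in RDKit, and the 5-fold nested cross-validation can be set up with scikit-learn. The hyperparameter grid and scoring choice below are assumptions:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def ecfp6(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """ECFP6 = Morgan fingerprint with radius 3."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits))

# X = np.stack([ecfp6(s) for s in smiles_list]); y = binary inhibition labels
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
svc = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner, scoring="roc_auc")
# nested CV: the grid search runs inside each outer training fold
# scores = cross_val_score(svc, X, y, cv=outer, scoring="roc_auc")
```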
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training, validation, and test data for the model presented in the paper 'A Little Data Goes A Long Way: Automating Seismic Phase Arrival Picking at Nabro Volcano with Transfer Learning', submitted to Journal of Geophysical Research: Solid Earth.
Files:
- train_events_2498.h5 = training set of seismic waveforms (events with P-/S-wave labelled arrivals only, i.e., no noise waveforms)
- train_events_2498.pkl = event training set metadata (UTC P-/S-wave phase arrival times)
- train_noise_2498.h5 = training set of seismic waveforms (noise sections only, i.e., no event waveforms)
- train_noise_2498.pkl = noise training set metadata (UTC time for training noise waveforms)
- val_events.h5 = validation set of seismic waveforms (events with P-/S-wave labelled arrivals only, i.e., no noise waveforms)
- val_events.pkl = event validation set metadata (UTC P-/S-wave phase arrival times)
- val_noise.h5 = validation set of seismic waveforms (noise sections only, i.e., no event waveforms)
- val_noise.pkl = noise validation set metadata (UTC time for validation noise waveforms)
- test.h5 = test set of seismic waveforms (events and noise)
- test_events.pkl = event test set metadata (UTC P-/S-wave phase arrival times for test event waveforms)
- test_noise.pkl = noise test set metadata (UTC time for test noise waveforms)
- nabro_2011-247.mseed = 24 hours of seismic data from the Nabro Urgency Array (2011-09-04), saved in mseed format (e.g., can be read with obspy)
- nabro_2011-269.mseed = 24 hours of seismic data from the Nabro Urgency Array (2011-09-26), saved in mseed format (e.g., can be read with obspy)
Further details and code for reading and using these files can be found at the GitHub repo for this paper: https://github.com/sachalapins/U-GPD
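A minimal sketch of reading each file type (the internal layout of the .h5 files is not documented here; the repo above contains the authoritative loading code):

```python
import h5py
import pandas as pd
from obspy import read

stream = read("nabro_2011-247.mseed")            # 24 hours of Nabro Urgency Array data
meta = pd.read_pickle("train_events_2498.pkl")   # UTC P-/S-wave arrival time metadata

with h5py.File("train_events_2498.h5", "r") as f:
    f.visit(print)  # list groups/datasets to discover the waveform layout
```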
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Inspect Dataset: test_dataset
Dataset Information
This dataset was created using the create_inspect_dataset function from the deception_sprint package on 2025-05-02.
Model Information
Model: vllm/meta-llama/Llama-3.2-1B-Instruct
Task Information
Tasks: deception_sprint/wmdp_bio, deception_sprint/wmdp_chem, deception_sprint/wmdp_cyber, deception_sprint/cybermetric_2000, deception_sprint/sec_qa_v1, deception_sprint/sec_qa_v2… See the full description on the dataset page: https://huggingface.co/datasets/aisi-whitebox/training-set-mo-v1-test.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
After fast mean shift (FMS) clustering, the whole research area was divided into 10 subareas, and new samples characterizing the geographical features of each subarea were collected through field investigations. Because of our limited human and material resources, it was difficult to conduct mass sampling in each subarea, so we needed a reasonable field sampling strategy to make the most of the resources available. For the first two, large subareas, we collected 70 field samples each, labeled as the first and second sample sets; these were used to build their own GWR models for prediction at unobserved points within each area, i.e., local extension prediction. The remaining 8 small subareas received moderate numbers of samples according to their size: if a subarea contained more than 5,000 raster points, 16 samples were collected from it; otherwise, 12. In this way, a total of 112 samples were put together as the third sample set, and the third GWR model was constructed to achieve global extension prediction across the 8 subareas. In addition, the three sample sets were each divided into a training set and a test set. For the first two sample sets, the ratio of training to test samples was 5:2, i.e., each training set contains 50 samples and each test set 20 samples. Because the third sample set is composed of samples from 8 subareas, we divided the samples of each subarea into training and test sets at a ratio of 3:1. In other words, the training-set sample numbers for the third to tenth subareas are 12, 9, 9, 12, 9, 12, 12, and 9 respectively (84 training samples in total), and the test-set numbers for the eight subareas are 4, 3, 3, 4, 3, 4, 4, and 3 respectively (28 samples in total).
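A small sketch of the sampling and splitting rules just described (subarea sizes are hypothetical placeholders):

```python
import random

def samples_for_small_subarea(n_raster_points: int) -> int:
    """Subareas 3-10: 16 samples if the subarea has more than 5,000 raster points, else 12."""
    return 16 if n_raster_points > 5000 else 12

def split_3_to_1(samples: list) -> tuple[list, list]:
    """Per-subarea 3:1 train/test split used for the third sample set."""
    shuffled = random.sample(samples, len(samples))
    cut = len(samples) * 3 // 4
    return shuffled[:cut], shuffled[cut:]

# e.g. a subarea with 6,000 raster points gets 16 samples, split into 12 train / 4 test
```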
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions
Funding: These data were collected as part of research funded by:
This dataset is released under the CC-BY 4.0 licence, which requires that you cite the dataset in any outputs; as an additional condition, you must acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx
This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270,287
Fields:
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset is generated in two steps using the GPT-4o model.
- In the first step, topic descriptions relevant to the competition are generated using the prompt below. By running this prompt multiple times, over 3,000 descriptions were collected.
- In the second step, SVG code is generated for each collected description using the second prompt below.
prompt = f"""
I am participating in an SVG code generation competition. The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.

To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?

Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.

Example topics: a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.

Please return the 100 topics in csv format.
"""
prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.

Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`

Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. Focus on a clear and concise representation of the input description within the given limitations. Always give the complete SVG code with nothing omitted. Never use an ellipsis.

The code is scored based on similarity to the description, visual question answering, and aesthetic components. Please generate a detailed SVG code accordingly.

input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
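A hedged sketch of that filtering step, assuming the google/siglip-base-patch16-224 checkpoint and cairosvg for rasterization (the author's exact choices are not stated):

```python
import io
import torch
import cairosvg
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def siglip_score(description: str, svg_code: str) -> float:
    png = cairosvg.svg2png(bytestring=svg_code.encode())  # rasterize the SVG
    image = Image.open(io.BytesIO(png)).convert("RGB")
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logit = model(**inputs).logits_per_image
    return torch.sigmoid(logit).item()  # SigLIP similarities pass through a sigmoid

# keep the sample only if siglip_score(text, svg) > 0.5
```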
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover using three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1-948; validation set images were randomly sampled from sol range 949-1920; test set images were randomly sampled from sol range 1921-2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
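The sol-based split can be expressed as a simple lookup (sol ranges taken directly from the description above):

```python
def split_for_sol(sol: int) -> str:
    """Assign an image to a split by its sol (Martian day) of acquisition."""
    if 1 <= sol <= 948:
        return "train"
    if 949 <= sol <= 1920:
        return "validation"
    if 1921 <= sol <= 2224:
        return "test"
    raise ValueError(f"sol {sol} is outside the data set's sol range")
```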
Directory Contents
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The class names, string-to-integer mappings, and distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images: 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. We employed five different methods to augment images; images taken by the Mastcam left and right eye cameras were augmented using horizontal flipping only, while images taken by the MAHLI camera were augmented using all five methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.
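As an illustration, horizontal flipping (the one method applied to images from all three instruments) could look like this; the remaining four methods are not enumerated in this description:

```python
from PIL import Image

def augment_horizontal_flip(src_path: str, dst_path: str) -> None:
    """Save a horizontally flipped copy of an image."""
    with Image.open(src_path) as img:
        img.transpose(Image.FLIP_LEFT_RIGHT).save(dst_path)
```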
Acknowledgment
The authors would like to thank the volunteers (as in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for the continuous support of this work.
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
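The centering step can be sketched as follows (an illustration of the described procedure, not the original NIST code):

```python
import numpy as np
from scipy import ndimage

def center_digit(digit20: np.ndarray) -> np.ndarray:
    """Place a 20x20 digit on a 28x28 canvas with its center of mass at the center."""
    canvas = np.zeros((28, 28), dtype=float)
    canvas[4:24, 4:24] = digit20                           # paste the size-normalized digit
    cy, cx = ndimage.center_of_mass(canvas)                # intensity-weighted centroid
    return ndimage.shift(canvas, (13.5 - cy, 13.5 - cx))   # translate centroid to center
```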
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 600 sets of data in the training set and 200 sets of data in the test set.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Images scaled and modified to form a training set dataset. It can be used to detect and identify object type based on material type in the image. In this process, both a training data set and a test data set can be generated from these image files.
aisi-whitebox/test-collection-to-training-set dataset hosted on Hugging Face and contributed by the HF Datasets community
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.

In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of ids for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
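A hedged sketch of drawing a stratified validation split from a training set, in the spirit of the provided id sets (the file name and label column are assumptions about the corpus layout):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# file name is an assumption; the corpus ships category-specific pair files
pairs = pd.read_json("computers_train_small.json.gz", lines=True)
train_df, val_df = train_test_split(
    pairs, test_size=0.2, stratify=pairs["label"], random_state=42
)
```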
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Antiviral peptides (AVPs) are bioactive peptides that exhibit inhibitory activity against viruses through a range of mechanisms. Virus entry inhibitory peptides (VEIPs) make up a specific class of AVPs that can prevent enveloped viruses from entering cells. With the growing number of experimentally verified VEIPs, there is an opportunity to use machine learning to predict peptides that inhibit virus entry. In this paper, we have developed the first target-specific prediction model for the identification of new VEIPs using, along with the peptide sequence characteristics, the attributes of the envelope proteins of the target virus, which overcomes the problem of insufficient data for particular viral strains and improves the predictive ability. The model’s performance was evaluated through 10 repeats of 10-fold cross-validation on the training data set, and the results indicate that it can predict VEIPs with 87.33% accuracy and a Matthews correlation coefficient (MCC) of 0.76. The model also performs well on an independent test set, with 90.91% accuracy and an MCC of 0.81. We have also developed an automatic computational tool that predicts VEIPs, which is freely available at https://dbaasp.org/tools?page=linear-amp-prediction.
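The evaluation protocol (10 repeats of 10-fold cross-validation, scored with accuracy and MCC) can be sketched with scikit-learn; the classifier below is a placeholder, not the published model:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
mcc = make_scorer(matthews_corrcoef)
clf = RandomForestClassifier(random_state=0)  # placeholder classifier
# X: peptide + envelope-protein features, y: VEIP labels (assumed available)
# scores = cross_val_score(clf, X, y, cv=cv, scoring=mcc)
```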
Titanic Dataset Description

Overview

The data is divided into two groups:
- Training set (train.csv): Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to predict survival based on "features" like gender and class. Feature engineering can also be applied to create new features.
- Test set (test.csv): Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.
Additionally, gender_submission.csv is provided as an example submission file, containing predictions based on the assumption that all and only female passengers survive.
Data Dictionary

| Variable | Definition | Key |
|----------|------------------------------------------|-------------------------------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard the Titanic | |
| parch | # of parents/children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: Proxy for socio-economic status (SES):
1st = Upper
2nd = Middle
3rd = Lower
age:
Fractional if less than 1 year.
Estimated ages are represented in the form xx.5.
sibsp: Defines family relations as:
Sibling: Brother, sister, stepbrother, stepsister.
Spouse: Husband, wife (excluding mistresses and fiancés).
parch: Defines family relations as:
Parent: Mother, father.
Child: Daughter, son, stepdaughter, stepson.
Some children traveled only with a nanny, so parch = 0 for them.
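A minimal sketch reproducing the gender_submission baseline described above (PassengerId and Sex are the standard column names in test.csv):

```python
import pandas as pd

test = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),  # all and only females survive
})
submission.to_csv("submission.csv", index=False)
```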