Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
If you're working on a house price prediction project, it's common to have both training and testing datasets that contain valuable information about different properties. The training set is utilized to train your machine learning model, allowing it to learn patterns and relationships within the data, while the testing set is held back to evaluate how well the model generalizes to new, unseen data.
However, in certain scenarios, practitioners may choose to combine the training and testing datasets into a single dataset for efficiency and convenience. This combined dataset approach aims to streamline the coding process, potentially saving time by avoiding the need to manage and preprocess two separate datasets. This can be particularly beneficial in situations where rapid prototyping or exploratory data analysis is the primary focus.
However, feature engineering, which involves transforming raw data into meaningful features, can become more intricate when dealing with a unified dataset. Insights gained from the testing set could leak into decisions made during the training phase, potentially compromising the model's ability to accurately predict house prices for new instances.
If you decide to proceed with a combined dataset, careful steps must be taken to mitigate potential issues, such as handling data preprocessing, missing values, and feature scaling separately for the training and testing portions. Additionally, it's essential to be cautious with evaluation metrics and consider techniques like cross-validation applied exclusively to the training data to ensure a robust and unbiased assessment of your model's performance.
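As a rough illustration of that last point, here is a minimal sketch (assuming a pandas DataFrame with a SalePrice target and purely numeric features; the file and column names are placeholders, not part of any specific dataset) of fitting preprocessing on the training portion only and cross-validating exclusively on the training data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Placeholder file and column names -- adapt to your own dataset.
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]

# Hold out a test portion; all fitting below touches only the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Imputation and scaling live inside the pipeline, so they are re-fit on each
# cross-validation fold's training split -- never on held-out data.
model = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), Ridge())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV R^2:", cv_scores.mean())

# Only after model selection is the held-out portion used, once.
model.fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))
```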
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.
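For readers unfamiliar with the RBF expansion mentioned above, the sketch below shows the general idea of expanding a scalar descriptor into a vector of Gaussian basis values; the centres, width, and descriptor values are illustrative and are not taken from the paper's implementation:

```python
import numpy as np

def rbf_expand(values, centers, gamma):
    """Expand each scalar in `values` into exp(-gamma * (x - c)^2) over the given centers."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)    # shape (n, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)  # shape (1, k)
    return np.exp(-gamma * (values - centers) ** 2)            # shape (n, k)

# Illustrative example: expand three partial-charge-like values onto 8 centers.
charges = [-0.42, 0.13, 0.55]
centers = np.linspace(-1.0, 1.0, 8)
features = rbf_expand(charges, centers, gamma=10.0)
print(features.shape)  # (3, 8)
```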
Below are descriptions of the available scripts:
atom_bond_descriptors.sh: Trains atom/bond targets.
atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.
dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.
energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.
get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.
Below is the procedure for running the ml-QM-GNN on your own dataset:
1. Run get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
2. Run atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
3. Run csv2pkl.py to convert the data from the predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).
https://dataverse.lib.nycu.edu.tw/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57770/A17ZGB
Deep neural networks (DNNs) are known to perform well when deployed to test distributions that share high similarity with the training distribution. Feeding DNNs sequentially with new data that were unseen in the training distribution poses two major challenges: fast adaptation to new tasks and catastrophic forgetting of old tasks. Such difficulties paved the way for the ongoing research on few-shot learning and continual learning. To tackle these problems, we introduce Attentive Independent Mechanisms (AIM). We incorporate the idea of learning using fast and slow weights in conjunction with the decoupling of the feature extraction and higher-order conceptual learning of a DNN. AIM is designed for higher-order conceptual learning, modeled by a mixture of experts that compete to learn independent concepts to solve a new task. AIM is a modular component that can be inserted into existing deep learning frameworks. We demonstrate its capability for few-shot learning by adding it to SIB and training on MiniImageNet and CIFAR-FS, showing significant improvement. AIM is also applied to ANML and OML, trained on Omniglot, CIFAR-100 and MiniImageNet, to demonstrate its capability in continual learning.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thalassemia is an inherited blood disorder and is among the five most prevalent birth-related complications, especially in Southeast Asia. Thalassemia is classified into two main types, alpha-thalassemia and beta-thalassemia, based on the reduced or absent production of the corresponding globin chains. Over the past couple of decades, researchers have increasingly focused on the application of machine learning algorithms to medical data for identifying hidden patterns to assist in the prediction and classification of diseases and patients. To effectively analyze more complex medical data, more robust machine learning models have been developed to address various health issues. Many researchers have employed different artificial intelligence-based algorithms, e.g., Random Forest, Decision Tree, Support Vector Machine, ensemble-based classifiers, and deep neural networks, to accurately detect carriers of beta-thalassemia by training on both diseased and normal test reports. While genetic testing is required by doctors for the most accurate diagnosis, a simple Complete Blood Count (CBC) report can be used to estimate the likelihood of being a beta-thalassemia carrier. Various models have successfully identified beta-thalassemia carriers using CBC data alone, but these models perform classification and prediction based on normalized data. They achieve high accuracy but at the cost of substantial changes to the dataset through class normalization. In this research, we propose a Dominance-based Rough Set Approach model to classify patients without balancing the classes (Normal, Abnormal); the model achieved good performance (91% accuracy). In terms of generalization, the proposed model obtained 89% accuracy on unseen data, comparable to or better than existing approaches.
Dataset Description:
This dataset is a comprehensive collection of Blood Cell Detection (BCD) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.
Blood Cell Detection (BCD) is an essential task in medical diagnostics, aiding in the evaluation of overall health and the detection of various disorders such as infections, anemia, and blood cancers. This dataset provides a rich source of BCD images that can be used to train machine learning models to automate the detection and analysis of different types of blood cells.
Training Set:
Validation Set:
Test Set:
Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.
This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include: - Automated Blood Cell Detection: Developing algorithms to automatically detect and analyze blood cells and provide diagnostic insights. - Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders. - Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of blood cell detection and analysis.
This BCD dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for blood cell detection.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We applied a random forest algorithm to process accelerometer data from broiler chickens. Data from three broiler strains at a range of ages (25-49 days old) were used to train and test the algorithm and, unlike other studies, the algorithm was further tested on an unseen broiler strain. When tested on unseen birds from the three training broiler strains, the random forest model classified behaviours with very good accuracy (92%) and specificity (94%), and good sensitivity (88%) and precision (88%). With the new, unseen strain, the model classified behaviours with very good accuracy (94%), sensitivity (91%), specificity (96%) and precision (91%).
https://creativecommons.org/publicdomain/zero/1.0/
By Arto (From Huggingface) [source]
The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.
Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.
Each entry in these CSV files includes a filename string that indicates the name or identifier of an image file stored in another location or directory, along with a list (or multiple rows) of strings representing written descriptions or captions describing the respective image.
Considering these details about the dataset's structure, it can be immensely valuable to researchers, developers, and enthusiasts working on innovative computer vision algorithms such as automatic text generation based on visual content analysis, whether that means training machine learning models to generate relevant captions for new, unseen images or evaluating existing systems' performance against diverse criteria.
Stay updated with cutting-edge research trends by leveraging this comprehensive dataset containing not only captions but also corresponding images across different sets specifically designed to cater to varied purposes within computer vision tasks.
Overview of the Dataset
The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.
Understanding the Files
- train.csv: This file contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
- test.csv: The test set is included in this file, which has a similar structure to that of train.csv. The purpose of this file is to evaluate your trained models on unseen data.
- valid.csv: This validation set provides images with their respective filenames (filename) and captions (captions). It allows you to fine-tune your models based on performance during evaluation.
Getting Started
To begin utilizing this dataset effectively, follow these steps:
- Extract the zip file containing all relevant data files onto your local machine or cloud environment.
- Familiarize yourself with each CSV file's structure: train.csv, test.csv, and valid.csv. Understand how information like filename(s) (filename) corresponds with its respective caption(s) (captions).
- Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
- Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed (see the loading sketch after these steps).
- Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
- Split the data into training, validation, and test sets according to your experimental design requirements.
- Use appropriate algorithms and techniques to train your image captioning models on the provided data.
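Referring back to the loading step above, here is a minimal pandas sketch for reading the three CSV files. It assumes the filename and captions column names described earlier; the images/ directory is an assumption about the layout, not something specified by the dataset.

```python
import pandas as pd

# Read the three splits described above.
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
test_df = pd.read_csv("test.csv")

# Group captions by image, since each image may have multiple caption rows.
train_captions = train_df.groupby("filename")["captions"].apply(list)
print(train_captions.head())

# Assumed layout: image files live under an images/ directory.
example_file = train_df["filename"].iloc[0]
print("First training image:", f"images/{example_file}")
```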
Enhancing Model Performance
To optimize model performance using this dataset, consider these tips:
- Explore different architectures and pre-trained models specifically designed for image captioning tasks.
- Experiment with various natural language
- Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
- Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
- Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
Background: Classification of the electrocardiogram using Neural Networks has become a widely used method in recent years. The efficiency of these classifiers depends upon a number of factors including network training. Unfortunately, there is a shortage of evidence available to enable specific design choices to be made and, as a consequence, many designs are made on the basis of trial and error. In this study we develop prediction models to indicate the point at which training should stop for Neural Network based Electrocardiogram classifiers in order to ensure maximum generalisation.
Methods: Two prediction models have been presented; one based on Neural Networks and the other on Genetic Programming. The inputs to the models were 5 variable training parameters and the output indicated the point at which training should stop. Training and testing of the models was based on the results from 44 previously developed bi-group Neural Network classifiers, discriminating between Anterior Myocardial Infarction and normal patients.
Results: Our results show that both approaches provide close fits to the training data; p = 0.627 and p = 0.304 for the Neural Network and Genetic Programming methods respectively. For unseen data, the Neural Network exhibited no significant differences between actual and predicted outputs (p = 0.306) while the Genetic Programming method showed a marginally significant difference (p = 0.047).
Conclusions: The approaches provide reverse engineering solutions to the development of Neural Network based Electrocardiogram classifiers. That is, given the network design and architecture, an indication can be given as to when training should stop to obtain maximum network generalisation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example ontology:
Ontology: Music Ontology
Expected Output:
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
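As a small illustration of how such expected-output records could be consumed, the sketch below loads one record and checks each triple's relation against an allowed relation set. The file name and the relation list are illustrative placeholders, not the benchmark's actual files.

```python
import json

# Illustrative relation set; in practice this would come from the ontology files.
allowed_relations = {"publication date", "lyrics by", "composer", "performer"}

with open("expected_output.json") as f:  # hypothetical file name
    record = json.load(f)

for triple in record["triples"]:
    ok = triple["rel"] in allowed_relations
    print(f'({triple["sub"]}, {triple["rel"]}, {triple["obj"]}) ontology-compliant: {ok}')
```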
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows:
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under a CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under a CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Full title: Using Decision Trees to Detect and Isolate Simulated Leaks in the J-2X Rocket Engine. Mark Schwabacher, NASA Ames Research Center; Robert Aguilar, Pratt & Whitney Rocketdyne; Fernando Figueroa, NASA Stennis Space Center.
Abstract: The goal of this work was to use data-driven methods to automatically detect and isolate faults in the J-2X rocket engine. It was decided to use decision trees, since they tend to be easier to interpret than other data-driven methods. The decision tree algorithm automatically "learns" a decision tree by performing a search through the space of possible decision trees to find one that fits the training data. The particular decision tree algorithm used is known as C4.5. Simulated J-2X data from a high-fidelity simulator developed at Pratt & Whitney Rocketdyne and known as the Detailed Real-Time Model (DRTM) was used to "train" and test the decision tree. Fifty-six DRTM simulations were performed for this purpose, with different leak sizes, different leak locations, and different times of leak onset. To make the simulations as realistic as possible, they included simulated sensor noise and a gradual degradation in both fuel and oxidizer turbine efficiency. A decision tree was trained using 11 of these simulations and tested using the remaining 45 simulations. In the training phase, the C4.5 algorithm was provided with labeled examples of data from nominal operation and data including leaks in each leak location. From the data, it "learned" a decision tree that can classify unseen data as having no leak or having a leak in one of the five leak locations. In the test phase, the decision tree produced very low false alarm rates and low missed detection rates on the unseen data. It had very good fault isolation rates for three of the five simulated leak locations, but it tended to confuse the remaining two locations, perhaps because a large leak at one of these two locations can look very similar to a small leak at the other location.
Introduction: The J-2X rocket engine will be tested on Test Stand A-1 at NASA Stennis Space Center (SSC) in Mississippi. A team including people from SSC, NASA Ames Research Center (ARC), and Pratt & Whitney Rocketdyne (PWR) is developing a prototype end-to-end integrated systems health management (ISHM) system that will be used to monitor the test stand and the engine while the engine is on the test stand [1]. The prototype will use several different methods for detecting and diagnosing faults in the test stand and the engine, including rule-based, model-based, and data-driven approaches. SSC is currently using the G2 tool (http://www.gensym.com) to develop rule-based and model-based fault detection and diagnosis capabilities for the A-1 test stand. This paper describes preliminary results in applying the data-driven approach to detecting and diagnosing faults in the J-2X engine. The conventional approach to detecting and diagnosing faults in complex engineered systems such as rocket engines and test stands is to use large numbers of human experts. Test controllers watch the data in near-real time during each engine test. Engineers study the data after each test. These experts are aided by limit checks that signal when a particular variable goes outside of a predetermined range. The conventional approach is very labor intensive. Also, humans may not be able to recognize faults that involve the relationships among large numbers of variables. Further, some potential faults could happen too quickly for humans to detect them and react before they become catastrophic. Automated fault detection and diagnosis is therefore needed. One approach to automation is to encode human knowledge into rules or models. Another approach is to use data-driven methods to automatically learn models from historical data or simulated data. Our prototype will combine the data-driven approach with the model-based and rule-based approaches.
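The abstract above does not include code, but the general shape of the data-driven step can be sketched with scikit-learn's CART implementation as a stand-in for C4.5. The feature names and randomly generated labels below are purely illustrative, not the DRTM variables or results.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Illustrative stand-in for labeled simulation data: sensor-like features and a
# class per condition (0 = nominal operation, 1..5 = one of five leak locations).
X_train = rng.normal(size=(1100, 6))
y_train = rng.integers(0, 6, size=1100)
X_test = rng.normal(size=(4500, 6))

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
predictions = tree.predict(X_test)  # classify unseen data into nominal / leak location

# Decision trees are easy to interpret; print the first part of the learned rules.
print(export_text(tree, feature_names=[f"sensor_{i}" for i in range(6)])[:400])
```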
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
320 drugs are studied in the SIDER dataset and 323 drugs are used in the FAERS dataset. The selected drugs are found in both of these datasets as well as in the selected low-quality LINCS L1000 dataset. (XLSX)
Managing beetles that infest stored products is crucial for reducing losses in harvest supply chains and improving food security and safety. Successful pest management programs require effective and timely monitoring, but traditional methods for detecting pests are time- and labor-intensive and require taxonomic expertise. New, automated methods using computer vision have the potential to improve the accuracy and speed of detection, but they often struggle to differentiate between beetle species, which tend to be small and morphologically similar. Our research centers on five economically significant beetle species, referred to as the 'Beetle Byte Quintet,' and proposes a novel methodology leveraging Vision Transformers (ViT) to enhance the precision and robustness of their classification. The method involves using an image profiling technique to capture morphological characteristics such as body shape, color, and exoskeleton structures that are key for distinguishing between species. By utilizing this species profiling, the ViT model achieved an accuracy rate of over 99.34% during training and 96.57% during testing. These findings highlight the model's ability to generalize and maintain precision on new, unseen data, significantly surpassing traditional computer vision algorithms. The integration of ViT can help enable real-time monitoring and is adaptable to a range of pest monitoring solutions for large-scale storage settings, addressing the complexities of such environments. This AI-driven approach not only simplifies species identification but also promotes accurate and targeted pest control practices, leading to reduced economic losses and improved food security. A subsample of the images used in the model is included here for Rhyzopertha dominica (lesser grain borer), Sitophilus zeamais (maize weevil), Tribolium castaneum (red flour beetle), Cryptolestes ferrugineus (rusty grain beetle), and Oryzaephilus surinamensis (sawtoothed grain beetle). Custom MatLab code and a data descriptor README are also included.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found the federal forest inventory of Lower Saxony, Germany represents an unseen treasure of annotated samples for training data generation. The respective 20-cm Color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.
Version history
v1.0.0 - First release
Citation
Ahlswede et al. (in prep.)
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
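A minimal sketch for inspecting one of these patches is shown below. It assumes the rasterio package is installed and uses a hypothetical aerial patch path following the folder structure above; the band order is the one stated for the aerial folder.

```python
import rasterio

# Hypothetical path following the archive's folder structure.
with rasterio.open("aerial/60m/Abies_alba_3_834_WEFL_NLF.tif") as src:
    patch = src.read()          # shape: (bands, rows, cols), e.g. (4, 304, 304)
    print(src.count, src.res)   # 4 bands and 0.2 m spatial resolution expected

# Band order stated in the archive description: near-infrared, red, green, blue.
nir, red, green, blue = patch
```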
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. Code example of an image sample with respective proportions of 94% for Abies and 6% for Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]
The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
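A short sketch for reading the multi-label JSON and the fixed split lists follows; the label file name is a placeholder (see the "labels" folder description above), while the .lst file names are taken from the archive description.

```python
import json

# Placeholder path for the multi-label JSON described in the "labels" folder.
with open("labels/multi_labels.json") as f:
    labels = json.load(f)
print(labels["Abies_alba_3_834_WEFL_NLF.tif"])  # e.g. [["Abies", 0.93771], ["Larix", 0.06229]]

# Fixed train/test split lists as named in the archive description.
with open("train_filenames.lst") as f:
    train_files = [line.strip() for line in f if line.strip()]
with open("test_filesnames.lst") as f:
    test_files = [line.strip() for line in f if line.strip()]
print(len(train_files), "train files,", len(test_files), "test files")
```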
The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community in development of better deep learning and machine learning models for forest applications. You might have questions or want to share code/results/publications using that archive? Feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Publications
Ahlswede et al. (2022, in prep.): TreeSatAI Dataset Publication
Ahlswede S., Nimisha, T.M., and Demir, B. (2022, in revision): Embedded Self-Enhancement Maps for Weakly Supervised Tree Species Mapping in Remote Sensing Images. IEEE Trans Geosci Remote Sens
Schulz et al. (2022, in prep.): Phenoprofiling
Conference contributions
S. Ahlswede, N. T. Madam, C. Schulz, B. Kleinschmit and B. Demіr, "Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods", IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, “Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series”, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova and B. Kleinschmit, “The temporal fingerprints of common European forest types from SAR and optical remote sensing data”, AGU Fall Meeting, New Orleans, USA, 2021.
B. Kleinschmit, M. Förster, C. Schulz, F. Arias, B. Demir, S. Ahlswede, A. K. Aksoy, T. Ha Minh, J. Hees, C. Gava, P. Helber, B. Bischke, P. Habelitz, A. Frick, R. Klinke, S. Gey, D. Seidel, S. Przywarra, R. Zondag and B. Odermatt, “Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests”, Living Planet Symposium, Bonn, Germany, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, (2022, submitted): “Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series”, ForestSAT, Berlin, Germany, 2022.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GameOfLife dataset is an algorithmically generated dataset based on John Horton Conway's Game of Life. Conway's Game of Life follows a strict set of rules at each "generation" (simulation step), where cells alternate between a dead and an alive state based on the number of surrounding alive cells. These rules can be found on the Game of Life's Wikipedia page. This dataset is one of the three hidden datasets used by the 2025 NAS Unseen-Data Challenge at AutoML.
The goal of this dataset is to predict the number of cells alive in the next generation. This task is relatively simple, if a bit tedious, for a human to do, and should theoretically be simple for machine learning algorithms. Each cell's state is calculated from the number of alive neighbours in the previous step. Effectively, for every cell we only need to look at the surrounding eight cells (a 3x3 square, minus the centre), which means all the information for each cell can be captured by a 3x3 convolution, a very common kernel size. The dataset was used to make sure that participants' approaches could handle simple tasks alongside the more complicated ones, i.e., that they did not overcomplicate their submissions.
There are 70,000 images in the dataset, where each image is a randomly generated starting configuration of the Game of Life with a random level of density (number of initially alive cells). The data is stored in a channels-first format with a shape of (n, 1, 10, 10), where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing). There are 25 classes in this dataset, where the label (0..24) represents the number of alive cells in the next generation, and images are evenly distributed by class across the dataset (2,800 each; 2,000, 400, and 400 for training, validation, and testing respectively). We limit the data to 25 classes despite a theoretical range of 0-100 because the higher counts are increasingly unlikely to occur and would take much longer to generate for a balanced dataset. Excluding 0, the lower counts also become increasingly unlikely, though more likely than the higher ones; we wanted to prevent gaps and therefore limited the labels to 25 contiguous classes.
NumPy (.npy) files can be opened through the NumPy Python library, using the numpy.load() function by passing the path to the file as a parameter. The metadata file contains some basic information about the datasets and can be opened in many text editors such as vim, nano, notepad++, notepad, etc.
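A short sketch of loading the arrays and checking a label by simulating one generation with a 3x3 convolution is shown below. The .npy file names are placeholders; the shapes and rules follow the description above, and cells outside the 10x10 grid are assumed to be dead.

```python
import numpy as np
from scipy.signal import convolve2d

# Placeholder file names; arrays have shape (n, 1, 10, 10) as described above.
x_train = np.load("train_x.npy")
y_train = np.load("train_y.npy")

# Count the alive neighbours of every cell with a 3x3 kernel (centre excluded).
kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
board = x_train[0, 0]                         # one 10x10 starting configuration
neighbours = convolve2d(board, kernel, mode="same")  # zero-padded: outside cells are dead

# Conway's rules: live cells survive with 2-3 neighbours, dead cells come alive with exactly 3.
next_gen = ((board == 1) & np.isin(neighbours, [2, 3])) | ((board == 0) & (neighbours == 3))
print(int(next_gen.sum()), "alive in next generation; label says", int(y_train[0]))
```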
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 30 September 2021.
--- Dataset description provided by original source is as follows ---
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
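As a concrete illustration of the submission format, here is a minimal pandas sketch reproducing the gender_submission.csv baseline. It assumes the usual Kaggle column names PassengerId and Sex in test.csv; the output file name is a placeholder.

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Baseline described above: all and only female passengers survive.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})
submission.to_csv("my_submission.csv", index=False)
print(submission.head())
```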
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance of the predictive models using the training dataset.
Description: This repository presents data collected to investigate the role of embodiment and supervision in learning. This is done inside a simulated 3D maze world with a navigation task using mainly visual input in the form of RGB images. The main contribution of this data repository is to provide a network model trained in this environment with weak supervision and a closed loop between action and perception. Additionally, control networks are provided which were trained with varying degrees of supervision and embodiment. In the corresponding paper [1], the representations of these networks are compared based on sparsity measures as well as the content of the encodings and the possibility to extract semantic labels. For the training of the control conditions, several new datasets were created which are also included here. They contain a collection of images from the simulated world with corresponding semantic labels. Overall, they provide a good basis for further analysis and a more in-depth investigation of representation learning and the effect of embodiment and supervision on representations.
Steps to reproduce: Data was generated through a 3D simulation of a maze environment called Obstacle Tower. The data of interest are the trained neural network weights and the network activations corresponding to different input frames. Three main networks were trained: a reinforcement learning agent trained through interaction with the simulated environment, an autoencoder trained to reconstruct images collected by the agent, and a classifier trained to classify objects in the images. Exact training and testing conditions, hyperparameters, and network structure are provided in the corresponding paper. For the training of the reinforcement learning agent, the Unity ml-agents toolkit PPO implementation is used with small modifications for extra data collection and control experiments. The code we used can be found here: https://github.com/vkakerbeck/ml-agents-dev. Model checkpoint files are saved for different points in training, but mostly the final version of the network is analysed in the corresponding paper [1]. The autoencoder and classifier are trained using Python with TensorFlow and Keras. The corresponding code can be found here: https://github.com/vkakerbeck/Learning-World-Representations/tree/master/DataAnalysis. The data also contain activations in the hidden layer of the network corresponding to 4000 test images for all three networks. Code for this can be found in the same GitHub repository. The datasets used for training the autoencoder and classifier were created by collecting observations in the Obstacle Tower environment using the trained agent. These observations were then labelled automatically, and the labels were cross-checked by hand. A description of the individual files is included in the data folder (Description.txt). Due to storage constraints, not all model checkpoint files used to create figure 6 of the paper could be uploaded. However, feel free to contact me (vkakerbeck[at]uos.de) if you are interested in these detailed checkpoint files of the control runs and I will make them available to you.
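For readers who want to reproduce the activation analysis described above, here is a minimal Keras sketch for extracting hidden-layer activations from a saved model. The checkpoint path, layer name, and image file are placeholders; the repository's Description.txt documents the actual files.

```python
import numpy as np
import tensorflow as tf

# Placeholder paths and layer name -- see Description.txt in the data folder.
model = tf.keras.models.load_model("autoencoder_checkpoint.h5")
encoder = tf.keras.Model(inputs=model.input,
                         outputs=model.get_layer("hidden_layer").output)

images = np.load("test_images.npy")   # e.g. the 4000 test frames
activations = encoder.predict(images)
print(activations.shape)
```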
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The test set contains 3,000 images, which were drawn from the same distribution as the training set. The test set is used to evaluate the performance of machine learning models on unseen data.
With the preprocessed dataset at hand, we can now move forward with a variety of actions depending on the nature of the data and the specific goals of the analysis. Some common actions might include:
Exploratory Data Analysis (EDA): We can begin by exploring the dataset to gain insights and a better understanding of its structure, contents, and statistical properties. This can involve tasks such as computing summary statistics, visualizing distributions, detecting outliers, and identifying patterns or relationships between variables.
Feature Engineering: If the dataset contains raw data or basic features, we can create new features that may be more informative or suitable for the specific analysis. This can involve mathematical transformations, combining existing features, or extracting relevant information from text or timestamps.
Model Training: With the preprocessed dataset, we can proceed with training machine learning models to perform various tasks such as classification, regression, clustering, or recommendation. This typically involves splitting the data into training and testing sets, selecting appropriate models, and optimizing their parameters to achieve the best performance.
Model Evaluation: Once the models are trained, we can evaluate their performance using appropriate metrics such as accuracy, precision, recall, or mean squared error. This allows us to assess how well the models generalize to unseen data and make informed decisions about their effectiveness.
Predictions and Inference: Using the trained models, we can make predictions or perform inference on new or unseen data points. This can be valuable for tasks such as making predictions about future events, identifying anomalies, or generating recommendations based on user preferences.
Visualization and Reporting: To communicate the findings and results effectively, we can create visualizations, reports, or interactive dashboards summarizing the analysis. This helps stakeholders understand the insights and make informed decisions based on the data.
By leveraging the preprocessed dataset, we can streamline our analysis and focus on extracting meaningful insights or solving specific problems without the need for extensive data cleaning and preprocessing steps.
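To make the workflow above concrete, here is a minimal scikit-learn sketch covering the split/train/evaluate/predict steps. The CSV name, feature columns, and target column are placeholders for an already preprocessed dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Placeholder file and column names for a preprocessed dataset.
df = pd.read_csv("preprocessed.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Split, train, evaluate, then predict on the held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
```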