Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
If you're working on a house price prediction project, it's common to have both training and testing datasets that contain valuable information about different properties. The training set is utilized to train your machine learning model, allowing it to learn patterns and relationships within the data, while the testing set is held back to evaluate how well the model generalizes to new, unseen data.
However, in certain scenarios, practitioners may choose to combine the training and testing datasets into a single dataset for efficiency and convenience. This combined dataset approach aims to streamline the coding process, potentially saving time by avoiding the need to manage and preprocess two separate datasets. This can be particularly beneficial in situations where rapid prototyping or exploratory data analysis is the primary focus.
However, feature engineering, which involves transforming raw data into meaningful features, can become more intricate when dealing with a unified dataset. Insights gained from the testing set could leak into decisions made during the training phase, potentially compromising the model's ability to accurately predict house prices for new instances.
If you decide to proceed with a combined dataset, careful steps must be taken to mitigate potential issues, such as handling data preprocessing, missing values, and feature scaling separately for the training and testing portions. Additionally, it's essential to be cautious with evaluation metrics and consider techniques like cross-validation applied exclusively to the training data to ensure a robust and unbiased assessment of your model's performance.
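As a rough illustration of that last point, here is a minimal sketch (assuming a pandas DataFrame with a SalePrice target and purely numeric features; the file and column names are placeholders, not part of any specific dataset) of fitting preprocessing on the training portion only and cross-validating exclusively on the training data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Placeholder file and column names -- adapt to your own dataset.
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]

# Hold out a test portion; all fitting below touches only the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Imputation and scaling live inside the pipeline, so they are re-fit on each
# cross-validation fold's training split -- never on held-out data.
model = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), Ridge())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV R^2:", cv_scores.mean())

# Only after model selection is the held-out portion used, once.
model.fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))
```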
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.
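For readers unfamiliar with the RBF expansion mentioned above, the sketch below shows the general idea of expanding a scalar descriptor into a vector of Gaussian basis values; the centres, width, and descriptor values are illustrative and are not taken from the paper's implementation:

```python
import numpy as np

def rbf_expand(values, centers, gamma):
    """Expand each scalar in `values` into exp(-gamma * (x - c)^2) over the given centers."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)    # shape (n, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)  # shape (1, k)
    return np.exp(-gamma * (values - centers) ** 2)            # shape (n, k)

# Illustrative example: expand three partial-charge-like values onto 8 centers.
charges = [-0.42, 0.13, 0.55]
centers = np.linspace(-1.0, 1.0, 8)
features = rbf_expand(charges, centers, gamma=10.0)
print(features.shape)  # (3, 8)
```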
Below are descriptions of the available scripts:
atom_bond_descriptors.sh: Trains atom/bond targets.
atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.
dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.
energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.
get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.
Below is the procedure for running the ml-QM-GNN on your own dataset:
1. Run get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
2. Run atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
3. Run csv2pkl.py to convert the data from the predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).
https://dataverse.lib.nycu.edu.tw/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57770/A17ZGB
Deep neural networks (DNNs) are known to perform well when deployed to test distributions that share high similarity with the training distribution. Feeding DNNs sequentially with new data that were unseen in the training distribution poses two major challenges: fast adaptation to new tasks and catastrophic forgetting of old tasks. Such difficulties paved the way for the ongoing research on few-shot learning and continual learning. To tackle these problems, we introduce Attentive Independent Mechanisms (AIM). We incorporate the idea of learning using fast and slow weights in conjunction with the decoupling of the feature extraction and higher-order conceptual learning of a DNN. AIM is designed for higher-order conceptual learning, modeled by a mixture of experts that compete to learn independent concepts to solve a new task. AIM is a modular component that can be inserted into existing deep learning frameworks. We demonstrate its capability for few-shot learning by adding it to SIB and training on MiniImageNet and CIFAR-FS, showing significant improvement. AIM is also applied to ANML and OML, trained on Omniglot, CIFAR-100 and MiniImageNet, to demonstrate its capability in continual learning.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thalassemia is an inherited blood disorder and is among the five most prevalent birth-related complications, especially in Southeast Asia. Thalassemia is classified into two main types, alpha-thalassemia and beta-thalassemia, based on the reduced or absent production of the corresponding globin chains. Over the past couple of decades, researchers have increasingly focused on the application of machine learning algorithms to medical data for identifying hidden patterns to assist in the prediction and classification of diseases and patients. To effectively analyze more complex medical data, more robust machine learning models have been developed to address various health issues. Many researchers have employed different artificial intelligence-based algorithms, e.g., Random Forest, Decision Tree, Support Vector Machine, ensemble-based classifiers, and deep neural networks, to accurately detect carriers of beta-thalassemia by training on both diseased and normal test reports. While genetic testing is required by doctors for the most accurate diagnosis, a simple Complete Blood Count (CBC) report can be used to estimate the likelihood of being a beta-thalassemia carrier. Various models have successfully identified beta-thalassemia carriers using CBC data alone, but these models perform classification and prediction based on normalized data. They achieve high accuracy but at the cost of substantial changes to the dataset through class normalization. In this research, we propose a Dominance-based Rough Set Approach model to classify patients without balancing the classes (Normal, Abnormal); the model achieved good performance (91% accuracy). In terms of generalization, the proposed model obtained 89% accuracy on unseen data, comparable to or better than existing approaches.
Dataset Description:
This dataset is a comprehensive collection of Blood Cell Detection (BCD) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.
Blood Cell Detection (BCD) is an essential task in medical diagnostics, aiding in the evaluation of overall health and the detection of various disorders such as infections, anemia, and blood cancers. This dataset provides a rich source of BCD images that can be used to train machine learning models to automate the detection and analysis of different types of blood cells.
Training Set:
Validation Set:
Test Set:
Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.
This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include: - Automated Blood Cell Detection: Developing algorithms to automatically detect and analyze blood cells and provide diagnostic insights. - Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders. - Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of blood cell detection and analysis.
This BCD dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for blood cell detection.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We applied a random forest algorithm to process accelerometer data from broiler chickens. Data from three broiler strains at a range of ages (25-49 days old) were used to train and test the algorithm and, unlike other studies, the algorithm was further tested on an unseen broiler strain. When tested on unseen birds from the three training broiler strains, the random forest model classified behaviours with very good accuracy (92%) and specificity (94%), and good sensitivity (88%) and precision (88%). With the new, unseen strain, the model classified behaviours with very good accuracy (94%), sensitivity (91%), specificity (96%) and precision (91%).
https://creativecommons.org/publicdomain/zero/1.0/
By Arto (From Huggingface) [source]
The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.
Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.
Each entry in these CSV files includes a filename string that indicates the name or identifier of an image file stored in another location or directory, along with a list (or multiple rows) of strings representing written descriptions or captions describing the respective image.
Considering these details about the dataset's structure, it can be immensely valuable to researchers, developers, and enthusiasts working on innovative computer vision algorithms such as automatic text generation based on visual content analysis, whether that means training machine learning models to generate relevant captions for new, unseen images or evaluating existing systems' performance against diverse criteria.
Stay updated with cutting-edge research trends by leveraging this comprehensive dataset containing not only captions but also corresponding images across different sets specifically designed to cater to varied purposes within computer vision tasks.
Overview of the Dataset
The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.
Understanding the Files
- train.csv: This file contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
- test.csv: The test set is included in this file, which has a similar structure to that of train.csv. The purpose of this file is to evaluate your trained models on unseen data.
- valid.csv: This validation set provides images with their respective filenames (filename) and captions (captions). It allows you to fine-tune your models based on performance during evaluation.
Getting Started
To begin utilizing this dataset effectively, follow these steps:
- Extract the zip file containing all relevant data files onto your local machine or cloud environment.
- Familiarize yourself with each CSV file's structure: train.csv, test.csv, and valid.csv. Understand how information like filename(s) (filename) corresponds with its respective caption(s) (captions).
- Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
- Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed (see the loading sketch after these steps).
- Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
- Split the data into training, validation, and test sets according to your experimental design requirements.
- Use appropriate algorithms and techniques to train your image captioning models on the provided data.
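Referring back to the loading step above, here is a minimal pandas sketch for reading the three CSV files. It assumes the filename and captions column names described earlier; the images/ directory is an assumption about the layout, not something specified by the dataset.

```python
import pandas as pd

# Read the three splits described above.
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
test_df = pd.read_csv("test.csv")

# Group captions by image, since each image may have multiple caption rows.
train_captions = train_df.groupby("filename")["captions"].apply(list)
print(train_captions.head())

# Assumed layout: image files live under an images/ directory.
example_file = train_df["filename"].iloc[0]
print("First training image:", f"images/{example_file}")
```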
Enhancing Model Performance
To optimize model performance using this dataset, consider these tips:
- Explore different architectures and pre-trained models specifically designed for image captioning tasks.
- Experiment with various natural language
- Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
- Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
- Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
Background: Classification of the electrocardiogram using Neural Networks has become a widely used method in recent years. The efficiency of these classifiers depends upon a number of factors including network training. Unfortunately, there is a shortage of evidence available to enable specific design choices to be made and, as a consequence, many designs are made on the basis of trial and error. In this study we develop prediction models to indicate the point at which training should stop for Neural Network based Electrocardiogram classifiers in order to ensure maximum generalisation.
Methods: Two prediction models have been presented; one based on Neural Networks and the other on Genetic Programming. The inputs to the models were 5 variable training parameters and the output indicated the point at which training should stop. Training and testing of the models was based on the results from 44 previously developed bi-group Neural Network classifiers, discriminating between Anterior Myocardial Infarction and normal patients.
Results: Our results show that both approaches provide close fits to the training data; p = 0.627 and p = 0.304 for the Neural Network and Genetic Programming methods respectively. For unseen data, the Neural Network exhibited no significant differences between actual and predicted outputs (p = 0.306) while the Genetic Programming method showed a marginally significant difference (p = 0.047).
Conclusions: The approaches provide reverse engineering solutions to the development of Neural Network based Electrocardiogram classifiers. That is, given the network design and architecture, an indication can be given as to when training should stop to obtain maximum network generalisation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example ontology:
Ontology: Music Ontology
Expected Output:
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
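As a small illustration of how such expected-output records could be consumed, the sketch below loads one record and checks each triple's relation against an allowed relation set. The file name and the relation list are illustrative placeholders, not the benchmark's actual files.

```python
import json

# Illustrative relation set; in practice this would come from the ontology files.
allowed_relations = {"publication date", "lyrics by", "composer", "performer"}

with open("expected_output.json") as f:  # hypothetical file name
    record = json.load(f)

for triple in record["triples"]:
    ok = triple["rel"] in allowed_relations
    print(f'({triple["sub"]}, {triple["rel"]}, {triple["obj"]}) ontology-compliant: {ok}')
```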
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows:
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under a CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under a CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Full title: Using Decision Trees to Detect and Isolate Simulated Leaks in the J-2X Rocket Engine. Mark Schwabacher, NASA Ames Research Center; Robert Aguilar, Pratt & Whitney Rocketdyne; Fernando Figueroa, NASA Stennis Space Center.
Abstract: The goal of this work was to use data-driven methods to automatically detect and isolate faults in the J-2X rocket engine. It was decided to use decision trees, since they tend to be easier to interpret than other data-driven methods. The decision tree algorithm automatically "learns" a decision tree by performing a search through the space of possible decision trees to find one that fits the training data. The particular decision tree algorithm used is known as C4.5. Simulated J-2X data from a high-fidelity simulator developed at Pratt & Whitney Rocketdyne and known as the Detailed Real-Time Model (DRTM) was used to "train" and test the decision tree. Fifty-six DRTM simulations were performed for this purpose, with different leak sizes, different leak locations, and different times of leak onset. To make the simulations as realistic as possible, they included simulated sensor noise and a gradual degradation in both fuel and oxidizer turbine efficiency. A decision tree was trained using 11 of these simulations and tested using the remaining 45 simulations. In the training phase, the C4.5 algorithm was provided with labeled examples of data from nominal operation and data including leaks in each leak location. From the data, it "learned" a decision tree that can classify unseen data as having no leak or having a leak in one of the five leak locations. In the test phase, the decision tree produced very low false alarm rates and low missed detection rates on the unseen data. It had very good fault isolation rates for three of the five simulated leak locations, but it tended to confuse the remaining two locations, perhaps because a large leak at one of these two locations can look very similar to a small leak at the other location.
Introduction: The J-2X rocket engine will be tested on Test Stand A-1 at NASA Stennis Space Center (SSC) in Mississippi. A team including people from SSC, NASA Ames Research Center (ARC), and Pratt & Whitney Rocketdyne (PWR) is developing a prototype end-to-end integrated systems health management (ISHM) system that will be used to monitor the test stand and the engine while the engine is on the test stand [1]. The prototype will use several different methods for detecting and diagnosing faults in the test stand and the engine, including rule-based, model-based, and data-driven approaches. SSC is currently using the G2 tool (http://www.gensym.com) to develop rule-based and model-based fault detection and diagnosis capabilities for the A-1 test stand. This paper describes preliminary results in applying the data-driven approach to detecting and diagnosing faults in the J-2X engine. The conventional approach to detecting and diagnosing faults in complex engineered systems such as rocket engines and test stands is to use large numbers of human experts. Test controllers watch the data in near-real time during each engine test. Engineers study the data after each test. These experts are aided by limit checks that signal when a particular variable goes outside of a predetermined range. The conventional approach is very labor intensive. Also, humans may not be able to recognize faults that involve the relationships among large numbers of variables. Further, some potential faults could happen too quickly for humans to detect them and react before they become catastrophic. Automated fault detection and diagnosis is therefore needed. One approach to automation is to encode human knowledge into rules or models. Another approach is to use data-driven methods to automatically learn models from historical data or simulated data. Our prototype will combine the data-driven approach with the model-based and rule-based approaches.
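The abstract above does not include code, but the general shape of the data-driven step can be sketched with scikit-learn's CART implementation as a stand-in for C4.5. The feature names and randomly generated labels below are purely illustrative, not the DRTM variables or results.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Illustrative stand-in for labeled simulation data: sensor-like features and a
# class per condition (0 = nominal operation, 1..5 = one of five leak locations).
X_train = rng.normal(size=(1100, 6))
y_train = rng.integers(0, 6, size=1100)
X_test = rng.normal(size=(4500, 6))

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
predictions = tree.predict(X_test)  # classify unseen data into nominal / leak location

# Decision trees are easy to interpret; print the first part of the learned rules.
print(export_text(tree, feature_names=[f"sensor_{i}" for i in range(6)])[:400])
```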
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
320 drugs are studied in the SIDER dataset and 323 drugs are used in the FAERS dataset. The selected drugs are found in both of these datasets as well as in the selected low-quality LINCS L1000 dataset. (XLSX)
Managing beetles that infest stored products is crucial for reducing losses in harvest supply chains and improving food security and safety. Successful pest management programs require effective and timely monitoring, but traditional methods for detecting pests are time- and labor-intensive and require taxonomic expertise. New, automated methods using computer vision have the potential to improve the accuracy and speed of detection, but they often struggle to differentiate between beetle species, which tend to be small and morphologically similar. Our research centers on five economically significant beetle species, referred to as the 'Beetle Byte Quintet,' and proposes a novel methodology leveraging Vision Transformers (ViT) to enhance the precision and robustness of their classification. The method involves using an image profiling technique to capture morphological characteristics such as body shape, color, and exoskeleton structures that are key for distinguishing between species. By utilizing this species profiling, the ViT model achieved an accuracy rate of over 99.34% during training and 96.57% during testing. These findings highlight the model's ability to generalize and maintain precision on new, unseen data, significantly surpassing traditional computer vision algorithms. The integration of ViT can help enable real-time monitoring and is adaptable to a range of pest monitoring solutions for large-scale storage settings, addressing the complexities of such environments. This AI-driven approach not only simplifies species identification but also promotes accurate and targeted pest control practices, leading to reduced economic losses and improved food security. A subsample of the images used in the model is included here for Rhyzopertha dominica (lesser grain borer), Sitophilus zeamais (maize weevil), Tribolium castaneum (red flour beetle), Cryptolestes ferrugineus (rusty grain beetle), and Oryzaephilus surinamensis (sawtoothed grain beetle). Custom MatLab code and a data descriptor README are also included.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found the federal forest inventory of Lower Saxony, Germany represents an unseen treasure of annotated samples for training data generation. The respective 20-cm Color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.
Version history
v1.0.0 - First release
Citation
Ahlswede et al. (in prep.)
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
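A minimal sketch for inspecting one of these patches is shown below. It assumes the rasterio package is installed and uses a hypothetical aerial patch path following the folder structure above; the band order is the one stated for the aerial folder.

```python
import rasterio

# Hypothetical path following the archive's folder structure.
with rasterio.open("aerial/60m/Abies_alba_3_834_WEFL_NLF.tif") as src:
    patch = src.read()          # shape: (bands, rows, cols), e.g. (4, 304, 304)
    print(src.count, src.res)   # 4 bands and 0.2 m spatial resolution expected

# Band order stated in the archive description: near-infrared, red, green, blue.
nir, red, green, blue = patch
```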
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. Code example of an image sample with respective proportions of 94% for Abies and 6% for Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]
The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
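A short sketch for reading the multi-label JSON and the fixed split lists follows; the label file name is a placeholder (see the "labels" folder description above), while the .lst file names are taken from the archive description.

```python
import json

# Placeholder path for the multi-label JSON described in the "labels" folder.
with open("labels/multi_labels.json") as f:
    labels = json.load(f)
print(labels["Abies_alba_3_834_WEFL_NLF.tif"])  # e.g. [["Abies", 0.93771], ["Larix", 0.06229]]

# Fixed train/test split lists as named in the archive description.
with open("train_filenames.lst") as f:
    train_files = [line.strip() for line in f if line.strip()]
with open("test_filesnames.lst") as f:
    test_files = [line.strip() for line in f if line.strip()]
print(len(train_files), "train files,", len(test_files), "test files")
```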
The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community in development of better deep learning and machine learning models for forest applications. You might have questions or want to share code/results/publications using that archive? Feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Publications
Ahlswede et al. (2022, in prep.): TreeSatAI Dataset Publication
Ahlswede S., Nimisha, T.M., and Demir, B. (2022, in revision): Embedded Self-Enhancement Maps for Weakly Supervised Tree Species Mapping in Remote Sensing Images. IEEE Trans Geosci Remote Sens
Schulz et al. (2022, in prep.): Phenoprofiling
Conference contributions
S. Ahlswede, N. T. Madam, C. Schulz, B. Kleinschmit and B. Demіr, "Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods", IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, “Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series”, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova and B. Kleinschmit, “The temporal fingerprints of common European forest types from SAR and optical remote sensing data”, AGU Fall Meeting, New Orleans, USA, 2021.
B. Kleinschmit, M. Förster, C. Schulz, F. Arias, B. Demir, S. Ahlswede, A. K. Aksoy, T. Ha Minh, J. Hees, C. Gava, P. Helber, B. Bischke, P. Habelitz, A. Frick, R. Klinke, S. Gey, D. Seidel, S. Przywarra, R. Zondag and B. Odermatt, “Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests”, Living Planet Symposium, Bonn, Germany, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, (2022, submitted): “Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series”, ForestSAT, Berlin, Germany, 2022.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GameOfLife dataset is an algorithmically generated dataset based on John Horton Conway's Game of Life. Conway's Game of Life follows a strict set of rules at each "generation" (simulation step), where cells alternate between a dead and an alive state based on the number of surrounding alive cells. These rules can be found on the Game of Life's Wikipedia page. This dataset is one of the three hidden datasets used by the 2025 NAS Unseen-Data Challenge at AutoML.
The goal of this dataset is to predict the number of cells alive in the next generation. This task is relatively simple, if a bit tedious, for a human to do, and should theoretically be simple for machine learning algorithms. Each cell's state is calculated from the number of alive neighbours in the previous step. Effectively, for every cell we only need to look at the surrounding eight cells (a 3x3 square, minus the centre), which means all the information for each cell can be captured by a 3x3 convolution, a very common kernel size. The dataset was used to make sure that participants' approaches could handle simple tasks alongside the more complicated ones, i.e., that they did not overcomplicate their submissions.
There are 70,000 images in the dataset, where each image is a randomly generated starting configuration of the Game of Life with a random level of density (number of initially alive cells). The data is stored in a channels-first format with a shape of (n, 1, 10, 10), where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing). There are 25 classes in this dataset, where the label (0..24) represents the number of alive cells in the next generation, and images are evenly distributed by class across the dataset (2,800 each; 2,000, 400, and 400 for training, validation, and testing respectively). We limit the data to 25 classes despite a theoretical range of 0-100 because the higher counts are increasingly unlikely to occur and would take much longer to generate for a balanced dataset. Excluding 0, the lower counts also become increasingly unlikely, though more likely than the higher ones; we wanted to prevent gaps and therefore limited the labels to 25 contiguous classes.
NumPy (.npy) files can be opened through the NumPy Python library, using the numpy.load() function by passing the path to the file as a parameter. The metadata file contains some basic information about the datasets and can be opened in many text editors such as vim, nano, notepad++, notepad, etc.
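A short sketch of loading the arrays and checking a label by simulating one generation with a 3x3 convolution is shown below. The .npy file names are placeholders; the shapes and rules follow the description above, and cells outside the 10x10 grid are assumed to be dead.

```python
import numpy as np
from scipy.signal import convolve2d

# Placeholder file names; arrays have shape (n, 1, 10, 10) as described above.
x_train = np.load("train_x.npy")
y_train = np.load("train_y.npy")

# Count the alive neighbours of every cell with a 3x3 kernel (centre excluded).
kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
board = x_train[0, 0]                         # one 10x10 starting configuration
neighbours = convolve2d(board, kernel, mode="same")  # zero-padded: outside cells are dead

# Conway's rules: live cells survive with 2-3 neighbours, dead cells come alive with exactly 3.
next_gen = ((board == 1) & np.isin(neighbours, [2, 3])) | ((board == 0) & (neighbours == 3))
print(int(next_gen.sum()), "alive in next generation; label says", int(y_train[0]))
```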
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 30 September 2021.
--- Dataset description provided by original source is as follows ---
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
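As a concrete illustration of the submission format, here is a minimal pandas sketch reproducing the gender_submission.csv baseline. It assumes the usual Kaggle column names PassengerId and Sex in test.csv; the output file name is a placeholder.

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Baseline described above: all and only female passengers survive.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})
submission.to_csv("my_submission.csv", index=False)
print(submission.head())
```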
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance of the predictive models using the training dataset.
Description: This repository presents data collected to investigate the role of embodiment and supervision in learning. This is done inside a simulated 3D maze world with a navigation task using mainly visual input in the form of RGB images. The main contribution of this data repository is to provide a network model trained in this environment with weak supervision and a closed loop between action and perception. Additionally, control networks are provided which were trained with varying degrees of supervision and embodiment. In the corresponding paper [1], the representations of these networks are compared based on sparsity measures as well as the content of the encodings and the possibility to extract semantic labels. For the training of the control conditions, several new datasets were created which are also included here. They contain a collection of images from the simulated world with corresponding semantic labels. Overall, they provide a good basis for further analysis and a more in-depth investigation of representation learning and the effect of embodiment and supervision on representations.
Steps to reproduce: Data was generated through a 3D simulation of a maze environment called Obstacle Tower. The data of interest are the trained neural network weights and the network activations corresponding to different input frames. Three main networks were trained: a reinforcement learning agent trained through interaction with the simulated environment, an autoencoder trained to reconstruct images collected by the agent, and a classifier trained to classify objects in the images. Exact training and testing conditions, hyperparameters, and network structure are provided in the corresponding paper. For the training of the reinforcement learning agent, the Unity ml-agents toolkit PPO implementation is used with small modifications for extra data collection and control experiments. The code we used can be found here: https://github.com/vkakerbeck/ml-agents-dev. Model checkpoint files are saved for different points in training, but mostly the final version of the network is analysed in the corresponding paper [1]. The autoencoder and classifier are trained using Python with TensorFlow and Keras. The corresponding code can be found here: https://github.com/vkakerbeck/Learning-World-Representations/tree/master/DataAnalysis. The data also contain activations in the hidden layer of the network corresponding to 4000 test images for all three networks. Code for this can be found in the same GitHub repository. The datasets used for training the autoencoder and classifier were created by collecting observations in the Obstacle Tower environment using the trained agent. These observations were then labelled automatically, and the labels were cross-checked by hand. A description of the individual files is included in the data folder (Description.txt). Due to storage constraints, not all model checkpoint files used to create figure 6 of the paper could be uploaded. However, feel free to contact me (vkakerbeck[at]uos.de) if you are interested in these detailed checkpoint files of the control runs and I will make them available to you.
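For readers who want to reproduce the activation analysis described above, here is a minimal Keras sketch for extracting hidden-layer activations from a saved model. The checkpoint path, layer name, and image file are placeholders; the repository's Description.txt documents the actual files.

```python
import numpy as np
import tensorflow as tf

# Placeholder paths and layer name -- see Description.txt in the data folder.
model = tf.keras.models.load_model("autoencoder_checkpoint.h5")
encoder = tf.keras.Model(inputs=model.input,
                         outputs=model.get_layer("hidden_layer").output)

images = np.load("test_images.npy")   # e.g. the 4000 test frames
activations = encoder.predict(images)
print(activations.shape)
```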
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The test set contains 3,000 images, which were drawn from the same distribution as the training set. The test set is used to evaluate the performance of machine learning models on unseen data.
With the preprocessed dataset at hand, we can now move forward with a variety of actions depending on the nature of the data and the specific goals of the analysis. Some common actions might include:
Exploratory Data Analysis (EDA): We can begin by exploring the dataset to gain insights and a better understanding of its structure, contents, and statistical properties. This can involve tasks such as computing summary statistics, visualizing distributions, detecting outliers, and identifying patterns or relationships between variables.
Feature Engineering: If the dataset contains raw data or basic features, we can create new features that may be more informative or suitable for the specific analysis. This can involve mathematical transformations, combining existing features, or extracting relevant information from text or timestamps.
Model Training: With the preprocessed dataset, we can proceed with training machine learning models to perform various tasks such as classification, regression, clustering, or recommendation. This typically involves splitting the data into training and testing sets, selecting appropriate models, and optimizing their parameters to achieve the best performance.
Model Evaluation: Once the models are trained, we can evaluate their performance using appropriate metrics such as accuracy, precision, recall, or mean squared error. This allows us to assess how well the models generalize to unseen data and make informed decisions about their effectiveness.
Predictions and Inference: Using the trained models, we can make predictions or perform inference on new or unseen data points. This can be valuable for tasks such as making predictions about future events, identifying anomalies, or generating recommendations based on user preferences.
Visualization and Reporting: To communicate the findings and results effectively, we can create visualizations, reports, or interactive dashboards summarizing the analysis. This helps stakeholders understand the insights and make informed decisions based on the data.
By leveraging the preprocessed dataset, we can streamline our analysis and focus on extracting meaningful insights or solving specific problems without the need for extensive data cleaning and preprocessing steps.
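To make the workflow above concrete, here is a minimal scikit-learn sketch covering the split/train/evaluate/predict steps. The CSV name, feature columns, and target column are placeholders for an already preprocessed dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Placeholder file and column names for a preprocessed dataset.
df = pd.read_csv("preprocessed.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Split, train, evaluate, then predict on the held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
```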