Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches of [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, uses a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including a mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After several days, two attack scenarios are launched against the network. Note that AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.
The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.
Each dataset contains traces of a specific attack scenario:
The log data collected from the servers includes
Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (the line number in the corresponding log file), labels (the list of labels assigned to that log line), and rules (the names of the labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules if some labels are unclear.
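For illustration, the label files can be read line by line in Python; this is a minimal sketch assuming one JSON object per line, and the file path shown is hypothetical:

    import json

    # Hypothetical label file; label files mirror the names of their
    # corresponding log files in the gather directory.
    label_file = "labels/mail/logs/apache2/access.log"

    with open(label_file) as f:
        for raw in f:
            entry = json.loads(raw)   # one JSON object per line (assumption)
            print(entry["line"],      # line number in the corresponding log file
                  entry["labels"],    # labels assigned to that log line
                  entry["rules"])     # labeling rules matching that log line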
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publications:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The results presented in Fig 6 are averaged over all percentages of labeled data. Best performance for each dataset is in bold.
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.
Citation
If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Data curators
Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons
Contact
You are welcome to contact Eduardo Fonseca at eduardo.fonseca@upf.edu should you have any questions.
ABOUT FSDKaggle2019
Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF) and from the Sound Understanding team at Google AI Perception. The competition was intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.
FSDKaggle2019 employs audio clips from the following sources:
Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.
What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.
Ground Truth Labels
The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).
The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].
The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].
Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:
curated train set: correct (but potentially incomplete) labels
noisy train set: noisy labels
test set: correct and complete labels
Further details can be found below in the sections for each set.
Format
All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
DATA SPLIT
FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.
Curated train set
The curated train set consists of manually-labeled data from FSD.
Number of clips/class: 75, except in a few cases (where there are fewer)
Total number of clips: 4970
Avg number of labels/clip: 1.2
Total duration: 10.5 hours
The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).
Noisy train set
The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].
Number of clips/class: 300
Total number of clips: 19,815
Avg number of labels/clip: 1.2
Total duration: ~80 hours
The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.
Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
Test set
The test set is used for system evaluation and consists of manually-labeled data from FSD.
Number of clips/class: between 50 and 150
Total number of clips: 4481
Avg number of labels/clip: 1.4
Total duration: 12.9 hours
The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete with respect to the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content that falls outside the vocabulary.
During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
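For reference, here is a short sketch of how the leaderboard flag could be used to reproduce the competition split; the column name "usage" and its values are assumptions that should be checked against the actual CSV:

    import pandas as pd

    # Split the full test set into the public/private leaderboard subsets.
    # Column name "usage" and values "Public"/"Private" are assumptions.
    df = pd.read_csv("test_post_competition.csv")
    public = df[df["usage"] == "Public"]
    private = df[df["usage"] == "Private"]
    print(len(public), "public clips,", len(private), "private clips")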
Acoustic mismatch
As mentioned before, FSDKaggle2019 uses audio clips from two sources:
FSD: curated train set and test set, and
YFCC: noisy train set.
While the sources of audio (Freesound and Flickr) are collaboratively contributed and quite diverse in themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture that this mismatch arises from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high-quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.
This mismatch can have an impact on the evaluation, considering that most of the training data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.
LICENSE
All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate reuse by third parties, we include a mapping from the audio clips to their corresponding licenses.
Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.
Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.
In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.
FILES & DOWNLOAD
FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2019.audio_train_curated/ Audio clips in the curated train set
│
└───FSDKaggle2019.audio_train_noisy/ Audio clips in the noisy train set
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
With its fast growth, synthetic biology gives us the capability to produce products of high commercial value in a resource- and energy-efficient manner. Comprehensive knowledge of the protein regulatory network of a bacterial host chassis, e.g., the actual amounts of the given proteins, is the key to building cell factories for hyperproduction of certain targets. Many excellent methods for absolute quantitative proteomics have been introduced. However, in most cases, a set of reference peptides with isotopic labeling (e.g., SIL, AQUA, QconCAT) or a set of reference proteins (e.g., the commercial UPS2 kit) needs to be prepared. The high cost hinders the use of these methods in large-sample studies. In this work, we propose a novel metabolic labeling-based absolute quantification approach (termed nMAQ). The reference Corynebacterium glutamicum strain is metabolically labeled with 15N, and a set of endogenous anchor proteins of the reference proteome is quantified using chemically synthesized light (14N) peptides. The prequantified reference proteome is then utilized as an internal standard (IS) and spiked into the target (14N) samples. SWATH-MS analysis is performed to obtain the absolute expression levels of the proteins from the target cells. The cost of nMAQ is estimated to be less than 10 dollars per sample. We have benchmarked the quantitative performance of the novel method. We believe this method will help with the deep understanding of the intrinsic regulatory mechanisms of C. glutamicum during bioengineering and will promote the process of building cell factories for synthetic biology.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Introduced Parasitoid-Host Records in New Zealand, as recorded from specimen labels in the NZAC. Resource: Introduced Parasitoid-Host Records in NZ (CSV).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.
The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".
The data directory is structured as follows. Each directory mail.
Setup details of the web servers:
OS: Debian Stretch 9.11.6
Services:
Apache2
PHP7
Exim 4.89
Horde 5.2.22
OkayCMS 2.3.4
Suricata
ClamAV
MariaDB
Setup details of user machines:
OS: Ubuntu Bionic
Services:
Chromium
Firefox
User host machines are assigned to web servers in the following way:
mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}
mail.spiral.com is accessed by users from host machines user-{3, 5, 8}
mail.insect.com is accessed by users from host machines user-{4, 9}
mail.onion.com is accessed by users from host machines user-{7, 10}
The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):
Attack 1: multi-step attack with sequential execution of the following attacks:
nmap scan
nikto scan
smtp-user-enum tool for account enumeration
hydra brute force login
webshell upload through Horde exploit (CVE-2019-9858)
privilege escalation through Exim exploit (CVE-2019-10149)
Attack 2: webshell injection through malicious cookie (CVE-2019-16885)
Attacks are launched from the following user host machines. In each of the corresponding directories user-
user-6 attacks mail.cup.com
user-5 attacks mail.spiral.com
user-4 attacks mail.insect.com
user-7 attacks mail.onion.com
The log data collected from the web servers includes
Apache access and error logs
syscall logs collected with the Linux audit daemon
suricata logs
exim logs
auth logs
daemon logs
mail logs
syslogs
user logs
Note that due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must first be unzipped.
Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
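As a minimal sketch of how the data could be accessed in Python (the paths are hypothetical; the two comma-separated labels per line follow the description above):

    import zipfile

    # The audit logs ship zipped due to their size and must be extracted first.
    with zipfile.ZipFile("data/mail.cup.com/audit/audit.zip") as zf:  # hypothetical name
        zf.extractall("data/mail.cup.com/audit/")

    # Each label file holds two labels per log line, separated by a comma
    # (assuming exactly two labels per line): the first based on occurrence
    # time, the second based on similarity and ordering.
    with open("labels/mail.cup.com/apache2/access.log") as f:  # hypothetical path
        for line in f:
            time_label, order_label = line.rstrip("\n").split(",")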
Version history and related data sets:
AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Plasmids play an essential role in horizontal gene transfer, aiding their host bacteria in acquiring beneficial traits like antibiotic and metal resistance. Some plasmids can transfer, replicate, or persist in multiple organisms. Identifying the relatively complete host range of these plasmids provides insights into how plasmids promote bacterial evolution. To achieve this, we can apply multi-label learning models for plasmid host range prediction. However, no databases provide the detailed and complete host labels of these broad-host-range (BHR) plasmids. Without adequate well-annotated training samples, learning models can fail to extract discriminative feature representations for plasmid host prediction.
To address this problem, we propose a self-correction multi-label learning model called MOSTPLAS. We design a pseudo label learning algorithm and a self-correction asymmetric loss to facilitate the training of multi-label learning model with samples containing some unknown missing labels. We conducted a series of experiments on NCBI RefSeq plasmid database, plasmids with experimentally determined host labels, Hi-C dataset and DoriC dataset. The benchmark results against other plasmid host range prediction tools demonstrated that MOSTPLAS recognized more host labels while keeping a high precision.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Protein-Protein, Genetic, and Chemical Interactions for Cho NH (2022):OpenCell: Endogenous tagging for the cartography of human cellular organization. curated by BioGRID (https://thebiogrid.org); ABSTRACT: Elucidating the wiring diagram of the human cell is a central goal of the postgenomic era. We combined genome engineering, confocal live-cell imaging, mass spectrometry, and data science to systematically map the localization and interactions of human proteins. Our approach provides a data-driven description of the molecular and spatial networks that organize the proteome. Unsupervised clustering of these networks delineates functional communities that facilitate biological discovery. We found that remarkably precise functional information can be derived from protein localization patterns, which often contain enough information to identify molecular interactions, and that RNA binding proteins form a specific subgroup defined by unique interaction and localization properties. Paired with a fully interactive website (opencell.czbiohub.org), our work constitutes a resource for the quantitative cartography of human cellular organization.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results. This dataset was introduced in the paper "Learning Features of Music from Scratch." [1]
This repository consists of 3 top-level files:
A PyTorch interface for accessing the MusicNet dataset is available on GitHub. For an audio/visual introduction and summary of this dataset, see the MusicNet inspector, created by Jong Wook Kim. The audio recordings in MusicNet consist of Creative Commons licensed and Public Domain performances, sourced from the Isabella Stewart Gardner Museum, the European Archive Foundation, and Musopen. The provenance of specific recordings and midis are described in the metadata file.
[1] Learning Features of Music from Scratch. John Thickstun, Zaid Harchaoui, and Sham M. Kakade. In International Conference on Learning Representations (ICLR), 2017. ArXiv Report.
@inproceedings{thickstun2017learning,
title={Learning Features of Music from Scratch},
author = {John Thickstun and Zaid Harchaoui and Sham M. Kakade},
year={2017},
booktitle = {International Conference on Learning Representations (ICLR)}
}
[2] Invariances and Data Augmentation for Supervised Music Transcription. John Thickstun, Zaid Harchaoui, Dean P. Foster, and Sham M. Kakade. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. ArXiv Report.
@inproceedings{thickstun2018invariances,
title={Invariances and Data Augmentation for Supervised Music Transcription},
author = {John Thickstun and Zaid Harchaoui and Dean P. Foster and Sham M. Kakade},
year={2018},
booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}
}
CC0 1.0 Universal (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
DENTEX CHALLENGE
We present the Dental Enumeration and Diagnosis on Panoramic X-rays Challenge (DENTEX), organized in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. The primary objective of this challenge is to develop algorithms that can accurately detect abnormal teeth with dental enumeration and associated diagnosis. This not only aids in accurate treatment planning but also helps practitioners carry out procedures with a low margin of error.
The challenge provides three types of hierarchically annotated data and additional unlabeled X-rays for optional pre-training. The annotation of the data is structured using the Fédération Dentaire Internationale (FDI) system. The first set of data is partially labeled because it only includes quadrant information. The second set of data is also partially labeled but contains additional enumeration information along with the quadrant. The third set is fully labeled because it includes all quadrant-enumeration-diagnosis information for each abnormal tooth, and all participant algorithms will be benchmarked on this third set.
DENTEX aims to provide insights into the effectiveness of AI in dental radiology analysis and its potential to improve dental practice by comparing frameworks that simultaneously point out abnormal teeth with dental enumeration and associated diagnosis on panoramic dental X-rays.
DATA
The DENTEX dataset comprises panoramic dental X-rays obtained from three different institutions using standard clinical conditions but varying equipment and imaging protocols, resulting in diverse image quality reflecting heterogeneous clinical practice. The dataset includes X-rays from patients aged 12 and above, randomly selected from the hospital's database to ensure patient privacy and confidentiality.
To enable effective use of the FDI system, the dataset is hierarchically organized into three types of data:
(a) 693 X-rays labeled for quadrant detection and quadrant classes only,
(b) 634 X-rays labeled for tooth detection with quadrant and tooth enumeration classes,
(c) 1005 X-rays fully labeled for abnormal tooth detection with quadrant, tooth enumeration, and diagnosis classes.
The diagnosis class includes four specific categories: caries, deep caries, periapical lesions, and impacted teeth. An additional 1571 unlabeled X-rays are provided for pre-training.
Data Split for Evaluation and Training
The DENTEX 2023 dataset comprises three types of data: (a) partially annotated quadrant data, (b) partially annotated quadrant-enumeration data, and (c) fully annotated quadrant-enumeration-diagnosis data. The first two types of data are intended for training and development purposes, while the third type is used for training and evaluations.
To comply with standard machine learning practices, the fully annotated third dataset, consisting of 1005 panoramic X-rays, is partitioned into training, validation, and testing subsets, comprising 705, 50, and 250 images, respectively. Ground truth labels are provided only for the training data, while the validation data is provided without associated ground truth, and the testing data is kept hidden from participants.
Annotation Protocol
DENTEX provides three hierarchically annotated datasets that facilitate various dental detection tasks: (1) quadrant-only for quadrant detection, (2) quadrant-enumeration for tooth detection, and (3) quadrant-enumeration-diagnosis for abnormal tooth detection. Although it may seem redundant to provide a quadrant detection dataset, it is crucial for utilizing the FDI Numbering System. The FDI system is a globally used system that assigns each quadrant of the mouth a number from 1 through 4: the top right is 1, the top left is 2, the bottom left is 3, and the bottom right is 4. Within each quadrant, the eight teeth are numbered 1 through 8, starting with 1 at the front middle tooth and rising toward the back of the mouth. So, for example, the back tooth on the lower right side would be 48 according to FDI notation, which means quadrant 4, tooth 8. Therefore, the quadrant segmentation dataset can significantly simplify the dental enumeration task, even though evaluations will be made only on the fully annotated third dataset.
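As a small illustration of the numbering scheme (a sketch for clarity, not part of the DENTEX tooling), an FDI tooth code is simply the quadrant number times ten plus the tooth position:

    def fdi_code(quadrant: int, tooth: int) -> int:
        # Quadrants 1-4 (upper right, upper left, lower left, lower right);
        # tooth positions 1-8, from the central incisor toward the back.
        assert 1 <= quadrant <= 4 and 1 <= tooth <= 8
        return 10 * quadrant + tooth

    print(fdi_code(4, 8))  # -> 48: quadrant 4, tooth 8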
Description is from: https://zenodo.org/record/7812323#.ZDQE1uxBwUG
Grand Challenge: https://dentex.grand-challenge.org/
Cite:
[1] Ibrahim Ethem Hamamci, Sezgin Er, Enis Simsar, Anjany Sekuboyina, Mustafa Gundogar, Bernd Stadlinger, Albert Mehl, Bjoern Menze, Diffusion-Based Hierarchical Multi-Label Object Detection to Analyze Panoramic Dental X-rays, 2023. Pre-print: https://arxiv.org/abs/2303.06500
[2] Hamamci, I., Er, S., Simsar, E., Yuksel, A., Gultekin, S., Ozdemir, S., Yang, K., Li, H., Pati, S., Stadlinger, B., & others (2023). DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays. Pre-print: https://arxiv.org/abs/2305.19112
The Mathematics Subject Classification organizes Publications, Software, and Research Data into a hierarchical classification scheme maintained by MathSciNet (mr) and zbMATH Open (zbmath). Both organizations, mr and zbmath, agree on this classification scheme and use its labels to organize publications from mathematics and related fields. However, the classification of individual papers is done independently of each other. This dataset contains references to papers that occur in both collections (mr and zbmath), together with the respective classification labels.
The dataset is provided in the following form:
zbmath-id, zbmath-msc, mr-id, mr-msc
5635019, 55-06 57-06 55R70 57Q45 00B25, MR2556072, 54-06 55-06
5641347, 68R10 05C85, MR2588354, 68W25 05C70 05C85
5641348, 68R10, MR2588355, 68Q25 05C65 05C70 05C85 68Q15
5641349, 68Q05, MR2588356, 68Q05
5641350, 68M20 68W25 68T42, MR2588357, 68T42 68W25
5641351, 68M10, MR2588358, 68Q85 68Q10
5641352, 68R15, MR2588359, 68R15 05A05 05C78
5641353, 68T30 68R10, MR2588360, 05C62 68R10
5641354, 68Q30, MR2588361, 68Q30 60A99 60J20
5641355, 68W27 68M10, MR2588362, 68M10 05C82 05C85 68W27 68W40
5641356, 68W05 68T05, MR2588363, 68T05 62H30
5641357, 68W40, MR2588364, 05A15 68R05
5641358, 91A10 91A05 68Q17 91A06, MR2588365, 91A05 68W25
5641359, 91A10 68T42 68M10, MR2588366, 91B26
5641360, 68W40 68P05 68P10, MR2588367, 68W40 68P05 68Q87
5641361, 68P25 94A62, MR2588368, 94A62 11T71 68P25
5641362, 68Q45, MR2588369, 68Q45 05A05
5641363, 90C35, MR2588370, 68R10 05C85 68W25 90C35
5641364, 54H20, MR2588371, 37E10 37B10 37E45
The meaning of the fields is:
zbmath-id Unique identifier from zbMATH Open. Prefix with https://zbmath.org/ to visit additional information on the article. For example, 5635019 is associated with https://zbmath.org/5635019
zbmath-msc space-separated list of Mathematics Subject Classification labels created by zbMATH Open staff. A description of the label can be retrieved by prefixing https://zbmath.org/classification/?q=. For example, 55-06 is associated with https://zbmath.org/classification/?q=55-06
mr-id Unique identifier from Mathematical Reviews (MathSciNet). Prefix with https://mathscinet.ams.org/mathscinet-getitem?mr= to retrieve additional information on the publication. For example, MR2556072 is associated with https://mathscinet.ams.org/mathscinet-getitem?mr=MR2556072.
mr-msc space-separated list of Mathematics Subject Classification labels created by MathSciNet staff.
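A minimal parsing sketch for these records (the file name is hypothetical):

    import csv

    with open("zbmath_mr_msc.csv") as f:   # hypothetical file name
        for row in csv.reader(f):
            zbmath_id, zbmath_msc, mr_id, mr_msc = (field.strip() for field in row)
            zb_labels = zbmath_msc.split()  # space-separated MSC labels (zbMATH Open)
            mr_labels = mr_msc.split()      # space-separated MSC labels (MathSciNet)
            agreement = set(zb_labels) & set(mr_labels)
            print(zbmath_id, mr_id, sorted(agreement))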
The dataset was retrieved in 2016 by querying MathSciNet and zbMATH Open. Therefore, the classifications are based on the MSC 2010 version.
This dataset was used in
Schubotz M., Scharpf P., Teschke O., Kühnemund A., Breitinger C., Gipp B. (2020) AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels. In: Benzmüller C., Miller B. (eds) Intelligent Computer Mathematics. CICM 2020. Lecture Notes in Computer Science, vol 12236. Springer, Cham. https://doi.org/10.1007/978-3-030-53518-6_15
and is now released to the public as an addendum to the paper.
Description
This dataset offers an extensive collection of images and corresponding labels representing a wide array of plant diseases. Carefully curated from publicly available sources, it serves as a valuable resource for developing and evaluating machine learning models, particularly in the realm of image classification and plant disease detection.
Dataset Composition:
• Images: The dataset comprises high-quality images organized by plant species and disease type, providing a diverse range of visual data. It includes not only images of diseased plants but also healthy plant samples, ensuring a balanced dataset for training purposes.
• Categories: The images are categorized into various classes based on the plant species and the specific disease affecting them. This categorization allows for more precise model training and testing.
• Cleaned Data: All images have been meticulously cleaned and verified to remove any corrupt or unusable files, ensuring the dataset's reliability and usability.
• Labeling: Each image is labeled with detailed information about the plant species and the type of disease, making it easier to use the dataset for supervised learning tasks.
Applications:
This dataset is ideal for a variety of machine learning applications, including:
• Disease Detection: Training models to identify and classify various plant diseases, which can be pivotal for early detection and prevention.
• Image Classification: Developing and testing models to accurately classify images based on plant species and health status.
• Agricultural Research: Supporting research in precision agriculture by providing data that can lead to better understanding and management of plant health.
Dataset Structure:
• Organized Folders: The images are structured into folders corresponding to each plant species and disease type, facilitating easy access and manipulation of data.
• Healthy Samples Included: To ensure balanced datasets, healthy plant samples are included alongside diseased ones, enabling models to learn to differentiate between healthy and diseased plants.
• Versatile Use: The dataset's structure and labeling make it suitable for a wide range of research and commercial applications in agriculture and plant biology.
Conclusion:
The Plant Disease Dataset is a comprehensive and well-organized collection, ideal for anyone working on machine learning models in the field of agriculture. Whether you're developing new algorithms for disease detection or enhancing existing models, this dataset provides the rich and diverse data necessary to achieve accurate and reliable results.
This dataset is sourced from Kaggle.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the simulation data of the combinatorial metamaterial as used for the paper 'Machine Learning of Implicit Combinatorial Rules in Mechanical Metamaterials', as published in Physical Review Letters.
In this paper, the data is used to classify each \(k \times k\) unit cell design into one of two classes (C or I) based on the scaling (linear or constant) of the number of zero modes \(M_k(n)\) for metamaterials consisting of an \(n\times n\) tiling of the corresponding unit cell. Additionally, a random walk through the design space starting from class C unit cells was performed to characterize the boundary between class C and I in design space. A more detailed description of the contents of the dataset follows below.
Modescaling_raw_data.zip
This file contains uniformly sampled unit cell designs for metamaterial M2 and \(M_k(n)\) for \(1\leq n\leq 4\), which were used to classify the unit cell designs for the data set. There is a small subset of designs for \(k=\{3, 4, 5\}\) that do not neatly fall into the class C and I classification, and instead require additional simulation for \(4 \leq n \leq 6\) before either saturating to a constant number of zero modes (class I) or increasing linearly (class C). This file contains the simulation data for unit cells of size \(3 \leq k \leq 8\). The data is organized as follows.
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4.npy", and contain a [Nsim, 1+k*k+4] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Note: the unit cell design uses the numbers \(\{0, 1, 2, 3\}\) to refer to each building block orientation. The building block orientations can be characterized through the orientation of the missing diagonal bar (see Fig. 2 in the paper), which can be Left Up (LU), Left Down (LD), Right Up (RU), or Right Down (RD). The numbers correspond to the building block orientation \(\{0, 1, 2, 3\} = \{\mathrm{LU, RU, RD, LD}\}\).
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 6\) for unit cells that cannot be classified as class C or I for \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4_classX_extend.npy", and contain a [Nsim, 1+k*k+6] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Simulation data for \(6 \leq k \leq 8\) unit cells are stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. Note that the number of modes is now calculated for \(n_x \times n_y\) metamaterials, where we calculate \((n_x, n_y) = \{(1,1), (2, 2), (3, 2), (4,2), (2, 3), (2, 4)\}\) rather than \(n_x=n_y=n\) to save computation time. These files are named "data_new_rrQR_i_n_Mx_My_n4_kxk(_extended).npy", and contain a [Nsim, 1+k*k+8] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Simulation data of metamaterial M1 for \(k_x \times k_y\) metamaterials are stored in compressed numpy array format (.npz) and can be loaded in Python with the Numpy package using the numpy.load command. These files are named "smiley_cube_x_y_\(k_x\)x\(k_y\).npz", which contain all possible metamaterial designs, and "smiley_cube_uniform_sample_x_y_\(k_x\)x\(k_y\).npz", which contain uniformly sampled metamaterial designs. The configurations are accessed with the keyword argument 'configs'. The classification is accessed with the keyword argument 'compatible'. The configurations array is of shape [Nsim, \(k_x\), \(k_y\)], the classification array is of shape [Nsim]. The building blocks in the configuration are denoted by 0 or 1, which correspond to the red/green and white/dashed building blocks respectively. Classification is 0 or 1, which corresponds to I and C respectively.
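A minimal loading sketch for these files (the index i in the file names is a placeholder; treating column 0 as the label number, as in the classification results below, is an assumption):

    import numpy as np

    k = 5
    # [Nsim, 1 + k*k + 4] array: label number (assumed), k*k design, M_k(1..4)
    data = np.load(f"data_new_rrQR_i_n_M_{k}x{k}_fixn4.npy")
    designs = data[:, 1:1 + k * k].astype(int).reshape(-1, k, k)  # entries in {0,1,2,3}
    modes = data[:, 1 + k * k:]                                   # M_k(n) for n = 1..4

    # Metamaterial M1 (.npz): configurations and class labels
    m1 = np.load("smiley_cube_x_y_5x5.npz")   # hypothetical k_x = k_y = 5 file
    configs = m1["configs"]                   # shape [Nsim, k_x, k_y], entries in {0, 1}
    compatible = m1["compatible"]             # 0 = class I, 1 = class C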
Modescaling_classification_results.zip
This file contains the classification, slope, and offset of the scaling of the number of zero modes \(M_k(n)\) for the unit cells of metamaterial M2 in Modescaling_raw_data.zip. The data is organized as follows.
The results for \(3 \leq k \leq 5\) based on the \(1 \leq n \leq 4\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n \leq 4\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
The results for \(3 \leq k \leq 5\) based on the extended \(1 \leq n \leq 6\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4_classC_extend.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n \leq 6\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
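These text files can be loaded directly with NumPy; a short sketch (the i in the file name is a placeholder):

    import numpy as np

    results = np.loadtxt(
        "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_5x5_fixn4.txt",
        delimiter=",",
    )
    cls = results[:, 1].astype(int)   # col 1: 0 = class I, 1 = class C, 2 = class X
    slope = results[:, 2]             # col 2: slope from n >= 2 onward
    offset = results[:, 3]            # col 3: M_k(2) - 2*slope
    print((cls == 1).sum(), "unit cells classified as class C")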
The results for \(6 \leq k \leq 8\) based on the \(1 \leq n \leq 4\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scenx_Sceny_slopex_slopey_offsetx_offsety_M1k_kxk(_extended).txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class_x based on \(M_k(n_x, 2)\), where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n_x \leq 4\))
col 2: the class_y based on \(M_k(2, n_y)\), where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n_y \leq 4\))
col 3: slope_x from \(n_x \geq 2\) onward (undefined for class X)
col 4: slope_y from \(n_y \geq 2\) onward (undefined for class X)
col 5: the offset_x is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_x}\)
col 6: the offset_y is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_y}\)
col 7: (M_k(1,
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository provides easy access to open-source soundscape datasets of bird sounds, specifically optimized for few-shot classification.
soundscapes.zip
contains evaluation soundscape datasets from the BIRB benchmark (https://arxiv.org/abs/2312.07439), downsampled to 16 kHz, preprocessed using CNN14 from PANNs (https://arxiv.org/abs/1912.10211) to select the 6-second window with the highest bird activation, and converted to PyTorch (.pt) format to facilitate evaluating deep neural networks.
These preprocessed datasets are employed in the work "Domain-Invariant Representation Learning of Bird Sounds" (https://arxiv.org/abs/2409.08589), which evaluates the few-shot learning capabilities of deep learning models trained on focal recordings (e.g., Xeno-Canto) and tested on soundscape recordings.
pow.pt: The validation dataset consists of 16,047 examples across 43 classes and is organized as a dictionary with 'data' and 'label' keys representing bird sounds and their corresponding labels. Storing the entire validation dataset in a single tensor enables rapid loading and efficient processing, significantly accelerating the validation process. Classes with only one example are removed, as they are insufficient for one-shot classification tasks. Source: https://zenodo.org/records/4656848#.Y7ijhOxudhE
Each test dataset is structured with multiple subfolders, each labeled with an eBird species code to represent data for a specific bird species:
ssw/: Contains 50,760 examples across 96 classes. Source: https://zenodo.org/records/7079380#.Y7ijHOxudhE
coffee_farms/: Contains 6,952 examples across 89 classes. Source: https://zenodo.org/records/7525349#.ZB8z_-xudhE
hawaii/: Contains 59,583 examples across 27 classes. Source: https://zenodo.org/records/7078499#.Y7ijPuxudhE
high_sierras/: Contains 10,296 examples across 19 classes. Source: https://zenodo.org/records/7525805#.ZB8zsexudhE
sierras_kahl/: Contains 20,147 examples across 56 classes. Source: https://zenodo.org/records/7050014#.Y7ijWexudhE
peru/: Contains 14,768 examples across 132 classes. Source: https://zenodo.org/records/7079124#.Y7iis-xudhE
Code and detailed instructions, including data loading, model implementation, and few-shot evaluation, can be found at: https://github.com/ilyassmoummad/ProtoCLR
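A minimal loading sketch for the validation tensor (assuming the dictionary layout with 'data' and 'label' keys described above):

    import torch

    # pow.pt stores the whole validation set as a dictionary with
    # 'data' (bird sound windows) and 'label' (class indices) keys.
    val = torch.load("pow.pt", map_location="cpu")
    data, labels = val["data"], val["label"]
    print(data.shape, len(labels))    # 16,047 examples across 43 classes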
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
This dataset is linked to the publication "Recursive classification of satellite imaging time-series: An application to land cover mapping". In this paper, we introduce the recursive Bayesian classifier (RBC), which converts any instantaneous classifier into a robust online method through a probabilistic framework that is resilient to non-informative image variations. To reproduce the results presented in the paper, the RBC-SatImg folder and the code in the GitHub repository RBC-SatImg are required.
The RBC-SatImg folder contains:
Sentinel-2 time-series imagery from three key regions: Oroville Dam (CA, USA) and Charles River (Boston, MA, USA) for water mapping, and the Amazon Rainforest (Brazil) for deforestation detection.
The RBC-WatData dataset with manually generated water mapping labels for the Oroville Dam and Charles River regions. This dataset is well-suited for multitemporal land cover and water mapping research, as it accounts for the dynamic evolution of true class labels over time.
Pickle files with output to reproduce the results in the paper, including:
Instantaneous classification results for GMM, LR, SIC, WN, DWM
Posterior results obtained with the RBC framework
The Sentinel-2 images and forest labels used in the deforestation detection experiment for the Amazon Rainforest have been obtained from the MultiEarth Challenge dataset.
Folder Structure
The following paths can be changed in the configuration file from the GitHub repository as desired. The RBC-SatImg folder is organized as follows:
./log/ (EMPTY): Default path for storing log files generated during code execution.
./evaluation_results/: Contains the results to reproduce the findings in the paper, including two sub-folders:
  ./classification/: For each test site, four sub-folders are included:
    ./accuracy/: Each sub-folder corresponding to an experimental configuration contains pickle files with balanced classification accuracy results and information about the models. The default configuration used in the paper is "conf_00".
    ./figures/: Includes result figures from the manuscript in SVG format.
    ./likelihoods/: Contains pickle files with instantaneous classification results.
    ./posteriors/: Contains pickle files with posterior results generated by the RBC framework.
  ./sensitivity_analysis/: Contains sensitivity analysis results, organized by test site and epsilon value.
./Sentinel2_data/: Contains the Sentinel-2 images used for training and evaluation, organized by scenario (Oroville Dam, Charles River, Amazon Rainforest). Selected images have been filtered and processed as explained in the manuscript. The Amazon Rainforest images and labels have been obtained from the MultiEarth dataset, and consequently the labels are included in this folder instead of the RBC-WatData folder.
./RBC-WatData/: Contains the water labels that we manually generated with the LabelStudio tool.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Proteomics has been applied to study intracellular bacteria and phagocytic vacuoles in different host cell lines, especially macrophages (Mφs). For mycobacterial phagosomes, few studies have identified over several hundred proteins for systems assessment of the phagosome maturation and antigen presentation pathways. More importantly, there has been a scarcity of publications on proteomic characterization of mycobacterial phagosomes in dendritic cells (DCs). In this work, we report a global proteomic analysis of Mφ and DC phagosomes infected with a virulent, an attenuated, and a vaccine strain of mycobacteria. We used label-free quantitative proteomics and bioinformatics tools to decipher the regulation of phagosome maturation and antigen presentation pathways in Mφs and DCs. We found that the phagosomal antigen presentation pathways are repressed more in DCs than in Mφs. The results suggest that virulent mycobacteria might co-opt the host immune system to stimulate granuloma formation for persistence while minimizing the antimicrobial immune response to enhance mycobacterial survival. The studies on phagosomal proteomes have also shown promise in discovering new antigen presentation mechanisms that a professional antigen-presenting cell might use to overcome the mycobacterial blockade of conventional antigen presentation pathways.