Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement these data sets with the indices of the data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. Specifically, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
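Replicating one of our samples then amounts to indexing the extracted data with one row of an index file, for example (a minimal pandas sketch; the extracted file name and the index-file layout, one comma-separated list of row indices per line, are assumptions on our part):

import pandas as pd

data = pd.read_csv("extracted.csv")  # placeholder for one of the extracted data CSVs
with open("app_val_indices.csv") as f:
    first_sample_idx = [int(i) for i in f.readline().split(",")]
sample = data.iloc[first_sample_idx]  # exactly the data items of the first sample
print(sample["class_label"].value_counts(normalize=True))  # its label distribution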
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Predicting the response to drugs before initiating therapy based on transcriptome data is a major challenge, and collecting labeled drug-response data costs time and resources. Available methods often predict poorly and fail to identify robust biomarkers due to the curse of dimensionality: high dimensionality and low sample size. This necessitates predictive models that effectively predict drug response from limited labeled data while remaining interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework to predict drug response from transcriptome data using few labeled samples and additional unlabeled data. HiRAND integrates information from the gene graph and the sample graph via graph convolutional networks (GCNs). The innovation of our model is to leverage a data augmentation strategy to address the dilemma of limited labeled data, and to use consistency regularization to optimize the prediction consistency of unlabeled data across different data augmentations. The results show that HiRAND achieves better performance than competitive methods in various prediction scenarios, including both simulation data and multiple drug response data sets. The prediction performance of HiRAND on the drug vorinostat was the best across all 62 drugs. In addition, we interpreted HiRAND to identify the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. HiRAND can serve as an efficient framework for improving drug response prediction from few labeled data.
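The consistency-regularization idea can be illustrated independently of the full HiRAND architecture. Below is a minimal NumPy sketch (our illustration, not the authors' code; model, augment, and the weight lam are hypothetical placeholders): the loss combines cross-entropy on the few labeled samples with a penalty on disagreement between predictions for two augmentations of the same unlabeled samples.

import numpy as np

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, augment, lam=1.0):
    # Supervised part: cross-entropy on the few labeled samples.
    p_lab = model(x_lab)  # predicted class probabilities, shape (n, classes)
    ce = -np.mean(np.log(p_lab[np.arange(len(y_lab)), y_lab] + 1e-12))
    # Unsupervised part: predictions on two random augmentations of the
    # same unlabeled samples should agree (consistency regularization).
    p_a, p_b = model(augment(x_unlab)), model(augment(x_unlab))
    consistency = np.mean((p_a - p_b) ** 2)
    return ce + lam * consistency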
MULTI-TEMPORAL REMOTE SENSING IMAGE CLASSIFICATION - A MULTI-VIEW APPROACH
VARUN CHANDOLA AND RANGA RAJU VATSAVAI
Abstract. Multispectral remote sensing images have been widely used for automated land use and land cover classification tasks. Often thematic classification is done using a single-date image; however, in many instances a single-date image is not informative enough to distinguish between different land cover types. In this paper we show how one can use multiple images, collected at different times of year (for example, during the crop growing season), to learn a better classifier. We propose two approaches, an ensemble of classifiers and a co-training based approach, and show how both of these methods outperform the straightforward stacked-vector approach often used in multi-temporal image classification. Additionally, the co-training based method addresses the challenge of limited labeled training data in supervised classification, as this classification scheme utilizes a large number of unlabeled samples (which come for free) in conjunction with a small set of labeled training data.
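The co-training idea can be sketched as follows (our illustration, not the authors' implementation; it assumes two feature views X1 and X2 of the same pixels, e.g. from two acquisition dates, and uses scikit-learn's GaussianNB as a stand-in base learner):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, U1, U2, rounds=5, k=10):
    # X1, X2: two feature views of the labeled pixels; U1, U2: the same
    # two views of the unlabeled pixels.
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        if len(U1) < k:
            break
        # View 1 pseudo-labels the k unlabeled samples it is most confident
        # about; both views of those samples join the labeled pool.
        # (A full co-training loop would alternate the two views' roles.)
        proba = clf1.predict_proba(U1)
        top = np.argsort(proba.max(axis=1))[-k:]
        pseudo = clf1.classes_[proba[top].argmax(axis=1)]
        X1 = np.vstack([X1, U1[top]])
        X2 = np.vstack([X2, U2[top]])
        y = np.concatenate([y, pseudo])
        keep = np.setdiff1d(np.arange(len(U1)), top)
        U1, U2 = U1[keep], U2[keep]
    return clf1, clf2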
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf
In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:
If you find this dataset useful for your research, please cite:
@inproceedings{NEURIPS2022_ee604e1b,
 author = {Ji, Yuanfeng and Bai, Haotian and GE, Chongjian and Yang, Jie and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and Luo, Ping},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
 pages = {36722--36732},
 publisher = {Curran Associates, Inc.},
 title = {AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
 url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/ee604e1bedbd069d9fc9328b7b9584be-Paper-Datasets_and_Benchmarks.pdf},
 volume = {35},
 year = {2022}
}
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", which we have published at ECML-PKDD 2022.
The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to two protocols that are suited for quantification research. The goal of quantification is not to predict the star rating of each individual instance, but the distribution of ratings in sets of textual reviews. More generally speaking, quantification aims at estimating the distribution of labels in unlabeled samples of data.
The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task.
This data set comprises two representations of the McAuley data. The first representation consists of TF-IDF features. The second representation is a RoBERTa embedding. This second representation is dense, while the first is sparse. In our experience, logistic regression classifiers work well with both representations. RoBERTa embeddings yield more accurate predictors than the TF-IDF features.
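For instance, fitting such a classifier takes only a few lines (a scikit-learn sketch; the file name and the presence of a "class_label" column are assumptions on our part):

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("roberta_features.csv")           # hypothetical file name
X, y = df.drop(columns="class_label"), df["class_label"]
clf = LogisticRegression(max_iter=1000).fit(X, y)  # works for both representations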
You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is already public, and we provide all of our extraction scripts.
Extraction scripts and experiments: https://github.com/mirkobunse/ecml22
Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/
DensePASS - a novel densely annotated dataset for panoramic segmentation under cross-domain conditions, specifically built to study the Pinhole-to-Panoramic transfer and accompanied by pinhole-camera training examples obtained from Cityscapes. DensePASS covers both labelled and unlabelled 360-degree images, with the labelled data comprising 19 classes which explicitly fit the categories available in the source-domain (i.e. pinhole) data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Detailed information about the test set and evaluation results.
Studying the flow of chemical moieties through the complex set of metabolic reactions that happen in the cell is essential to understanding the alterations in homeostasis that occur in disease. Recently, LC/MS-based untargeted metabolomics and isotopically labeled metabolites have been used to facilitate the unbiased mapping of labeled moieties through metabolic pathways. However, due to the complexity of the resulting experimental data sets, few computational tools are available for data analysis. Here we introduce geoRge, a novel computational approach capable of analyzing untargeted LC/MS data from stable isotope-labeling experiments. geoRge is written in the open-source language R and runs on the output structure of the XCMS package, which is in widespread use. As opposed to the few existing tools, which use labeled samples to track stable isotopes by iterating over all MS signals using the theoretical mass difference between the light and heavy isotopes, geoRge uses unlabeled and labeled biologically equivalent samples to compare isotopic distributions in the mass spectra. Isotopically enriched compounds change their isotopic distribution relative to unlabeled compounds, which is directly reflected in a number of new m/z peaks and higher-intensity peaks in the mass spectra of labeled samples relative to the unlabeled equivalents. The automated untargeted isotope annotation and relative quantification capabilities of geoRge are demonstrated by the analysis of LC/MS data from a human retinal pigment epithelium cell line (ARPE-19) grown on normal and high glucose concentrations, mimicking diabetic retinopathy conditions in vitro. In addition, we compared the results of geoRge with the outcome of X(13)CMS, since both approaches rely entirely on XCMS parameters for feature selection, namely m/z and retention time values. geoRge is available as an R script at https://github.com/jcapelladesto/geoRge.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
DENTEX CHALLENGE
We present the Dental Enumeration and Diagnosis on Panoramic X-rays Challenge (DENTEX), organized in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. The primary objective of this challenge is to develop algorithms that can accurately detect abnormal teeth with dental enumeration and associated diagnosis. This not only aids in accurate treatment planning but also helps practitioners carry out procedures with a low margin of error.
The challenge provides three types of hierarchically annotated data and additional unlabeled X-rays for optional pre-training. The annotation of the data is structured using the Fédération Dentaire Internationale (FDI) system. The first set is partially labeled because it only includes quadrant information. The second set is also partially labeled but contains enumeration information along with the quadrant. The third set is fully labeled because it includes all quadrant-enumeration-diagnosis information for each abnormal tooth; all participant algorithms will be benchmarked on this third set.
DENTEX aims to provide insights into the effectiveness of AI in dental radiology analysis and its potential to improve dental practice by comparing frameworks that simultaneously point out abnormal teeth with dental enumeration and associated diagnosis on panoramic dental X-rays.
DATA
The DENTEX dataset comprises panoramic dental X-rays obtained from three different institutions using standard clinical conditions but varying equipment and imaging protocols, resulting in diverse image quality reflecting heterogeneous clinical practice. The dataset includes X-rays from patients aged 12 and above, randomly selected from the hospital's database to ensure patient privacy and confidentiality.
To enable effective use of the FDI system, the dataset is hierarchically organized into three types of data:
(a) 693 X-rays labeled for quadrant detection and quadrant classes only,
(b) 634 X-rays labeled for tooth detection with quadrant and tooth enumeration classes,
(c) 1005 X-rays fully labeled for abnormal tooth detection with quadrant, tooth enumeration, and diagnosis classes.
The diagnosis class includes four specific categories: caries, deep caries, periapical lesions, and impacted teeth. An additional 1571 unlabeled X-rays are provided for pre-training.
Data Split for Evaluation and Training
The DENTEX 2023 dataset comprises three types of data: (a) partially annotated quadrant data, (b) partially annotated quadrant-enumeration data, and (c) fully annotated quadrant-enumeration-diagnosis data. The first two types of data are intended for training and development purposes, while the third type is used for training and evaluation.
To comply with standard machine learning practices, the fully annotated third dataset, consisting of 1005 panoramic X-rays, is partitioned into training, validation, and testing subsets, comprising 705, 50, and 250 images, respectively. Ground truth labels are provided only for the training data, while the validation data is provided without associated ground truth, and the testing data is kept hidden from participants.
Annotation Protocol
DENTEX provides three hierarchically annotated datasets that facilitate various dental detection tasks: (1) quadrant-only for quadrant detection, (2) quadrant-enumeration for tooth detection, and (3) quadrant-enumeration-diagnosis for abnormal tooth detection. Although it may seem redundant to provide a quadrant detection dataset, it is crucial for utilizing the FDI Numbering System. The FDI system is a globally used system that assigns each quadrant of the mouth a number from 1 through 4: the top right is 1, the top left is 2, the bottom left is 3, and the bottom right is 4. Within each quadrant, the eight teeth are numbered 1 through 8, starting at the front middle tooth and rising the farther back you go. So, for example, the back tooth on the lower right side would be 48 in FDI notation, which means quadrant 4, tooth 8. Therefore, the quadrant segmentation dataset can significantly simplify the dental enumeration task, even though evaluations will be made only on the fully annotated third dataset.
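In code, composing an FDI tooth code from its quadrant and position is a single step (an illustrative Python helper of ours, not part of the challenge tooling):

def fdi_code(quadrant: int, tooth: int) -> int:
    """FDI notation: quadrant 1-4 (top right, top left, bottom left,
    bottom right) followed by tooth position 1-8 (front middle to back)."""
    assert 1 <= quadrant <= 4 and 1 <= tooth <= 8
    return 10 * quadrant + tooth

print(fdi_code(4, 8))  # 48: the back tooth on the lower right side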
Description is from: https://zenodo.org/record/7812323#.ZDQE1uxBwUG
Grand Challenge: https://dentex.grand-challenge.org/
Cite:
[1] Ibrahim Ethem Hamamci, Sezgin Er, Enis Simsar, Anjany Sekuboyina, Mustafa Gundogar, Bernd Stadlinger, Albert Mehl, Bjoern Menze. Diffusion-Based Hierarchical Multi-Label Object Detection to Analyze Panoramic Dental X-rays, 2023. Pre-print: https://arxiv.org/abs/2303.06500
[2] Hamamci, I., Er, S., Simsar, E., Yuksel, A., Gultekin, S., Ozdemir, S., Yang, K., Li, H., Pati, S., Stadlinger, B., & others (2023). DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays. Pre-print: https://arxiv.org/abs/2305.19112
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Number of images and training parameters used to train the model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The datasets used for the paper "Fuzzy Overclustering: Semi-supervised classification of fuzzy labels with overclustering and inverse cross-entropy" in the open access journal Sensors (https://doi.org/10.3390/s21196661). The source code is available at https://github.com/Emprime/FuzzyOverclustering and the preprint at https://arxiv.org/abs/2110.06630 .
Please cite as
@Article{Schmarje2021foc,
AUTHOR = {Schmarje, Lars and Brünger, Johannes and Santarossa, Monty and Schröder, Simon-Martin and Kiko, Rainer and Koch, Reinhard},
TITLE = {Fuzzy Overclustering: Semi-Supervised Classification of Fuzzy Labels with Overclustering and Inverse Cross-Entropy},
JOURNAL = {Sensors},
VOLUME = {21},
YEAR = {2021},
NUMBER = {19},
ARTICLE-NUMBER = {6661},
URL = {https://www.mdpi.com/1424-8220/21/19/6661; https://doi.org/10.5281/zenodo.5550919},
ISSN = {1424-8220},
DOI = {10.3390/s21196661}
}
We provide the plankton and synthetic datasets used, with explanations summarized from the above-mentioned paper. The technical descriptions are given below.
Plankton Dataset
The plankton dataset contains diverse grey-level images of marine planktonic organisms. The images were captured with an Underwater Vision Profiler 5 and are hosted on EcoTaxa (https://ecotaxa.obs-vlfr.fr/). In the citizen science project PlanktonID (https://planktonid.geomar.de/en), each sample was classified multiple times by citizen scientists. We used the data generated by version two of PlanktonID. This version is only available to users who have completed at least 1000 annotations in version one. Therefore, we can assume that the annotators know the differences between the classes and are dedicated to creating consistent results. The second version of PlanktonID is a game in which example images can be sorted into three proposed classes. If none of the proposed classes seems to fit, the user has the option to select any other of the 28 classes (https://planktonid.geomar.de/en/classes) of PlanktonID (including a 'no fitting category' class). The initial proposals are generated by a neural network; hence, a confirmation bias might be introduced. Some classes of the PlanktonID dataset have very few examples in comparison to others (e.g. only three images).
The dataset consists of 12,280 images in originally 28 classes. We picked the largest classes and merged some smaller classes (e.g. different classes of detritus). All other images are assigned to the class 'no fitting category'. We used 400 training and 200 validation images per class. We selected these images from the available data where all annotators agreed on the label. All other images are used as unlabeled data. If not enough images for training and validation were available, we used random duplicates.
For more details please see the main paper ( https://doi.org/10.3390/s21196661).
Synthetic Datasets
This dataset is a mixture of circles and ellipses (bubbles) on a black background with different colors. The 6 ground-truth classes are blue, red and green circles or ellipses. An image is defined as certain if the hue of the color is 0 (red), 120 (green) or 240 (blue) and the main-axis ratio of the bubble is 1 (circle) or 2 (ellipse). Every other datapoint is considered fuzzy, and its ground-truth label is the interpolation of the 6 ground-truth classes. The dataset consists of 1800 certain and 1000 fuzzy labeled images for the training, validation and unlabeled data splits.
We defined three subsets: Ideal, Real, Fuzzy. The Ideal subset uses the majority ground-truth label for every image, which in reality is not available. For the Real subset, the ground-truth class is randomly picked from the ground-truth label distribution, representing a noisy / fuzzy annotation. The Fuzzy subset uses only certainly labeled images as training data, representing a cleaned training dataset.
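A generator for such bubble images might look as follows (a minimal sketch of ours using Pillow, not the released generation code; image size and bubble radius are arbitrary choices):

import colorsys
from PIL import Image, ImageDraw

def make_bubble(hue, axis_ratio, size=64):
    # hue in degrees (0: red, 120: green, 240: blue);
    # axis_ratio 1 gives a circle, 2 an ellipse (see the class definition above).
    r, g, b = colorsys.hsv_to_rgb(hue / 360.0, 1.0, 1.0)
    img = Image.new("RGB", (size, size), "black")
    w, h = size // 2, int(size / 2 / axis_ratio)
    cx = cy = size // 2
    ImageDraw.Draw(img).ellipse(
        (cx - w // 2, cy - h // 2, cx + w // 2, cy + h // 2),
        fill=(int(255 * r), int(255 * g), int(255 * b)))
    return img

make_bubble(120, 1).save("certain_green_circle.png")  # a certain example
make_bubble(100, 1.4).save("fuzzy_bubble.png")        # hue/ratio between classes: fuzzy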
For more details please see the main paper ( https://doi.org/10.3390/s21196661).
Technical description
Each folder represents one dataset. The subfolders train, val and unlabeled contain the Training, Validation and Unlabeled data splits, respectively. The ground-truth label is given as a folder name for each image. Each image is one datapoint.
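Under these conventions, the labels can be recovered from the directory layout alone, for example (a Python sketch; the dataset folder name "plankton" is a placeholder):

from collections import Counter
from pathlib import Path

root = Path("plankton")  # placeholder name of one dataset folder
# The parent folder name of each image is its ground-truth label.
counts = Counter(p.parent.name for p in (root / "train").glob("*/*"))
print(counts)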
The filename for the plankton data is structured as
The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras from the Microsoft Kinect. It features:
1449 densely labeled pairs of aligned RGB and depth images
464 new scenes taken from 3 cities
407,024 new unlabeled frames
Each object is labeled with a class and an instance number.
The dataset has several components:
Labeled: A subset of the video data accompanied by dense multi-class labels. This data has also been preprocessed to fill in missing depth labels.
Raw: The raw RGB, depth and accelerometer data as provided by the Kinect.
Toolbox: Useful functions for manipulating the data and labels.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"
All timestamps are given in ISO 8601 format.
The following files are included:
Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv
Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.
timestamp: Date and time of the detection.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).
waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 2: oriented upwards).
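A quick way to load and inspect these files (a pandas sketch of ours; column names as documented above):

import pandas as pd

df = pd.read_csv("Berlin2019_waggle_phases.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"])  # ISO 8601 timestamps
left_side = df[df["cam_id"] == 0]                  # 0: left side of the hive
print(left_side[["x_median", "y_median", "waggle_angle"]].describe())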
Berlin2019_dances.csv
Automatic detections of dance behavior during our recording period in 2019.
dancer_id: Unique ID of the individual bee.
dance_id: Unique ID of the dance.
ts_from, ts_to: Date and time of the beginning and end of the dance.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
median_x, median_y: Median position of the individual during the dance.
feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.
Berlin2019_followers.csv
Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.
dance_id: Unique ID of the dance being attended or followed.
follower_id: Unique ID of the individual attending or following the dance.
ts_from, ts_to: Date and time of the beginning and end of the interaction.
label: Either “attendance” or “follower”.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
Berlin2019_dances_with_manually_verified_times.csv
A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).
dance_id: Unique ID of the dance.
dancer_id: Unique ID of the dancing individual.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.
dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.
Berlin2019_dance_classifier_labels.csv
Manually annotated waggle phases and following behavior from our 2019 recording season, used to train the dancing and following classifier. Can be merged with the supplied individual detections.
timestamp: Timestamp of the individual frame the behavior was observed in.
frame_id: Unique ID of the video frame the behavior was observed in.
bee_id: Unique ID of the individual bee.
label: One of “nothing”, “waggle”, “follower”
Berlin2019_dance_classifier_unlabeled.csv
Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.
Berlin2021_waggle_phase_classifier_labels.csv
Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.
detection_id: Unique ID of the waggle phase.
label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Here “waggle” denotes a waggle phase, “activating” the shaking signal, and “ventilating” a bee fanning her wings. “trembling” denotes a tremble dance, but because the distinction from the “other” class was often unclear, “trembling” was merged into “other” for training.
orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI / 2: facing up).
metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.
Berlin2021_waggle_phase_classifier_ground_truth.zip
The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.
Berlin2019_tracks.zip
Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include here the subset that is relevant for our publication, comprising over 46 million detections. We include tracks for all detected behaviors (dancing, following, attending), including one minute before and after each behavior, as well as all tracks that correspond to the labeled and unlabeled data used to train the dance classifier, including 30 seconds before and after the data used for training.
We grouped the exported data by date to make the handling easier, but to efficiently work with the data, we recommend importing it into an indexable database.
The individual files contain the following columns:
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
timestamp: Date and time of the detection.
frame_id: Unique ID of the video frame of the recording from which the detection was extracted.
track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.
bee_id: Unique ID of the individual bee.
bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.
x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.
orientation_hive: Orientation of the bee's thorax in the hive in radians (0: oriented to the right, PI / 2: oriented upwards).
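For example, a per-individual trajectory on one camera can be reassembled by sorting and grouping on bee_id (a pandas sketch; the per-date file name is a placeholder):

import pandas as pd

tracks = pd.read_csv("2019-08-20.csv")  # placeholder for one exported per-date file
tracks["timestamp"] = pd.to_datetime(tracks["timestamp"])
cam0 = tracks[tracks["cam_id"] == 0].sort_values("timestamp")
# Number of detections per individual; positions are in x_pos_hive / y_pos_hive.
print(cam0.groupby("bee_id").size().sort_values(ascending=False).head())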
Berlin2019_feeder_experiment_log.csv
Experiment log for our feeder experiments in 2019.
date: Date given in the format year-month-day.
feeder_cam_id: Numeric ID of the feeder.
coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.
time_opened, time_closed: Date and time when the feeder was set up or closed again.
sucrose_solution: Concentration of the sucrose solution given as sugar:water (in terms of weight). On days where feeder 3 was open, the other two feeders offered water without sugar.
Software used to acquire and analyze the data:
bb_pipeline_models: Pretrained localizer and decoder models for bb_pipeline
bb_behavior: Database interaction and data (pre)processing, feature extraction
bb_wdd2: Automatic detection and decoding of honey bee waggle dances
bb_wdd_filter: Machine learning model to improve the accuracy of the waggle dance detector
bb_dance_networks: Detection of dancing and following behavior from trajectories
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This replication package contains the data and the code to generate the main results reported in "Detecting Edgeworth Cycles" by Timothy Holt, Mitsuru Igami, and Simon Scheidegger, to be published in the February 2024 issue of The Journal of Law and Economics.
Additionally, this package allows users to apply these pre-trained models to new datasets of their choice. As an example of such a new dataset, we include the entire German dataset available at the time of our research (2014:Q4–2020:Q4), including both manually labeled and unlabeled subsamples.
Finally, this package includes tools to facilitate the acquisition and pre-processing of the most recent data from Germany, which is updated every day on the Tankerkoenig website (at the time of our preparation of this package).
[3/29/2024 update] The URL for the German antitrust authority's fuel-data website has changed to https://www.bundeskartellamt.de/EN/Tasks/markettransparencyunit_fuels/markettransparencyunit_fuels.html.