The Fuel Economy Label and CAFE Data asset contains measured summary fuel economy estimates and test data submitted by light-duty vehicle manufacturers, by model, for certification as required under the Energy Policy and Conservation Act of 1975 (EPCA) and the Energy Independence and Security Act of 2007 (EISA), which require the collection of vehicle fuel economy estimates for the creation of Fuel Economy Labels and for the calculation of Corporate Average Fuel Economy (CAFE). Manufacturers submit data on an annual basis, or as needed to document vehicle model changes.

The EPA performs targeted fuel economy confirmatory tests on approximately 15% of vehicles submitted for validation. Confirmatory data for a vehicle is associated with its corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Submitted data comes in XML format or as documents, with the majority of submissions sent in XML, and includes descriptive information on the vehicle itself, fuel economy information, and the manufacturer's testing approach. This data may contain confidential business information (CBI), such as estimated sales or other data elements indicated by the submitter as confidential. CBI data is not publicly available; however, within the EPA the data can be accessed under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Datasets are segmented by vehicle model/manufacturer and/or year, with corresponding fuel economy, test, and certification data. Data assets are stored in EPA's Verify system.

Coverage began in 1974; early records are primarily paper documents that did not go through the same level of validation as the primarily digital submissions that began in 2008. Early data is available to the public digitally starting from 1978, but more complete digital certification data is available starting in 2008. Fuel economy submission data prior to 2006 was calculated using an older formula; however, mechanisms exist to make this data comparable to current results.

Fuel Economy Label and CAFE Data submission documents with metadata, certificate, and summary decision information are used and made publicly available through the EPA/DOE Fuel Economy Guide website (https://www.fueleconomy.gov/) as well as EPA's SmartWay Program website (https://www.epa.gov/smartway/) and Green Vehicle Guide website (http://ofmpub.epa.gov/greenvehicles/Index.do) after it has been quality assured. Where summary data appears inaccurate, OTAQ returns the entries for review to their originator.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This guide is designed to assist in the preparation of labels that comply with Canadian regulatory requirements for cosmetics.
The documentation below is in reference to this item's placement in the NM Supply Chain Data Hub. The documentation is of use in understanding the source of this item and how to reproduce it for updates.
Title: FOOD LABELS EXPOSED: A definitive guide to common food label terms and claims
Item Type: pdf
Summary: Definitions of common food labels and certifications and verifying organizations.
Notes:
Prepared by: Uploaded by EMcRae_NMCDC
Source: https://agreenerworld.org/certifications/animal-welfare-approved/
Feature Service: https://nmcdc.maps.arcgis.com/home/item.html?id=64388624b30242298d4133a61289ac58
UID: 24
Data Requested: https://agreenerworld.org/certifications/animal-welfare-approved/
Method of Acquisition: Downloaded from public website A Greener World
Date Acquired: 6/23/22
Priority rank as identified in 2022 (scale of 1 being the highest priority, to 11 being the lowest priority): 5
Tags: PENDING
This dataset was created by the DC Office of Planning and provides a simplified representation of the neighborhoods of the District of Columbia. These boundaries are used by the Office of Planning to determine appropriate locations for placement of neighborhood names on maps. They do not reflect detailed boundary information, do not necessarily include all commonly-used neighborhood designations, do not match planimetric centerlines, and do not necessarily match Neighborhood Cluster boundaries. There is no formal set of standards that describes which neighborhoods are represented or where boundaries are placed. These informal boundaries are not appropriate for display, calculation, or reporting. Their only appropriate use is to guide the placement of text labels for DC's neighborhoods. This is an informal product used for internal mapping purposes only. It should be considered draft, will be subject to change on an irregular basis, and is not intended for publication.
A Guide to the Thai Green Label Scheme
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Isobaric labeling-based proteomics is widely applied in deep proteome quantification. Among the platforms for isobaric labeled proteomic data analysis, the commercial software Proteome Discoverer (PD) is widely used, incorporating the search engine CHIMERYS, while FragPipe (FP) is relatively new, free for noncommercial purposes, and integrates the engine MSFragger. Here, we compared PD and FP on three public proteomic data sets labeled using 6plex, 10plex, and 16plex tandem mass tags. Our results showed the protein abundances generated by the two software packages are highly correlated. PD quantified more proteins (10.02%, 15.44%, 8.19%) than FP, with comparable NA ratios (0.00% vs. 0.00%, 0.85% vs. 0.38%, and 11.74% vs. 10.52%), in the three data sets. Using the 16plex data set, PD and FP outputs showed high consistency in quantifying technical replicates, batch effects, and functional enrichment of differentially expressed proteins. However, FP saved 93.93%, 96.65%, and 96.41% of processing time compared to PD for analyzing the three data sets, respectively. In conclusion, while PD is a well-maintained commercial package that integrates various additional functions and can quantify more proteins, FP is freely available and achieves similar output with a shorter computational time. Our results will guide users in choosing the most suitable quantification software for their needs.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
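A minimal sketch of the capped random selection described above, in Python. The record structure and field names ("species", "grid_cell") are hypothetical; NABat's actual implementation is not published here.

import random
from collections import defaultdict

def sample_recordings(records, cap=1250, seed=0):
    # Group records by species/grid-cell combination (hypothetical keys).
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["species"], rec["grid_cell"])].append(rec)
    rng = random.Random(seed)
    selected = []
    for files in groups.values():
        # Randomly select files until the group is exhausted or the cap is hit.
        rng.shuffle(files)
        selected.extend(files[:cap])
    return selected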
The documentation below is in reference to this item's placement in the NM Supply Chain Data Hub. The documentation is of use in understanding the source of this item and how to reproduce it for updates.
Title: Animal Welfare Institute's Consumers Guide Food Labels Animal Welfare 2022
Item Type: PDF
Summary: "A Consumer's Guide to Food Labels and Animal Welfare. Many food labels are confusing, if not downright misleading. While some food label claims have definitions controlled by the government, most do not have legal definitions. In addition, most label claims are 'self-made' by the company merely for marketing purposes, and the accuracy of the claims is not verified." - AWI
Notes: More information here: https://awionline.org/content/consumers-guide-food-labels-and-animal-welfare
Prepared by: Uploaded by EMcRae_NMCDC
Source: https://awionline.org/store/catalog/animal-welfare-publications/farm-animals/consumers-guide-food-labels-and-animal
Feature Service: https://nmcdc.maps.arcgis.com/home/item.html?id=071a96ba12e949008bbc79ec9a3f217b
UID: 24
Data Requested: Current regulations: who qualifies and who doesn't, who can we help qualify (GAP certs, procedure rules, etc.)
Method of Acquisition: Downloaded free PDF from the AWI website
Date Acquired: 6/14/22
Priority rank as identified in 2022 (scale of 1 being the highest priority, to 11 being the lowest priority): 5
Tags: PENDING
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
StreetSurfaceVis
StreetSurfaceVis is an image dataset containing 9,122 street-level images from Germany with labels on road surface type and quality. The CSV file streetSurfaceVis_v1_0.csv contains all image metadata and four folders contain the image files. All images are available in four different sizes, based on the image width: 256px, 1024px, 2048px, and the original size. Folders containing the images are named according to the respective image size. Image files are named based on the mapillary_image_id.
You can find the corresponding publication here: StreetSurfaceVis: a dataset of crowdsourced street-level imagery with semi-automated annotations of road surface type and quality
Image metadata
Each CSV record contains information about one street-level image with the following attributes:
mapillary_image_id: ID provided by Mapillary (see information below on Mapillary)
user_id: Mapillary user ID of contributor
user_name: Mapillary user name of contributor
captured_at: timestamp, capture time of image
longitude, latitude: location the image was taken at
train: suggested split into train and test data. True for train data and False for test data. Test data contains data from 5 cities which are excluded from the training data (see the loading sketch after this list).
surface_type: Surface type of the road in the focal area (the center of the lower image half) of the image. Possible values: asphalt, concrete, paving_stones, sett, unpaved
surface_quality: Surface quality of the road in the focal area of the image. Possible values: (1) excellent, (2) good, (3) intermediate, (4) bad, (5) very bad (see the attached Labeling Guide document for details)
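A minimal loading sketch in Python, assuming pandas is installed and the CSV sits in the working directory:

import pandas as pd

# Load the image metadata shipped with the dataset.
df = pd.read_csv("streetSurfaceVis_v1_0.csv")

# The boolean 'train' flag encodes the suggested split; test rows come
# from 5 cities that are held out from the training data.
train_df = df[df["train"]]
test_df = df[~df["train"]]
print(len(train_df), "train /", len(test_df), "test images")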
Image source
Images are obtained from Mapillary, a crowd-sourcing platform for street-level imagery. More metadata about each image can be obtained via the Mapillary API. User-generated images are shared by Mapillary under the CC-BY-SA License.
For each image, the dataset contains the mapillary_image_id and user_name. You can access user information on the Mapillary website at https://www.mapillary.com/app/user/ followed by the user_name, and image information at https://www.mapillary.com/app/?focus=photo&pKey= followed by the mapillary_image_id.
If you use the provided images, please adhere to the terms of use of Mapillary.
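For illustration, attribution links can be assembled from these fields following the URL templates above; the example values here are hypothetical:

# Hypothetical example values taken from the CSV columns.
mapillary_image_id = "123456789012345"
user_name = "example_user"

user_url = f"https://www.mapillary.com/app/user/{user_name}"
image_url = f"https://www.mapillary.com/app/?focus=photo&pKey={mapillary_image_id}"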
Instances per class
Total number of images: 9,122
               excellent   good   intermediate   bad   very bad
asphalt              971   1697            821     -          -
concrete             314    350            250     -          -
paving stones        385   1063            519     -          -
sett                   -    129            694     -          -
unpaved                -      -            326   387        303
For modeling, we recommend using a train-test split where the test data includes geospatially distinct areas, thereby ensuring the model's ability to generalize to unseen regions is tested. We propose five cities varying in population size and from different regions in Germany for testing - images are tagged accordingly.
Number of test images (train-test split): 776
Inter-rater reliability
Three annotators labeled the dataset, such that each image was annotated by one person. Annotators were encouraged to consult each other for a second opinion when uncertain. 1,800 images were annotated by all three annotators, resulting in a Krippendorff's alpha of 0.96 for surface type and 0.74 for surface quality.
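For reference, agreement scores of this kind can be computed with the krippendorff package from PyPI; this is a sketch under that assumption (the authors' own tooling is not specified), with hypothetical label codes:

import krippendorff  # pip install krippendorff

# One row per annotator, one column per image; values are integer-coded
# labels (codes here are hypothetical), and NaN would mark skipped images.
reliability_data = [
    [1, 2, 2, 1, 3],
    [1, 2, 2, 1, 3],
    [1, 2, 1, 1, 3],
]
# Surface type is a nominal scale; surface quality would be ordinal.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")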
Recommended image preprocessing
As the labeled focal road is located in the bottom center of the street-level image, it is recommended to crop images to their lower center (middle half of the width, lower half of the height) prior to using them for classification tasks.
This is exemplary code for the recommended image preprocessing in Python:

from PIL import Image

img = Image.open(image_path)  # image_path: path to one of the provided images
width, height = img.size
# Keep the middle 50% of the width and the lower 50% of the height,
# where the labeled focal road is located.
img_cropped = img.crop((0.25 * width, 0.5 * height, 0.75 * width, height))
License
CC-BY-SA
Citation
If you use this dataset, please cite as:
Kapp, A., Hoffmann, E., Weigmann, E. et al. StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality. Sci Data 12, 92 (2025). https://doi.org/10.1038/s41597-024-04295-9
@article{kapp_streetsurfacevis_2025,
  title = {{StreetSurfaceVis}: a dataset of crowdsourced street-level imagery annotated by road surface type and quality},
  volume = {12},
  issn = {2052-4463},
  url = {https://doi.org/10.1038/s41597-024-04295-9},
  doi = {10.1038/s41597-024-04295-9},
  pages = {92},
  number = {1},
  journaltitle = {Scientific Data},
  shortjournal = {Scientific Data},
  author = {Kapp, Alexandra and Hoffmann, Edith and Weigmann, Esther and Mihaljević, Helena},
  date = {2025-01-16},
}
This is part of the SurfaceAI project at the University of Applied Sciences, HTW Berlin.
Contact: surface-ai@htw-berlin.de
https://surfaceai.github.io/surfaceai/
Funding: SurfaceAI is an mFUND project funded by the German Federal Ministry for Digital and Transport.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This guide provides guidance and information about the packaging and labelling requirements for cannabis products under the Cannabis Act and the Cannabis Regulations.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This guide provides information about the packaging and labelling requirements for cannabis and cannabis products under the Cannabis Act and the Cannabis Regulations.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Integrating cross-linking mass spectrometry (XL-MS) into structural biology workflows provides valuable information about the spatial arrangement of amino acid stretches, which can guide elucidation of protein assembly architecture. Additionally, the combination of XL-MS with peptide quantitation techniques is a powerful approach to delineate protein interface dynamics across diverse conditions. While XL-MS is increasingly effective with isolated proteins or small complexes, its application to whole-cell samples poses technical challenges related to analysis depth and throughput. The use of enrichable cross-linkers has greatly improved the detectability of protein interfaces on a proteome-wide scale, facilitating global protein–protein interaction mapping. Therefore, bringing together enrichable cross-linking and multiplexed peptide quantification is an appealing approach to enable comparative characterization of structural attributes of proteins and protein interactions. Here, we combined phospho-enrichable cross-linking with TMT labeling to develop a streamlined workflow (PhoXplex) for the detection of differential structural features across a panel of cell lines at a global scale. We achieved deep coverage with quantification of over 9000 cross-links and long loop-links in total, including potentially novel interactions. Overlaying AlphaFold predictions and disordered protein annotations enables exploration of the quantitative cross-linking data set to reveal possible associations between mutations and protein structures. Lastly, we discuss current shortcomings and perspectives for deep whole-cell profiling of protein interfaces at large scale.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Stable-isotope labeling experiments are widely used to investigate the topology and functioning of metabolic networks. Label incorporation into metabolites can be quantified using a broad range of mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy methods, but in general, no single approach can completely cover isotopic space, even for small metabolites. The number of quantifiable isotopic species could be increased and the coverage of isotopic space improved by integrating measurements obtained by different methods; however, this approach has remained largely unexplored because no framework able to deal with partial, heterogeneous isotopic measurements has yet been developed. Here, we present a generic computational framework based on symbolic calculus that can integrate any isotopic data set by connecting measurements to the chemical structure of the molecules. As a test case, we apply this framework to isotopic analyses of amino acids, which are ubiquitous to life, central to many biological questions, and can be analyzed by a broad range of MS and NMR methods. We demonstrate how this integrative framework helps to (i) clarify and improve the coverage of isotopic space, (ii) evaluate the complementarity and redundancy of different techniques, (iii) consolidate isotopic data sets, (iv) design experiments, and (v) guide future analytical developments. This framework, which can be applied to any labeled element, isotopic tracer, metabolite, and analytical platform, has been implemented in IsoSolve (available at https://github.com/MetaSys-LISBP/IsoSolve and https://pypi.org/project/IsoSolve), an open-source software that can be readily integrated into data analysis pipelines.
This dataset contains field boundaries and crop type information for fields in Kenya. The PlantVillage app is used to collect multiple points around each field, and collectors have access to basemap imagery in the app during data collection. They use the basemap as a guide in collecting and verifying the points.
After ground data collection, Radiant Earth Foundation conducted a quality control of the polygons using Sentinel-2 imagery of the growing season as well as Google basemap imagery. Two actions were taken on the data: 1) several polygons that had overlapping areas with different crop labels were removed; 2) invalid polygons, where multiple points were collected in corners of the field (within a distance of less than 0.5 m) and the overall shape was not convex, were corrected. Finally, ground reference polygons were matched with corresponding time series data from Sentinel-2 satellites (listed in the source imagery property of each label item).
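The two quality-control checks described above can be sketched in Python with shapely (a tooling assumption; Radiant Earth Foundation's actual procedure is not specified), with the convexity tolerance as a hypothetical parameter:

from shapely.geometry import Polygon

def overlap_conflict(label_a, poly_a, label_b, poly_b):
    # Flag field pairs whose polygons overlap but carry different crop labels.
    return label_a != label_b and poly_a.intersection(poly_b).area > 0

def suspect_shape(poly: Polygon, convexity_tol=0.95):
    # Crude screen for the invalid shapes described above: an invalid ring,
    # or a polygon far from its convex hull (tolerance is an assumption).
    return (not poly.is_valid) or poly.area < convexity_tol * poly.convex_hull.area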
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Glycans are vital biomolecules with diverse functions in biological processes. Mass spectrometry (MS) has become the most widely employed technology for glycomics studies. However, in the traditional data-dependent acquisition mode, only a subset of the abundant ions during MS1 scans are isolated and fragmented in subsequent MS2 events, which reduces reproducibility and prevents the measurement of low-abundance glycan species. Here, we report a new method termed 6-plex mdSUGAR isobaric-labeling guide fingerprint embedding (MAGNI), to achieve multiplexed, quantitative, and targeted glycan analysis. The glycan peak signature was embedded by a triplicate-labeling strategy with a 6-plex mdSUGAR tag, and using ultrahigh-resolution mass spectrometers, the low-abundance glycans that carry the mass fingerprints can be recognized on the MS1 spectra through an in-house developed software tool, MAGNIFinder. These embedded unique fingerprints can guide the selection and fragmentation of targeted precursor ions and further provide rich information on glycan structures. Quantitative analysis of two standard glycoproteins demonstrated the accuracy and precision of MAGNI. Using this approach, we identified 304 N-glycans in two ovarian cancer cell lines. Among them, 65 unique N-glycans were found differentially expressed, which indicates a distinct glycosylation pattern for each cell line. Remarkably, 31 N-glycans can be quantified in only 1 × 10³ cells, demonstrating the high sensitivity of our method. Taken together, our MAGNI method offers a useful tool for low-abundance N-glycan characterization and is capable of determining small quantitative differences in N-glycan profiling. Therefore, it will be beneficial to the field of glycobiology and will expand our understanding of glycosylation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene clustering is one of the important techniques to identify co-expressed gene groups from gene expression data, which provides a powerful tool for investigating functional relationships of genes in biological processes. Self-training is an important kind of semi-supervised learning method and has exhibited good performance on the gene clustering problem. However, the self-training process inevitably suffers from mislabeling, the accumulation of which will lead to the degradation of semi-supervised learning performance on gene expression data. To solve this problem, this paper proposes a self-training subspace clustering algorithm based on adaptive confidence for gene expression data (SSCAC), which combines the low-rank representation of gene expression data and adaptive adjustment of label confidence to better guide the partition of unlabeled data. The superiority of the proposed SSCAC algorithm is mainly reflected in the following aspects. 1) In order to improve the discriminative property of gene expression data, the low-rank representation with distance penalty is used to mine the potential subspace structure of data. 2) Considering the problem of mislabeling in self-training, a semi-supervised clustering objective function with label confidence is proposed, and a self-training subspace clustering framework is constructed on this basis. 3) In order to mitigate the negative impact of mislabeled data, an adaptive adjustment strategy based on the gravitational search algorithm is proposed for label confidence. Compared with a variety of state-of-the-art unsupervised and semi-supervised learning algorithms, the SSCAC algorithm has demonstrated its superiority through extensive experiments on two benchmark gene expression datasets.
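For readers unfamiliar with self-training, the sketch below shows a generic fixed-threshold self-training baseline using scikit-learn. This is not the SSCAC algorithm itself: SSCAC replaces the fixed confidence threshold with adaptively adjusted label confidence, and the toy data here is random.

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # toy stand-in for expression profiles
y = rng.integers(0, 3, size=200)    # toy cluster labels
y_semi = y.copy()
y_semi[rng.random(200) < 0.8] = -1  # -1 marks unlabeled samples

# Pseudo-label only predictions above a fixed confidence threshold;
# mislabeled pseudo-labels can still accumulate, which is the failure
# mode SSCAC's adaptive label confidence is designed to curb.
model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
model.fit(X, y_semi)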
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nutrition labels on packaged food items provide at-a-glance information about the nutritional composition of the food, serving as a quick guide for consumers to assess the quality of food products. The aim of the current study is to evaluate the nutritional information on the front- and back-of-pack labels of selected packaged foods in the Indian market. A total of 432 food products in six categories (idli mix, breakfast cereals, porridge mix, soup mix, beverage mix and extruded snacks) were investigated by a survey. Nutritional profiling of the foods was done based on the Food Safety and Standards Authority of India (FSSAI) claims regulations. The healthiness of the packaged foods was assessed utilising the nutritional traffic light system. The products were classified into ‘healthy’, ‘moderately healthy’ and ‘less healthy’ based on the fat, saturated fat, and sugar content. Most of the food products evaluated belong to the ‘healthy’ and ‘moderately healthy’ categories, except for products in the extruded snacks category. Reformulation of extruded snacks is necessary to decrease the total and saturated fat content. The nutrient content claims were classified using the International Network for Food and Obesity / NCDs Research, Monitoring and Action Support (INFORMAS) taxonomy. Protein, dietary fibre, fat, sugar, vitamins and minerals were the most referred-to nutrients in the nutrient content claims. Breakfast cereals carried the highest number of nutritional claims while porridge mix had the lowest number of claims. The overall compliance of the nutrient content claims for the studied food products is 80.5%. This study gives an overall view of the nutritional quality of the studied convenience food products and snacks in the Indian market.
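A minimal sketch of a traffic-light classification in Python. The per-100 g cutoffs follow the UK FSA front-of-pack scheme and are an assumption here; the mapping onto the study's three categories is likewise a hypothetical illustration, as the paper's exact thresholds are not given above.

# Per-100 g bands: (low <=, high >) in grams. UK FSA cutoffs, assumed.
BANDS = {"fat": (3.0, 17.5), "sat_fat": (1.5, 5.0), "sugar": (5.0, 22.5)}

def light(nutrient, g_per_100g):
    low, high = BANDS[nutrient]
    if g_per_100g <= low:
        return "green"
    return "red" if g_per_100g > high else "amber"

def classify(fat, sat_fat, sugar):
    lights = [light("fat", fat), light("sat_fat", sat_fat), light("sugar", sugar)]
    # Hypothetical mapping onto the study's three categories.
    if "red" in lights:
        return "less healthy"
    return "healthy" if lights.count("green") == 3 else "moderately healthy"

print(classify(fat=2.0, sat_fat=1.0, sugar=4.0))  # -> healthy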
http://dcat-ap.de/def/licenses/other-open
[Image: data acquisition platform and overview of sequences (https://www.dzsf.bund.de/DZSF/DE/DZSF_Intern/Bilder_FID_Move/osdar23.png). Source: own illustration based on images by DB Netz AG]

## Project Info
The "Open Sensor Data for Rail 2023" (OSDaR23, 10.57806/9mv146r0) has been created in a joint research project by the German Centre for Rail Traffic Research at the Federal Railway Authority (DZSF), Digitale Schiene Deutschland / DB Netz AG, and FusionSystems GmbH. The research report and Labeling Guide can be obtained from the DZSF website. The data set consists of 45 sequences of annotated multi-sensor data (color camera, infrared camera, lidar, radar, localization, IMU). Data have been collected on different railway tracks in Hamburg, Germany.

## License Info

The "Open Sensor Data for Rail 2023" (OSDaR23, 10.57806/9mv146r0) is published by the German Centre for Rail Traffic Research at the Federal Railway Authority (DZSF). Annotation data (file type .json) are published under CC0 1.0. Sensor data (file types .png, .pcd, and .csv) are published under CC BY-SA 3.0 de.

## Further Info

The data set can be used in Python with the RailLabel package published by DB Netz AG. The data set can be viewed, for example, with the WebLabel Player published by the Vicomtech Research Foundation. (Disclaimer: Vicomtech was not part of the research project and there are currently no further relationships between DZSF and Vicomtech.)

## Statistics

Number of multi-sensor frames: 1534

Statistics of annotated objects:

object class          number of annotations
------------------    ---------------------
person                               73,421
crowd                                 1,352
train                                 8,290
wagons                                  110
bicycle                               1,779
group of bicycles                       644
motorcycle                               14
road vehicle                         12,669
animal                                3,288
group of animals                          0
wheelchair                                0
drag shoe                                79
track                                18,543
transition                              636
switch                                2,947
catenary pole                        27,706
signal pole                          14,374
signal                               32,790
signal bridge                           312
buffer stop                           4,539
flame                                   410
smoke                                   188
------------------    ---------------------
total                               204,091
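A minimal sketch of loading an annotation file with the RailLabel package mentioned above; the file name is hypothetical and the attribute names are assumed from the package documentation, so check the RailLabel docs for your version.

import raillabel  # pip install raillabel

# Load one annotated scene (hypothetical file name).
scene = raillabel.load("osdar23_sequence.json")
print(len(scene.frames), "annotated frames")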
Introduction

This dataset supports Ye et al. 2024, Nature Communications.
Ye, S., Filippova, A., Lauer, J. et al. SuperAnimal pretrained pose estimation models for behavioral analysis. Nat Commun 15, 5165 (2024). https://doi.org/10.1038/s41467-024-48792-2
Please cite this dataset and paper if you use this resource. Please also see Ye et al. 2024 for the full DataSheet accompanying this download, including the metadata for how to use this data if you want to compare model results on benchmark tasks. Below is just a summary. Also see the dataset licensing below.
Training Data

The models were trained jointly on the following datasets:

AwA-Pose: quadruped dataset; see full details at (1).
AnimalPose: see full details at (2).
AcinoSet: see full details at (3).
Horse-30: Horse-30 dataset, whose benchmark task is called Horse-10; see full details at (4).
StanfordDogs: see full details at (5, 6).
AP-10K: see full details at (7).
iRodent: We utilized the iNaturalist API functions for scraping observations with the taxon ID of Suborder Myomorpha (8). The functions allowed us to filter the large number of observations down to the ones with photos under the CC BY-NC creative license. The most common types of rodents from the collected observations are Muskrat (Ondatra zibethicus), Brown Rat (Rattus norvegicus), House Mouse (Mus musculus), Black Rat (Rattus rattus), Hispid Cotton Rat (Sigmodon hispidus), Meadow Vole (Microtus pennsylvanicus), Bank Vole (Clethrionomys glareolus), Deer Mouse (Peromyscus maniculatus), White-footed Mouse (Peromyscus leucopus), and Striped Field Mouse (Apodemus agrarius). We then generated segmentation masks over target animals in the data by processing the media through a Mask Region-based Convolutional Neural Network (Mask R-CNN) (9) model with a ResNet-50-FPN backbone (10), pretrained on the COCO dataset (11); a sketch of this step follows the list. The processed 443 images were then manually labeled with both pose annotations and segmentation masks. iRodent data is banked at https://zenodo.org/record/8250392.
APT-36K: see full details at (12).

[Image: keypoint guide]
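A sketch of the mask-generation step used for iRodent. The description above specifies the architecture (Mask R-CNN with a ResNet-50-FPN backbone, pretrained on COCO); using torchvision for it here is an assumption, not the authors' stated tooling.

import torch
import torchvision

# COCO-pretrained Mask R-CNN with a ResNet-50-FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # placeholder RGB tensor scaled to [0, 1]
with torch.no_grad():
    pred = model([image])[0]
# Keep per-instance masks above a confidence cutoff (0.5 is an assumption).
masks = pred["masks"][pred["scores"] > 0.5]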
Ethical Considerations

• No experimental data was collected for this model; all datasets used are cited above.
Caveats and Recommendations

• Please note that each dataset was labeled by separate labs and separate individuals; therefore, while we map names to a unified pose vocabulary, there will be annotator bias in keypoint placement (see Ye et al. 2024 for our Supplementary Note on annotator bias). You will also note the dataset is highly diverse across species, but collectively has more representation of domesticated animals like dogs, cats, horses, and cattle. If the performance of a model trained on this data is not as good as you need it to be, we recommend first trying video adaptation (see Ye et al. 2024), or fine-tuning the weights with your own labeling.
License

Modified MIT.
Copyright 2023-present by Mackenzie Mathis, Shaokai Ye, and contributors.
Permission is hereby granted to you (hereafter "LICENSEE") a fully-paid, non-exclusive, and non-transferable license for academic, non-commercial purposes only (hereafter “LICENSE”) to use the "DATASET" subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software:
This data or resulting software may not be used to harm any animal deliberately.
LICENSEE acknowledges that the DATASET is a research tool. THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
If this license is not appropriate for your application, please contact Prof. Mackenzie W. Mathis (mackenzie@post.harvard.edu) for a commercial use license.
Please cite Ye et al. 2024 if you use this DATASET in your work.
References

(1) Prianka Banik, Lin Li, and Xishuang Dong. A novel dataset for keypoint detection of quadruped animals from images. ArXiv, abs/2108.13958, 2021.
(2) Jinkun Cao, Hongyang Tang, Haoshu Fang, Xiaoyong Shen, Cewu Lu, and Yu-Wing Tai. Cross-domain adaptation for animal pose estimation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9497–9506, 2019.
(3) Daniel Joska, Liam Clark, Naoya Muramatsu, Ricardo Jericevich, Fred Nicolls, Alexander Mathis, Mackenzie W. Mathis, and Amir Patel. AcinoSet: A 3D pose estimation dataset and baseline models for cheetahs in the wild. 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13901–13908, 2021.
(4) Alexander Mathis, Thomas Biasi, Steffen Schneider, Mert Yuksekgonul, Byron Rogers, Matthias Bethge, and Mackenzie W. Mathis. Pretraining boosts out-of-domain robustness for pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1859–1868, 2021.
(5) Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
(6) Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, and Roberto Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. In Asian Conference on Computer Vision, pages 3–19. Springer, 2018.
(7) Hang Yu, Yufei Xu, Jing Zhang, Wei Zhao, Ziyu Guan, and Dacheng Tao. AP-10K: A benchmark for animal pose estimation in the wild. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
(8) iNaturalist. GBIF Occurrence Download. https://doi.org/10.15468/dl.p7nbxt. iNaturalist, July 2020.
(9) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
(10) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection, 2016.
(11) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
(12) Yuxiang Yang, Junjie Yang, Yufei Xu, Jing Zhang, Long Lan, and Dacheng Tao. APT-36K: A large-scale benchmark for animal pose estimation and tracking. Advances in Neural Information Processing Systems, 35:17301–17313, 2022.