Advanced Diagnostics and Prognostics Testbed (ADAPT)
Project Lead: Scott Poll
Subject: Fault diagnosis in electrical power systems
Description: The Advanced Diagnostics and Prognostics Testbed (ADAPT) lab at the NASA Ames Research Center aims to provide a means to assess the effectiveness of diagnostic algorithms at detecting faults in power systems. The algorithms are evaluated using data from the Electrical Power System (EPS), which simulates the functions of a typical aerospace vehicle power system. The EPS allows for the controlled insertion of faults in repeatable failure scenarios to test whether diagnostic algorithms can detect and isolate these faults.
How Data Was Acquired: This dataset was generated from the EPS in the ADAPT lab. Each data file corresponds to one experimental run of the testbed. During an experiment, a data acquisition system commands the testbed into different configurations and records data from sensors that measure system variables such as voltages, currents, temperatures, and switch positions. Faults were injected in some of the experimental runs.
Sample Rates and Parameter Descriptions: Data was sampled at a rate of 2 Hz and saved into a tab-delimited plain text file. There are 128 sensors in total, and typical experimental runs last approximately five minutes. The text files have also been converted into MATLAB environment files containing equivalent data that may be imported for viewing or computation.
Faults and Anomalies: Faults were injected into the EPS by physical or software means. Physical faults include disconnecting sources, sinks, or circuit breakers. For software faults, user commands are passed through an Antagonist function before being received by the EPS, and sensor data is filtered through the same function before being seen by the user. The Antagonist function was able to block user commands, send spurious commands, and alter sensor data.
External Links: Additional data from the ADAPT EPS testbed can be found at the DXC competition page - https://dashlink.arc.nasa.gov/topic/diagnostic-challenge-competition/
Other Notes: The HTML diagrams can be viewed in any browser, but their active content runs best in Internet Explorer.
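Since each run is a tab-delimited text file sampled at 2 Hz, loading one for analysis is straightforward. Below is a minimal pandas sketch; the filename and the presence of a header row are assumptions, since the exact file layout isn't specified here.

```python
import pandas as pd

# Hypothetical filename; actual ADAPT files follow their own naming scheme.
path = "adapt_run_001.txt"

# Tab-delimited sensor log; header row assumed (use header=None if absent).
df = pd.read_csv(path, sep="\t")

# At 2 Hz, the run duration in seconds is half the number of samples.
print(f"{len(df)} samples = {len(df) / 2:.0f} s of data, {df.shape[1]} columns")
```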
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Pepsi Can Detection is a dataset for object detection tasks - it contains Pepsi annotations for 200 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions or anomalies on CANs. Producing vehicular CAN data with a variety of intrusions is a difficult task for most researchers, as it requires expensive assets and deep expertise. To illuminate this task, we introduce the first comprehensive guide to the existing open CAN intrusion detection system (IDS) datasets. We categorize attacks on CANs, including fabrication (adding frames, e.g., flooding or targeting an ID), suspension (removing an ID’s frames), and masquerade attacks (spoofed frames sent in lieu of suspended ones). We provide a quality analysis of each dataset; an enumeration of each dataset’s attacks, benefits, and drawbacks; categorization as real vs. simulated CAN data and real vs. simulated attacks; whether the data is raw CAN data or signal-translated; the number of vehicles/CANs; the quantity of data in terms of time; and finally a suggested use case for each dataset. State-of-the-art public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks, often in synthetic data lacking fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but is missing a corresponding “raw” binary version. This issue pigeonholes CAN IDS research into testing on limited and often inappropriate data (usually with attacks that are too easily detectable to truly test the method). The scarcity of appropriate data has stymied comparability and reproducibility of results for researchers. As our primary contribution, we present the Real ORNL Automotive Dynamometer (ROAD) CAN IDS dataset, consisting of over 3.5 hours of one vehicle’s CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real (i.e., non-simulated) fuzzing, fabrication, and unique advanced attacks, as well as simulated masquerade attacks. To facilitate a benchmark for CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS research field.
https://brightdata.com/license
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

Dataset Features
Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.

Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.

Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and job market dynamics.
Customizable Subsets for Specific Needs

Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

Popular Use Cases
Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.

Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.

Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.

Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.

AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.
Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
https://creativecommons.org/publicdomain/zero/1.0/
By US Open Data Portal, data.gov [source]
This dataset contains in-depth facility-level information on industrial combustion energy use in the United States. It provides an essential resource for understanding consumption patterns across different sectors and industries, as reported by large emitters (>25,000 metric tons CO2e per year) under the U.S. EPA's Greenhouse Gas Reporting Program (GHGRP). Our records have been calculated using EPA default emissions factors and contain data on fuel type, location (latitude, longitude), combustion unit type, and energy end use classified by manufacturing NAICS code. Additionally, our dataset reveals valuable insight into the thermal spectrum of low-temperature energy use from the 2010 Energy Information Administration Manufacturing Energy Consumption Survey (MECS). This information is critical to assessing industrial trends of energy consumption in manufacturing sectors and can serve as an informative baseline for efficient or renewable alternative plans of operation at these facilities. With this dataset you're just a few clicks away from analyzing research questions related to consumption levels across industries, waste issues associated with unconstrained fossil fuel burning practices, and their environmental impacts.
This dataset provides detailed information on industrial combustion energy end use in the United States. Knowing how certain industries use fuel can be valuable for those interested in reducing energy consumption and its associated environmental impacts.
To make the most of this dataset, users should first become familiar with what's included by looking at the columns and their respective definitions. After becoming familiar with the data, users can explore areas of interest such as Fuel Type, Report Year, Primary NAICS Code, and emissions indicators. Focusing on granular, specific details will support a stronger analysis and better-grounded conclusions.

Next steps could include filtering the data down by region or end-use type (such as directly related processes or indirect support activities). Segmenting the data further makes it possible to identify trends in fuel type used in different regions or to compare emissions indicators between different processes within manufacturing industries. Taking a closer look through this lens may reveal valuable insights that can inform better decision making about reducing energy consumption throughout industry, in both the public and private sectors.
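As a starting point for that kind of filtering, here is a minimal pandas sketch. The filename is hypothetical, and the column labels follow those cited above (Fuel Type, Report Year, Primary NAICS Code); both should be checked against the actual download.

```python
import pandas as pd

# Hypothetical filename for the facility-level GHGRP combustion data.
df = pd.read_csv("ghgrp_combustion_energy_end_use.csv")

# Filter one report year and one (placeholder) fuel type, then count
# facilities per manufacturing NAICS code.
subset = df[(df["Report Year"] == 2014) & (df["Fuel Type"] == "Natural Gas")]
print(subset.groupby("Primary NAICS Code").size().sort_values(ascending=False).head())
```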
If exploring specific trends within industry is not of particular interest, and you would rather understand general patterns among large emitters across regions, it may be beneficial to group like data together and take averages over larger samples that better represent total production across an area or multiple states (the timeline varies depending on needs). This approach opens up possibilities for exploring correlations between economic productivity metrics and industrial energy use over time, which could lead to more formal investigations of where progress is being made toward improved resource-efficiency standards in certain industries or areas of production compared with less efficient sectors or regions, all from what's already present here!
By leveraging the information provided within this dataset, users have access to many opportunities for finding practical insights whose impact reaches far beyond any single statistic, so happy digging!
- Analyzing the trends in combustion energy uses by region across different industries.
- Predicting the potential of transitioning to clean and renewable sources of energy considering the current end-uses and their magnitude based on this data.
- Creating an interactive web map application to visualize multiple industrial sites, including their energy sources and emissions data from this dataset combined with other sources (EPA’s GHGRP, MECS survey, etc)
If you use this dataset in your research, please credit the original authors.

Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/)**
The Canadian Biomarker Integration Network in Depression (CAN-BIND) is a national program of research and learning. From 2013 to 2017, data were collected from 211 participants with major depressive disorder and 112 healthy individuals. The objective of this data-set is to integrate detailed clinical, imaging, and molecular data to predict outcome for patients experiencing a Major Depressive Episode (MDE) and receiving pharmacotherapy reflective of standard practice. The clinical characterization consists of symptom assessment, behavioural dimensions, and environmental factors. The neuroimaging data consist of structural, resting and task-based functional, and diffusion-weighted MRI images, as well as scalp-recorded EEG data. The molecular data currently consist of DNA methylation, inflammatory markers and urine metabolites. Baseline and Phase 1 (Weeks 2-8) data are now available for request.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, which were validated by over 1,400 OSDG Community Platform (OSDG-CP) citizen scientists from over 140 countries, with respect to the Sustainable Development Goals (SDGs).
Dataset Information
In support of the global effort to achieve the Sustainable Development Goals (SDGs), OSDG is realising a series of SDG-labelled text datasets. The OSDG Community Dataset (OSDG-CD) is the direct result of the work of more than 1,400 volunteers from over 130 countries who have contributed to our understanding of SDGs via the OSDG Community Platform (OSDG-CP). The dataset contains tens of thousands of text excerpts (henceforth: texts) which were validated by the Community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches.
📘 The file contains 43,021 (+390) text excerpts and a total of 310,328 (+3,733) assigned labels.
To learn more about the project, please visit the OSDG website and the official GitHub page. Explore a detailed overview of the OSDG methodology in our recent paper "OSDG 2.0: a multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs)".
Source Data
The dataset consists of paragraph-length text excerpts derived from publicly available documents, including reports, policy documents and publication abstracts. A significant number of documents (more than 3,000) originate from UN-related sources such as SDG-Pathfinder and SDG Library. These sources often contain documents that already have SDG labels associated with them. Each text consists of 3 to 6 sentences and is about 90 words long on average.
Methodology
All the texts are evaluated by volunteers on the OSDG-CP. The platform is an ambitious attempt to bring together researchers, subject-matter experts and SDG advocates from all around the world to create a large and accurate source of textual information on the SDGs. The Community volunteers use the platform to participate in labelling exercises where they validate each text's relevance to SDGs based on their background knowledge.
In each exercise, the volunteer is shown a text together with an SDG label associated with it – this usually comes from the source – and asked to either accept or reject the suggested label.
There are 3 types of exercises:

1. Introductory exercise: all volunteers start with this mandatory exercise, which consists of 10 pre-selected texts. Each volunteer must complete it before they can access the 2 other exercise types. Upon completion, the volunteer reviews the exercise by comparing their answers with those of the rest of the Community using aggregated statistics we provide, i.e., the share of those who accepted and rejected the suggested SDG label for each of the 10 texts. This helps the volunteer get a feel for the platform.
2. SDG-specific exercises, where the volunteer validates texts with respect to a single SDG, e.g., SDG 1 No Poverty.
3. All SDGs exercise, where the volunteer validates a random sequence of texts, where each text can have any SDG as its associated label.
After finishing the introductory exercise, the volunteer is free to select either SDG-specific or All SDGs exercises. Each exercise, regardless of its type, consists of 100 texts. Once the exercise is finished, the volunteer can either label more texts or exit the platform. The volunteer can also finish an exercise early; all progress is still saved and recorded.
To ensure quality, each text is validated by up to 9 different volunteers and all texts included in the public release of the data have been validated by at least 3 different volunteers.
It is worth keeping in mind that all exercises present the volunteers with a binary decision problem, i.e., either accept or reject a suggested label. The volunteers are never asked to select one or more SDGs that a certain text might relate to. The rationale behind this set-up is that asking a volunteer to select from 17 SDGs is extremely inefficient. Currently, all texts are validated against only one associated SDG label.
Column Description
doi - Digital Object Identifier of the original document
text_id - unique text identifier
text - text excerpt from the document
sdg - the SDG the text is validated against
labels_negative - the number of volunteers who rejected the suggested SDG label
labels_positive - the number of volunteers who accepted the suggested SDG label
agreement - agreement score based on the formula \( \text{agreement} = \frac{|labels_{positive} - labels_{negative}|}{labels_{positive} + labels_{negative}} \)
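A minimal Python sketch of the agreement computation, written directly from the formula above; the zero-total fallback is an assumption, since unvalidated texts are not described here:

```python
def agreement(labels_positive: int, labels_negative: int) -> float:
    """Agreement score: |pos - neg| / (pos + neg), in [0, 1]."""
    total = labels_positive + labels_negative
    if total == 0:
        return 0.0  # undefined for texts with no votes; 0 is an assumption
    return abs(labels_positive - labels_negative) / total

# Example: 7 accepts, 2 rejects -> agreement = 5/9, roughly 0.56
print(agreement(7, 2))
```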
Further Information
Do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. All queries can be directed to community@osdg.ai.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "additional training dataset" for the DCASE 2024 Challenge Task 2.
The data consists of the normal/anomalous operating sounds of nine types of real/toy machines. Each recording is a single-channel audio clip that includes both a machine's operating sound and environmental noise. The duration of the recordings varies from 6 to 10 seconds. The following nine types of real/toy machines are used in this task:
3DPrinter
AirCompressor
BrushlessMotor
HairDryer
HoveringDrone
RoboticArm
Scanner
ToothBrush
ToyCircuit
Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.
This task is the follow-up from DCASE 2020 Task 2 to DCASE 2023 Task 2. The task this year is to develop an ASD system that meets the following five requirements.
1. Train a model using only normal sound (unsupervised learning scenario): Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.
2. Detect anomalies regardless of domain shifts (domain generalization task): In real-world cases, the operational states of a machine or the environmental noise can change and cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2 and DCASE 2023 Task 2.
3. Train a model for a completely new machine type: For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning. This requirement is the same as in DCASE 2023 Task 2.
4. Train a model using a limited number of machines from its machine type: While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that only a limited number of machines are available for a machine type. In such a case, the system should be able to train models using a few machines from a machine type. This requirement is the same as in DCASE 2023 Task 2.
5. Train a model both with and without attribute information: While additional attribute information can help enhance detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.

The last requirement is newly introduced in DCASE 2024 Task 2.
Definition
We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."
"Machine type" indicates the type of machine, which in the additional training dataset is one of nine: 3D printer, air compressor, brushless motor, hair dryer, hovering drone, robotic arm, document scanner (scanner), toothbrush, and toy circuit.
A section is defined as a subset of the dataset for calculating performance metrics.
The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.
Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.
Dataset
This dataset consists of nine machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on a separate Zenodo page as an "evaluation dataset" for the DCASE 2024 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training and (ii) ten clips of normal sounds in the target domain for training. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute CSV files.
File names and attribute csv files
File names and attribute CSV files provide reference labels for each clip. The given reference labels for each training clip include machine type, section index, normal/anomaly information, and attributes regarding conditions other than normal/anomaly. The machine type is given by the directory name. The section index is given by the respective file name. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by the respective file names. Note that for machine types that have their attribute information hidden, the attribute information in the file names is labeled only as "noAttributes". Attribute CSV files are for easy access to attributes that cause domain shifts. In these files, the file names, the names of parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
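Because each row carries a variable number of (parameter, value) pairs, a plain csv.reader works better than a fixed-schema loader. A minimal sketch, with the filename hypothetical:

```python
import csv

# Hypothetical path; actual attribute CSVs ship alongside each machine type.
with open("attributes_00.csv", newline="") as f:
    for row in csv.reader(f):
        filename, rest = row[0], row[1:]
        # Pair up (domain shift parameter, domain shift value) columns;
        # rows for hidden-attribute machine types have blank columns,
        # which the emptiness check below skips.
        pairs = [(rest[i], rest[i + 1])
                 for i in range(0, len(rest) - 1, 2) if rest[i]]
        print(filename, pairs)
```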
Recording procedure
Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.
Directory structure
/eval_data
Baseline system
The baseline system is available on the GitHub repository. The baseline systems provide a simple entry-level approach that gives reasonable performance on the Task 2 dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Condition of use
This dataset was created jointly by Hitachi, Ltd., NTT Corporation and STMicroelectronics and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
Contact
If there is any problem, please contact us:
Tomoya Nishida, tomoya.nishida.ax@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Noboru Harada, noboru@ieee.org
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
By Health [source]
The Behavioral Risk Factor Surveillance System (BRFSS) offers an expansive collection of data on health-related quality of life (HRQOL) from 1993 to 2010. The dataset comprises a comprehensive survey reflecting the health and well-being of non-institutionalized US adults aged 18 years or older over this period. The data collected can help track and identify unmet population health needs, recognize trends, identify disparities in healthcare, identify determinants of public health, inform decision making and policy development, and evaluate programs within public healthcare services.
The HRQOL surveillance system has developed a compact set of HRQOL measures, such as a summary measure of unhealthy days, which have been validated for population health surveillance purposes and widely implemented in practice since 1993. Within this dataset you can access information such as the year recorded, location abbreviations and descriptions, category and topic overviews, and the questions asked in surveys, along with more detailed information including the types and units of the data values retrieved from respondents, their sample sizes, and the geographical locations involved.
This dataset tracks the Health-Related Quality of Life (HRQOL) from 1993 to 2010 using data from the Behavioral Risk Factor Surveillance System (BRFSS). This dataset includes information on the year, location abbreviation, location description, type and unit of data value, sample size, category and topic of survey questions.
Using this dataset on BRFSS: HRQOL data between 1993-2010 will allow for a variety of analyses related to population health needs. The compact set of HRQOL measures can be used to identify trends in population health needs as well as determine disparities among various locations. Additionally, responses to survey questions can be used to inform decision making and program and policy development in public health initiatives.
- Analyzing trends in HRQOL over the years by location to identify disparities in health outcomes between different populations and develop targeted policy interventions.
- Developing new models for predicting HRQOL indicators at a regional level, and using this information to inform medical practice and public health implementation efforts.
- Using the data to understand differences between states in terms of their HRQOL scores and establish best practices for healthcare provision based on that understanding, including areas such as access to care, preventative care services availability, etc.
If you use this dataset in your research, please credit the original authors.

Data Source
See the dataset description for more information.
File: rows.csv

| Column name | Description |
|:-------------------------------|:----------------------------------------------------------|
| Year | Year of survey. (Integer) |
| LocationAbbr | Abbreviation of location. (String) |
| LocationDesc | Description of location. (String) |
| Category | Category of survey. (String) |
| Topic | Topic of survey. (String) |
| Question | Question asked in survey. (String) |
| DataSource | Source of data. (String) |
| Data_Value_Unit | Unit of data value. (String) |
| Data_Value_Type | Type of data value. (String) |
| Data_Value_Footnote_Symbol | Footnote symbol for data value. (String) |
| Data_Value_Std_Err | Standard error of the data value. (Float) |
| Sample_Size | Sample size used in sample. (Integer) |
| Break_Out | Break out categories used. (String) |
| Break_Out_Category | Type of break out assessed. (String) |
| GeoLocation | ... |
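A minimal pandas sketch for exploring rows.csv along the columns documented above; the file path is as named here, and column availability should be checked against the actual download:

```python
import pandas as pd

df = pd.read_csv("rows.csv")

# Example: mean sample size per survey year, one value per Year.
by_year = df.groupby("Year")["Sample_Size"].mean()
print(by_year.head())
```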
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BIQ2021 dataset is a large-scale blind image quality assessment database, consisting of 12,000 authentically distorted images. Each image in the dataset has been quality rated by 30 observers, resulting in a total of 360,000 quality ratings. This dataset was created in a controlled laboratory environment, ensuring consistent and reliable subjective scoring. Moreover, the dataset provides a train/test split with which researchers can report their results for benchmarking. The dataset is openly available and serves as a valuable resource for evaluating and benchmarking image quality assessment algorithms. The paper providing a detailed description of the dataset and its creation process is openly accessible at the following link: BIQ2021: A large-scale blind image quality assessment database.
The paper can be cited as:
Ahmed, N., & Asif, S. (2022). BIQ2021: a large-scale blind image quality assessment database. Journal of Electronic Imaging, 31(5), 053010.
Images: the dataset contains a folder named images with the 12,000 images used for training and testing. Train (Images and MOS): a CSV file containing the randomly partitioned training set of 10,000 images with their corresponding MOS. Test (Images and MOS): a CSV file containing the randomly partitioned test set of 2,000 images with their corresponding MOS.
Benchmarking: To compare the performance of a predictive model trained on this dataset, Pearson and Spearman's correlations can be computed and compared with existing approaches and the CNN models listed in the following GitHub repository: https://github.com/nisarahmedrana/BIQ2021
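The two benchmarking correlations can be computed with scipy; a minimal sketch assuming the predicted and ground-truth MOS arrays are already aligned (the values below are placeholders, not real scores):

```python
from scipy.stats import pearsonr, spearmanr

# mos_true: subjective MOS from the test CSV; mos_pred: model outputs.
mos_true = [3.1, 4.2, 2.5, 3.8]   # placeholder values
mos_pred = [3.0, 4.0, 2.9, 3.6]   # placeholder values

plcc, _ = pearsonr(mos_true, mos_pred)    # linear correlation
srocc, _ = spearmanr(mos_true, mos_pred)  # rank-order correlation
print(f"PLCC={plcc:.3f}, SROCC={srocc:.3f}")
```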
This Object Detection dataset is a collection of ancient coin images from three different sources: the Corpus Nummorum (CN) project, the Münzkabinett Berlin and the Bibliothèque nationale de France, Département des Monnaies, médailles et antiques. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, Troad and Mysia. This is a selection of the coins published on the CN portal (due to copyrights).
This dataset contains 506 different classes with about 179,000 coin images (approx. 29,000 unique coins). The classes come from four different categories: persons, objects, animals and plants. The coin images were assigned to the classes using our NLP pipeline. For this purpose, our Named Entity Recognition and Relation Extraction were performed on every coin's description (separated into obverse and reverse). Each coin image assigned to this description was then copied to the folder of the predicted classes. A coin image can therefore also be assigned to several classes. The file name contains both the coin id and the coin type of the CN database. Whether the image belongs to a coin obverse or reverse can be recognized by the suffix obv or rev. A "sources" CSV file holds the sources for every image. Due to copyrights, the image size is limited to 299×299 pixels. However, this should be sufficient for most ML approaches.
Due to the numerically different occurrences of the individual entities, the data set is not balanced. In addition, a class can contain very different representations of the same entity. Therefore, some classes can be difficult to train. Unfortunately, we cannot provide any annotations for the data set.
During the summer semester 2024, we held the "Data Challenge" event at our Department of Computer Science at the Goethe University. Our students could choose between this Object Detection dataset and a Natural Language dataset as their challenge. One team opted for the Object Detection challenge. We gave them this dataset with the task of trying out their own ideas on it. Here are their results:
Multilabel Classification as Backbone for Object Detection
Now we would like to invite you to try out your own ideas and models on our coin data.
If you have any questions or suggestions, please, feel free to contact us.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
Given a piece of code, generate a description of what it does. Given a description, generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May of 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
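The dataset can be pulled directly from the Hugging Face Hub with the datasets library; a minimal sketch, assuming the default "train" split since no splits are defined:

```python
from datasets import load_dataset

ds = load_dataset("Nan-Do/instructional_code-search-net-java")
print(ds)              # inspect the available splits and features
print(ds["train"][0])  # first instruction/code example ("train" assumed)
```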
This dataset contains quasi-dynamically downscaled climate data from 8 global climate models from the CMIP5 archive. The data were downscaled using the Intermediate Complexity Atmospheric Research (ICAR; Gutmann et al., 2016) model after bias correcting the GCM three-dimensional atmospheric data to match the ERA-Interim reanalysis climatology (Dee et al., 2011). ICAR was configured with the Thompson microphysics, a simple PBL parameterization based on YSU, the RRTMG longwave radiation and an empirical shortwave radiation scheme, the Noah-MP land surface model, the WRF-Lake model, the BMJ cumulus parameterization, and used linear mountain wave theory for upper-level wind structures combined with an iterative scheme to remove vertical motion at the model top. The output from ICAR was adjusted to match the climatological statistics of the Livneh et al. (2015) observational dataset.
https://data.4tu.nl/info/fileadmin/user_upload/Documenten/4TU.ResearchData_Restricted_Data_2022.pdf
This file contains raw data for cameras and wearables of the ConfLab dataset.
./cameras
contains the overhead video recordings from 9 cameras (cam2-10) in MP4 files.
These cameras cover the whole interaction floor, with camera 2 capturing the
bottom of the scene layout and camera 10 capturing the top.
Note that cam5 ran out of battery before the other cameras and thus its
recordings are cut short. However, cam4 and cam6 overlap significantly with
cam5, so any needed information can be reconstructed.
Note that the annotations are made and provided in 2-minute segments.
The annotated portions of the video include the last 3min38sec of x2xxx.MP4
video files, and the first 12 min of x3xxx.MP4 files for cameras (2,4,6,8,10),
with "x" being the placeholder character in the mp4 file names. If one wishes
to separate the video into 2 min segments as we did, the "video-splitting.sh"
script is provided.
./camera-calibration contains the camera intrinsic files obtained from
https://github.com/idiap/multicamera-calibration. Camera extrinsic parameters can
be calculated using the existing intrinsic parameters and the instructions in the
multicamera-calibration repo. The reference coordinates in the image are provided
by the crosses marked on the floor, which are visible in the video recordings.
The crosses are 1 m apart (= 100 cm).
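Given the provided intrinsics and a few floor crosses (1 m grid) identified in a frame, the extrinsics can be estimated with a standard PnP solve. A minimal OpenCV sketch; the point correspondences and the intrinsic matrix below are placeholders, not real measurements:

```python
import numpy as np
import cv2

# Floor crosses on a 1 m grid (world frame, z = 0); placeholder choices.
object_points = np.array(
    [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=np.float32
)
# Corresponding pixel locations clicked in one frame; placeholder values.
image_points = np.array(
    [[420, 650], [780, 655], [415, 320], [775, 318]], dtype=np.float32
)
# Intrinsics from the calibration files (placeholder matrix here).
K = np.array([[1000, 0, 960], [0, 1000, 540], [0, 0, 1]], dtype=np.float32)
dist = np.zeros(5)  # replace with the distortion coefficients from the files

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
print(ok, rvec.ravel(), tvec.ravel())  # rotation and translation of the camera
```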
./wearables
subdirectory includes the IMU, proximity and audio data from each
participant at the ConfLab event (48 in total). In the directory numbered
by participant ID, the following data are included:
1. raw audio file
2. proximity (bluetooth) pings (RSSI) file (raw and csv) and a visualization
3. Tri-axial accelerometer data (raw and csv) and a visualization
4. Tri-axial gyroscope data (raw and csv) and a visualization
5. Tri-axial magnetometer data (raw and csv) and a visualization
6. Game rotation vector (raw and csv), recorded in quaternions.
All files are timestamped.
The sampling frequencies are:
- audio: 1250 Hz
- rest: around 50 Hz. However, the sample rate is not fixed,
so the timestamps should be used instead.
For rotation, the game rotation vector's output frequency is limited by the
actual sampling frequency of the magnetometer. For more information, please refer to
https://invensense.tdk.com/wp-content/uploads/2016/06/DS-000189-ICM-20948-v1.3.pdf
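Since the wearable sample rate is not fixed, analyses that need a uniform rate should resample using the timestamps rather than assume 50 Hz. A minimal pandas sketch; the filename, the millisecond timestamp unit, and the column names are all assumptions to check against the actual CSVs:

```python
import pandas as pd

# Hypothetical filename and columns: 'timestamp' in ms plus x/y/z axes.
acc = pd.read_csv("accel.csv")
acc.index = pd.to_datetime(acc["timestamp"], unit="ms")

# Resample onto a uniform 50 Hz grid (20 ms) and interpolate in time.
uniform = acc[["x", "y", "z"]].resample("20ms").mean().interpolate(method="time")
print(uniform.head())
```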
Audio files in this folder are in raw binary form. The following can be used to convert
them to WAV files (1250 Hz); note that ffmpeg requires an output path:
ffmpeg -f s16le -ar 1250 -ac 1 -i /path/to/audio/file /path/to/output.wav
Synchronization of camera and wearables data
Raw videos contain timecode information which matches the timestamps of the data in
the "wearables" folder. The starting timecode of a video can be read as:
ffprobe -hide_banner -show_streams -i /path/to/video
./audio
./sync: contains WAV files for each subject
./sync_files: auxiliary csv files used to sync the audio. Can be used to improve the synchronization.
The code used for syncing the audio can be found here:
https://github.com/TUDelft-SPC-Lab/conflab/tree/master/preprocessing/audio
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Ventura by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Ventura across both sexes and to determine which sex constitutes the majority.
Key observations
There is a slight male majority, with 50.28% of the total population being male. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Scope of gender:
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data on biological sex, not gender. Respondents are asked to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Ventura Population by Race & Ethnicity. You can refer to it here.
https://brightdata.com/license
Use our Instagram dataset (public data) to extract business and non-business information from complete public profiles and filter by hashtags, followers, account type, or engagement score. Depending on your needs, you may purchase the entire dataset or a customized subset. Popular use cases include sentiment analysis, brand monitoring, influencer marketing, and more. The dataset includes all major data points: # of followers, verified status, account type (business / non-business), links, posts, comments, location, engagement score, hashtags, and much more.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation
Increasing heat stress due to climate change poses significant risks to human health and can lead to widespread social and economic consequences. Evaluating these impacts requires reliable datasets of heat stress projections.
Data Record
We present a global dataset projecting future dry-bulb, wet-bulb, and wet-bulb globe temperatures under 1-4°C global warming scenarios (at 0.5°C intervals) relative to the preindustrial era, using outputs from 16 CMIP6 global climate models (GCMs) (Table 1). All variables were retrieved from the historical and SSP585 scenarios, which were selected to maximize the warming signal.
The dataset was bias-corrected against ERA5 reanalysis by incorporating the GCM-simulated climate change signal onto the ERA5 baseline (1950-1976) at a 3-hourly frequency. It therefore includes a 27-year sample for each GCM under each warming target.
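A schematic delta-change sketch of the bias correction described above: the GCM-simulated climate change signal is imposed on the ERA5 baseline. The arrays below are placeholder values, and the actual method is applied at 3-hourly frequency over the full 1950-1976 baseline, so this is only an illustration of the arithmetic:

```python
import numpy as np

era5_baseline = np.array([290.1, 291.4, 289.8])  # placeholder 3-hourly K
gcm_hist      = np.array([288.0, 289.5, 287.9])  # placeholder GCM baseline
gcm_warm      = np.array([289.6, 291.3, 289.5])  # placeholder GCM at +2°C

# Climate change signal from the GCM, added onto the ERA5 baseline.
corrected = era5_baseline + (gcm_warm - gcm_hist)
print(corrected)
```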
The data is provided at a fine spatial resolution of 0.25° x 0.25° and a temporal resolution of 3 hours, and is stored in a self-describing NetCDF format. Filenames follow the pattern "VAR_bias_corrected_3hr_GCM_XC_yyyy.nc", where:
"VAR" represents the variable (Ta, Tw, WBGT for dry-bulb, wet-bulb, and wet-bulb globe temperature, respectively),
"GCM" denotes the CMIP6 GCM name,
"X" indicates the warming target compared to the preindustrial period,
"yyyy" represents the year index (0001-0027) of the 27-year sample
Table 1. CMIP6 GCMs used for generating the dataset for Ta, Tw and WBGT.

| GCM | Realization | GCM grid spacing | Ta | Tw | WBGT |
|:--|:--|:--|:--:|:--:|:--:|
| ACCESS-CM2 | r1i1p1f1 | 1.25° x 1.875° | ✓ | ✓ | ✓ |
| BCC-CSM2-MR | r1i1p1f1 | 1.1° x 1.125° | ✓ | ✓ | ✓ |
| CanESM5 | r1i1p2f1 | 2.8° x 2.8° | ✓ | ✓ | ✓ |
| CMCC-CM2-SR5 | r1i1p1f1 | 0.94° x 1.25° | ✓ | ✓ | ✓ |
| CMCC-ESM2 | r1i1p1f1 | 0.94° x 1.25° | ✓ | ✓ | ✓ |
| CNRM-CM6-1 | r1i1p1f2 | 1.4° x 1.4° | ✓ | ✓ |  |
| EC-Earth3 | r1i1p1f1 | 0.7° x 0.7° | ✓ | ✓ | ✓ |
| GFDL-ESM4 | r1i1p1f1 | 1.0° x 1.25° | ✓ | ✓ | ✓ |
| HadGEM3-GC31-LL | r1i1p1f3 | 1.25° x 1.875° | ✓ | ✓ | ✓ |
| HadGEM3-GC31-MM | r1i1p1f3 | 0.55° x 0.83° | ✓ | ✓ | ✓ |
| KACE-1-0-G | r1i1p1f1 | 1.25° x 1.875° | ✓ | ✓ | ✓ |
| KIOST-ESM | r1i1p1f1 | 1.9° x 1.9° | ✓ | ✓ | ✓ |
| MIROC-ES2L | r1i1p1f2 | 2.8° x 2.8° | ✓ | ✓ | ✓ |
| MIROC6 | r1i1p1f1 | 1.4° x 1.4° | ✓ | ✓ | ✓ |
| MPI-ESM1-2-HR | r1i1p1f1 | 0.93° x 0.93° | ✓ | ✓ | ✓ |
| MPI-ESM1-2-LR | r1i1p1f1 | 1.85° x 1.875° | ✓ | ✓ | ✓ |
Data Access
An inventory of the dataset is available in this repository. The complete dataset, approximately 57 TB in size, is freely accessible via Purdue Fortress' long-term archive through Globus at Globus Link. After clicking the link, users may be prompted to log in with a Purdue institutional Globus account. You can switch to your institutional account, or log in via a personal Globus ID, Gmail, GitHub handle, or ORCID ID. Alternatively, the dataset can be accessed by searching for the universally unique identifier (UUID): "6538f53a-1ea7-4c13-a0cf-10478190b901" in Globus.
Dataset Validation
We validate the bias-correction method and show that it significantly enhances the GCMs' accuracy in reproducing both the annual average and the full range of quantiles for all metrics within an ERA5 reference climate state. This dataset is expected to support future research on projected changes in mean and extreme heat stress and the assessment of related health and socio-economic impacts.
For a detailed introduction to the dataset and its validation, please refer to our data descriptor currently under review at Scientific Data. We will update this information upon publication.
https://www.pioneerdatahub.co.uk/data/data-request-process/
Background: Acute compartment syndrome (ACS) is an emergency orthopaedic condition wherein a rapid rise in compartmental pressure compromises blood perfusion to the tissues, leading to ischaemia and muscle necrosis. This serious condition is often misdiagnosed or associated with significant diagnostic delay, and can lead to limb amputations and death.
The most common causes of ACS are high-impact trauma, especially fractures of the lower limbs, which account for 40% of ACS cases. ACS is a challenge to diagnose and treat effectively, with differing clinical thresholds being utilised, which can result in unnecessary fasciotomy. The highly granular synthetic data for over 900 patients with ACS provide the following key parameters to support critical research into this condition:
PIONEER geography: The West Midlands (WM) has a population of 5.9 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & an expanded 250 ITU bed capacity during COVID. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.
Scope: Enabling data-driven research and machine learning models towards improving the diagnosis of Acute compartment syndrome. Longitudinal & individually linked, so that the preceding & subsequent health journey can be mapped & healthcare utilisation prior to & after admission understood. The dataset includes highly granular patient demographics, physiological parameters, muscle biomarkers, blood biomarkers and co-morbidities taken from ICD-10 & SNOMED-CT codes. Serial, structured data pertaining to process of care (timings and admissions), presenting complaint, lab analysis results (eGFR, troponin, CRP, INR, ABG glucose), systolic and diastolic blood pressures, procedures and surgery details.
Available supplementary data: ACS cohort, Matched controls; ambulance, OMOP data. Available supplementary support: Analytics, Model build, validation & refinement; A.I.; Data partner support for ETL (extract, transform & load) process, Clinical expertise, Patient & end-user access, Purchaser access, Regulatory requirements, Data-driven trials, “fast screen” services.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is "You can never be too rich: essential investing advice you cannot afford to overlook". It features 7 columns including author, publication date, language, and book publisher.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
NetConfEval: Can LLMs Facilitate Network Configuration?
What is it?
We present a set of benchmarks (NetConfEval) to examine the effectiveness of different models in facilitating and automating network configuration, described in our paper "NetConfEval: Can LLMs Facilitate Network Configuration?". 📜 Paper - GitHub Repository. This repository contains pre-generated datasets for each of the benchmark tasks, so that they can be used independently from our testing environment.… See the full description on the dataset page: https://huggingface.co/datasets/NetConfEval/NetConfEval.