Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public datasets organized for machine learning or artificial intelligence usage. The following datasets can be used:
Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018
This repository dataset of machine usage includes the following columns:
+------------------+--------+-------+-----------------------------------------------+
| Field            | Type   | Label | Comment                                       |
+------------------+--------+-------+-----------------------------------------------+
| cpu_util_percent | bigint |       | [0, 100]                                      |
| mem_util_percent | bigint |       | [0, 100]                                      |
| net_in           | double |       | normalized incoming network traffic, [0, 100] |
| net_out          | double |       | normalized outgoing network traffic, [0, 100] |
| disk_io_percent  | double |       | [0, 100]; abnormal values are -1 or 101       |
+------------------+--------+-------+-----------------------------------------------+
Three sampled datasets are provided: the average value of each column grouped every 10 seconds (the original resolution), plus versions downsampled to 30 seconds and 300 seconds. Every column reports the average utilization of the whole data center.
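As a minimal sketch of the downsampling described above, the following groups a 10-second trace into 300-second averages with pandas. The column names match the table; the synthetic values are illustrative only, not real trace data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 360  # one hour of 10-second samples
idx = pd.to_datetime(np.arange(n) * 10, unit="s")
trace = pd.DataFrame(
    {
        "cpu_util_percent": rng.integers(0, 101, n),
        "mem_util_percent": rng.integers(0, 101, n),
    },
    index=idx,
)

# Group every 300 seconds and average, mirroring the 300-second files.
downsampled = trace.resample("300s").mean()
```

The same call with `"30s"` would produce the 30-second variant.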
Processed from the original dataset and queried using BigQuery. More information is available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/
This repository dataset of instance usage includes the following columns:
+----------------------------+--------+-------+---------+
| Field                      | Type   | Label | Comment |
+----------------------------+--------+-------+---------+
| avg_cpu                    | double |       | [0, 1]  |
| avg_mem                    | double |       | [0, 1]  |
| avg_assigned_mem           | double |       | [0, 1]  |
| avg_cycles_per_instruction | double |       | [0, _]  |
+----------------------------+--------+-------+---------+
One sampled dataset is provided: the average value of each column grouped every 300 seconds (the original resolution). Every column reports the average utilization of the whole data center.
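A quick sanity check on the value ranges in the table above can be sketched as follows; the in-memory CSV stands in for one of the repository files, so its contents are assumptions, while the range rules come from the table.

```python
import io
import pandas as pd

# Stand-in for a repository file with the columns listed above.
csv = io.StringIO(
    "avg_cpu,avg_mem,avg_assigned_mem,avg_cycles_per_instruction\n"
    "0.42,0.55,0.60,1.8\n"
    "0.38,0.52,0.61,2.1\n"
)
df = pd.read_csv(csv)

# avg_cpu, avg_mem, and avg_assigned_mem are normalized to [0, 1];
# avg_cycles_per_instruction is only bounded below by 0.
in_range = (
    df[["avg_cpu", "avg_mem", "avg_assigned_mem"]].le(1).all().all()
    and df.ge(0).all().all()
)
```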
Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md
This repository dataset of instance usage includes the following columns:
+--------------+--------+-------+---------+
| Field        | Type   | Label | Comment |
+--------------+--------+-------+---------+
| cpu_usage    | double |       | [0, _]  |
| assigned_mem | double |       | [0, _]  |
+--------------+--------+-------+---------+
One sampled dataset is provided: the sum of each column grouped every 300 seconds (the original resolution). To compute cpu_usage, we used the core-count usage of each virtual machine. Every column reports the total consumption of all virtual machines in the data center. Each file comes in two versions: one including a timestamp (from 0 to 2591700, in 300-second steps) and one without the timestamp.
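The per-timestep summing described above can be sketched as a groupby over per-VM readings. The input rows here are hypothetical; only the aggregation logic mirrors how the repository files were built.

```python
import pandas as pd

# Hypothetical per-VM cpu_usage readings at 300-second timestamps.
readings = pd.DataFrame(
    {
        "timestamp": [0, 0, 300, 300, 600],
        "vm_id": ["a", "b", "a", "b", "a"],
        "cpu_usage": [1.5, 2.0, 1.0, 2.5, 3.0],  # cores used per VM
    }
)

# Total consumption of all virtual machines per 300-second timestep.
totals = readings.groupby("timestamp", as_index=False)["cpu_usage"].sum()
```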
Access Level
The dataset is freely accessible under an Open Access model. There are no restrictions for reuse, and it is licensed under [Creative Commons Attribution 4.0 (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dementia is a progressive condition that affects cognitive and functional abilities. There is a need for reliable and continuous health monitoring of People Living with Dementia (PLWD) to improve their quality of life and support their independent living. Healthcare services often focus on addressing and treating already established health conditions that affect PLWD. Managing these conditions continuously can inform better decision-making earlier, enabling higher-quality care management for PLWD. The Technology Integrated Health Management (TIHM) project developed a new digital platform to routinely collect longitudinal observational and measurement data within the home and apply machine learning and analytical models for the detection and prediction of adverse health events affecting the well-being of PLWD. This work describes the TIHM dataset collected during the second phase (i.e., feasibility study) of the TIHM project. The data were collected from the homes of 56 PLWD and comprise events and clinical observations (daily activity, physiological monitoring, and labels for health-related conditions). The study recorded an average of 50 days of data per participant, totalling 2803 days.
We have provided raw data and guidelines on how to access, visualise, manipulate and predict health-related events within the dataset, available on the GitHub repository. The Jupyter Notebooks have been developed using Python 3.9.
The dataset is provided for research and patient benefit purposes.
Please acknowledge the Surrey and Borders Partnership NHS Foundation Trust in any publication or use of this dataset.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions (MIMII DUE). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, pumps, slide rails, and valves. The data for each machine type includes six subsets called "sections", and each section roughly corresponds to a single product. Each section consists of data from two domains, called the source domain and the target domain, with different conditions such as operating speed and environmental noise. This dataset is a subset of the dataset for DCASE 2021 Challenge Task 2, so it is identical to the data included in the development dataset and the additional training dataset. For more information, please see this paper and the pages of the development dataset and the task description for DCASE 2021 Challenge Task 2.
Baseline system
Two simple baseline systems are available in the GitHub repositories [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives reasonable performance on the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite the following paper:
Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi, "MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions," arXiv preprint arXiv:2105.02702, 2021. [URL]
Feedback
If there is any problem, please contact us:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset features three gridded population datasets of Germany on a 10m grid. The units are people per grid cell.
Datasets
DE_POP_VOLADJ16: This dataset was produced by disaggregating national census counts to 10m grid cells based on a weighted dasymetric mapping approach. A building density, building height and building type dataset were used as underlying covariates, with an adjusted volume for multi-family residential buildings.
DE_POP_TDBP: This dataset is considered the best product, based on a dasymetric mapping approach that disaggregated municipal census counts to 10m grid cells using the same three underlying covariate layers.
DE_POP_BU: This dataset is based on a bottom-up gridded population estimate. A building density, building height and building type layer were used to compute a living floor area dataset in a 10m grid. Using federal statistics on the average living floor area per capita, this bottom-up estimate was created.
Please refer to the related publication for details.
Temporal extent
The building density layer is based on Sentinel-2 time series data from 2018 and Sentinel-1 time series data from 2017 (doi: http://doi.org/10.1594/PANGAEA.920894)
The building height layer is representative for ca. 2015 (doi: 10.5281/zenodo.4066295)
The building types layer is based on Sentinel-2 time series data from 2018 and Sentinel-1 time series data from 2017 (doi: 10.5281/zenodo.4601219)
The underlying census data is from 2018.
Data format
The data come in tiles of 30 x 30 km (see shapefile). The projection is EPSG:3035. The images are compressed GeoTIFF files (*.tif). A mosaic in GDAL Virtual format (*.vrt) is also provided, which can readily be opened in most Geographic Information Systems.
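Given the 30 x 30 km tiling in EPSG:3035 described above, a small helper can map a projected coordinate (in metres) to the origin of the tile containing it. This is a sketch under the assumption that the tile grid is aligned to the coordinate origin; the authoritative tile scheme is the provided shapefile.

```python
TILE_SIZE = 30_000  # tile edge length in metres (30 km)

def tile_origin(x: float, y: float) -> tuple[int, int]:
    """Lower-left corner of the 30 km tile containing projected point (x, y)."""
    return (int(x // TILE_SIZE) * TILE_SIZE, int(y // TILE_SIZE) * TILE_SIZE)

# Example EPSG:3035 coordinate somewhere in Germany (illustrative values).
origin = tile_origin(4_321_000, 3_105_500)
```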
Further information
For further information, please see the publication or contact Franz Schug (franz.schug@geo.hu-berlin.de).
A web-visualization of this dataset is available here.
Publication
Schug, F., Frantz, D., van der Linden, S., & Hostert, P. (2021). Gridded population mapping for Germany based on building density, height and type from Earth Observation data using census disaggregation and bottom-up estimates. PLOS ONE. DOI: 10.1371/journal.pone.0249044
Acknowledgements
Census data were provided by the German Federal Statistical Offices.
Funding
This dataset was produced with funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (MAT_STOCKS, grant agreement No 741950).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated from four types of industrial machines, i.e., valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contain normal sounds (from 5000 to 10000 seconds) and anomalous sounds (about 1000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). Also, the background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded by an eight-channel microphone array at a 16 kHz sampling rate with 16 bits per sample. The MIMII dataset serves as a benchmark for sound-based machine fault diagnosis. Users can test performance on specific tasks, e.g., unsupervised anomaly detection, transfer learning, and noise robustness. The details of the dataset are described in [1][2].
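The recording format described above (eight channels, 16 kHz, 16 bits per sample) can be written and verified with the standard-library wave module. The file name and silent payload below are placeholders, not real MIMII data; only the format parameters come from the description.

```python
import wave

path = "mimii_format_check.wav"
with wave.open(path, "wb") as w:
    w.setnchannels(8)      # eight-channel microphone array
    w.setsampwidth(2)      # 16 bits per sample
    w.setframerate(16000)  # 16 kHz sampling rate
    w.writeframes(b"\x00" * 8 * 2 * 16000)  # one second of silence

with wave.open(path, "rb") as r:
    params = (r.getnchannels(), r.getsampwidth(),
              r.getframerate(), r.getnframes())
```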
This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/
*1: This version "public 1.0" contains four models (model ID 00, 02, 04, and 06). The remaining three models will be released in a future edition.
[1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.
[2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description "NCT-CRC-HE-100K"
Ethics statement "NCT-CRC-HE-100K"
All experiments were conducted in accordance with the Declaration of Helsinki, the International Ethical Guidelines for Biomedical Research Involving Human Subjects (CIOMS), the Belmont Report and the U.S. Common Rule. Anonymized archival tissue samples were retrieved from the tissue bank of the National Center for Tumor diseases (NCT, Heidelberg, Germany) in accordance with the regulations of the tissue bank and the approval of the ethics committee of Heidelberg University (tissue bank decision numbers 2152 and 2154, granted to Niels Halama and Jakob Nikolas Kather; informed consent was obtained from all patients as part of the NCT tissue bank protocol, ethics board approval S-207/2005, renewed on 20 Dec 2017). Another set of tissue samples was provided by the pathology archive at UMM (University Medical Center Mannheim, Heidelberg University, Mannheim, Germany) after approval by the institutional ethics board (Ethics Board II at University Medical Center Mannheim, decision number 2017-806R-MA, granted to Alexander Marx and waiving the need for informed consent for this retrospective and fully anonymized analysis of archival samples).
Data set "CRC-VAL-HE-7K"
This is a set of 7180 image patches from N=50 patients with colorectal adenocarcinoma (no overlap with patients in NCT-CRC-HE-100K). It can be used as a validation set for models trained on the larger data set. As in the larger data set, images are 224x224 px at 0.5 MPP. All tissue samples were provided by the NCT tissue bank; see above for further details and the ethics statement.
Data set "NCT-CRC-HE-100K-NONORM"
This is a slightly different version of the "NCT-CRC-HE-100K" image set: This set contains 100,000 images in 9 tissue classes at 0.5 MPP and was created from the same raw data as "NCT-CRC-HE-100K". However, no color normalization was applied to these images. Consequently, staining intensity and color vary slightly between the images. Please note that although this image set was created from the same data as "NCT-CRC-HE-100K", the image regions are not completely identical because the selection of non-overlapping tiles from raw images was a stochastic process.
General comments
Please note that the classes are only roughly balanced. Classifiers should never be evaluated based on accuracy in the full set alone. Also, if a high risk of training bias is expected, balancing the number of cases per class is recommended.
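Since plain accuracy is discouraged above on roughly balanced classes, one common alternative is macro-averaged recall (balanced accuracy), which weights each tissue class equally. The sketch below uses made-up labels purely for illustration; the class names are examples, not model output.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally to the score."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += t == p
    return sum(correct[c] / total[c] for c in total) / len(total)

# Illustrative labels: recall is 1.0 for "TUM" and 0.25 for "STR".
score = balanced_accuracy(["TUM", "TUM", "STR", "STR", "STR", "STR"],
                          ["TUM", "TUM", "STR", "TUM", "TUM", "TUM"])
```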