6 datasets found
  1. DataCenter-Traces-Datasets

    • zenodo.org
    bin, csv
    Updated Dec 28, 2024
    Cite
    Alejandro Fernández-Montes; Damián Fernández Cerero (2024). DataCenter-Traces-Datasets [Dataset]. http://doi.org/10.5281/zenodo.14564935
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Dec 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alejandro Fernández-Montes; Damián Fernández Cerero
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 17, 2022
    Description

    Public datasets organized for machine learning and artificial intelligence usage. The following datasets are included:

    Alibaba 2018 machine usage

    Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018

    The machine-usage dataset in this repository includes the following columns:

    +-------------------+--------+-------+------------------------------------------------+
    | Field             | Type   | Label | Comment                                        |
    +-------------------+--------+-------+------------------------------------------------+
    | cpu_util_percent  | bigint |       | [0, 100]                                       |
    | mem_util_percent  | bigint |       | [0, 100]                                       |
    | net_in            | double |       | normalized incoming network traffic, [0, 100]  |
    | net_out           | double |       | normalized outgoing network traffic, [0, 100]  |
    | disk_io_percent   | double |       | [0, 100]; abnormal values are -1 or 101        |
    +-------------------+--------+-------+------------------------------------------------+

    Three sampled datasets are provided: the average value of each column grouped every 10 seconds (the original resolution), plus versions downsampled to 30 seconds and 300 seconds. Every column reflects the average utilization of the whole data center.
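    As a minimal sketch of how the 30 s and 300 s variants relate to the 10 s series (the file names and the use of pandas are assumptions, not part of the repository's documentation):

```python
import pandas as pd

# Hypothetical file name for the 10-second Alibaba 2018 trace.
df = pd.read_csv("alibaba2018_machine_usage_10s.csv")

# Index the rows by elapsed time, assuming one row per 10-second step.
df.index = pd.to_timedelta(df.index * 10, unit="s")

# Downsample to 300-second means, mirroring the 300 s variant above.
df_300s = df.resample("300s").mean()
df_300s.to_csv("alibaba2018_machine_usage_300s.csv", index=False)
```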

    Google 2019 instance usage

    Processed from the original dataset and queried using BigQuery. More information available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/

    The instance-usage dataset in this repository includes the following columns:

    +-----------------------------+--------+-------+---------+
    | Field                       | Type   | Label | Comment |
    +-----------------------------+--------+-------+---------+
    | avg_cpu                     | double |       | [0, 1]  |
    | avg_mem                     | double |       | [0, 1]  |
    | avg_assigned_mem            | double |       | [0, 1]  |
    | avg_cycles_per_instruction  | double |       | [0, _]  |
    +-----------------------------+--------+-------+---------+

    One sampled dataset is provided: the average value of each column grouped every 300 seconds (the original resolution). Every column reflects the average utilization of the whole data center.
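    A quick sanity check of those documented ranges, as a hedged sketch (the CSV file name is an assumption):

```python
import pandas as pd

# Hypothetical file name for the 300-second Google 2019 trace.
df = pd.read_csv("google2019_instance_usage_300s.csv")

# avg_cpu, avg_mem and avg_assigned_mem are documented as [0, 1];
# avg_cycles_per_instruction is only bounded below.
for col in ["avg_cpu", "avg_mem", "avg_assigned_mem"]:
    assert df[col].between(0, 1).all(), f"{col} outside [0, 1]"
assert (df["avg_cycles_per_instruction"] >= 0).all()
```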


    Azure v2 virtual machine workload

    Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md

    The instance-usage dataset in this repository includes the following columns:

    +---------------+--------+-------+---------+
    | Field         | Type   | Label | Comment |
    +---------------+--------+-------+---------+
    | cpu_usage     | double |       | [0, _]  |
    | assigned_mem  | double |       | [0, _]  |
    +---------------+--------+-------+---------+

    One sampled dataset is provided: the sum of each column grouped every 300 seconds (the original resolution). CPU usage was computed from the core_count usage of each virtual machine. Every column reflects the total consumption of all virtual machines in the data center. Each file comes in two versions: one including a timestamp column (from 0 to 2591700, in 300-second steps) and one without a timestamp.
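    A minimal sketch of attaching the documented timestamp column to the version without one (the file name is an assumption; the 300-second step and 2591700 endpoint are as described above):

```python
import numpy as np
import pandas as pd

# Hypothetical file name for the version without a timestamp column.
df = pd.read_csv("azure_v2_vm_workload_300s_no_timestamp.csv")

# Reconstruct the documented timestamps: 0 to 2591700 in 300 s steps.
df.insert(0, "timestamp", np.arange(len(df)) * 300)
assert df["timestamp"].iloc[-1] <= 2591700
```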

    Access Level
    The dataset is freely accessible under an Open Access model. There are no restrictions for reuse, and it is licensed under [Creative Commons Attribution 4.0 (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

  2. Data from: TIHM: An open dataset for remote healthcare monitoring in...

    • zenodo.org
    zip
    Updated Aug 25, 2023
    Cite
    Francesca Palermo; Yu Chen; Alexander Capstick; Nan Fletcher-Loyd; Chloe Walsh; Samaneh Kouchaki; Jessica True; Olga Balazikova; Eyal Soreq; Gregory Scott; Helen Rostill; Ramin Nilforooshan; Payam Barnaghi (2023). TIHM: An open dataset for remote healthcare monitoring in dementia [Dataset]. http://doi.org/10.5281/zenodo.7622128
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesca Palermo; Yu Chen; Alexander Capstick; Nan Fletcher-Loyd; Chloe Walsh; Samaneh Kouchaki; Jessica True; Olga Balazikova; Eyal Soreq; Gregory Scott; Helen Rostill; Ramin Nilforooshan; Payam Barnaghi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dementia is a progressive condition that affects cognitive and functional abilities. There is a need for reliable and continuous health monitoring of People Living with Dementia (PLWD) to improve their quality of life and support their independent living. Healthcare services often focus on addressing and treating already established health conditions that affect PLWD. Monitoring these conditions continuously can inform earlier decision-making and higher-quality care management for PLWD. The Technology Integrated Health Management (TIHM) project developed a new digital platform to routinely collect longitudinal observational and measurement data within the home and apply machine learning and analytical models for the detection and prediction of adverse health events affecting the well-being of PLWD. This work describes the TIHM dataset collected during the second phase (i.e., feasibility study) of the TIHM project. The data was collected from the homes of 56 PLWD and comprises events and clinical observations (daily activity, physiological monitoring, and labels for health-related conditions). The study recorded an average of 50 days of data per participant, totalling 2803 days.

    We have provided raw data and guidelines on how to access, visualise, manipulate and predict health-related events within the dataset, available in the GitHub repository. The Jupyter Notebooks have been developed using Python 3.9.
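    As an illustrative sketch only (the file and column names below are assumptions; the repository notebooks are the authoritative reference), daily activity counts per participant could be derived with pandas:

```python
import pandas as pd

# Hypothetical file and column names; check the repository for the schema.
activity = pd.read_csv("Activity.csv", parse_dates=["date"])

# Count in-home activity events per participant per day.
daily = (activity
         .groupby(["patient_id", pd.Grouper(key="date", freq="D")])
         .size()
         .rename("events_per_day"))
print(daily.head())
```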

    The dataset is provided for research and patient benefit purposes.
    Please acknowledge the Surrey and Borders Partnership NHS Foundation Trust in any publication or use of this dataset.

  3. Data from: MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 7, 2021
    Cite
    Ryo Tanabe; Harsh Purohit; Kota Dohi; Takashi Endo; Yuki Nikaido; Toshiki Nakamura; Yohei Kawaguchi (2021). MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions [Dataset]. http://doi.org/10.5281/zenodo.4740355
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryo Tanabe; Harsh Purohit; Kota Dohi; Takashi Endo; Yuki Nikaido; Toshiki Nakamura; Yohei Kawaguchi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions (MIMII DUE). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, pumps, slide rails, and valves. The data for each machine type includes six subsets called "sections", and each section roughly corresponds to a single product. Each section consists of data from two domains, called the source domain and the target domain, with different conditions such as operating speed and environmental noise. This dataset is a subset of the dataset for DCASE 2021 Challenge Task 2, so it is identical to the data included in the development dataset and the additional training dataset. For more information, please see this paper and the pages of the development dataset and the task description for DCASE 2021 Challenge Task 2.

    Baseline system

    Two simple baseline systems are available in the GitHub repositories [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives reasonable performance on the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
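    To illustrate the kind of entry-level approach such baselines take (this is not the actual baseline code; the paths, feature settings, and model below are assumptions), an autoencoder can be trained on normal sounds and its reconstruction error used as the anomaly score:

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

def log_mel_frames(path, n_mels=64, frames=5):
    """Load a clip and stack consecutive log-mel frames into vectors."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return np.array([log_mel[:, i:i + frames].ravel()
                     for i in range(log_mel.shape[1] - frames + 1)])

# Hypothetical paths; the real layout follows the section/domain structure.
train = log_mel_frames("fan/train/normal_0001.wav")
test = log_mel_frames("fan/test/anomaly_0001.wav")

# Fit a small autoencoder to reconstruct normal sounds only; a high
# reconstruction error on a test clip indicates an anomaly.
ae = MLPRegressor(hidden_layer_sizes=(64, 8, 64), max_iter=200)
ae.fit(train, train)
score = np.mean((ae.predict(test) - test) ** 2)
print(f"anomaly score: {score:.3f}")
```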

    Conditions of use

    This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Publication

    If you use this dataset, please cite the following paper:

    Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi, "MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions," arXiv preprint arXiv:2105.02702, 2021. [URL]


    Feedback

    If there is any problem, please contact us:

  4. Gridded population maps of Germany from disaggregated census data and...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Mar 13, 2021
    Cite
    Franz Schug; David Frantz; Sebastian van der Linden; Patrick Hostert (2021). Gridded population maps of Germany from disaggregated census data and bottom-up estimates [Dataset]. http://doi.org/10.5281/zenodo.4601292
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 13, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Franz Schug; David Frantz; Sebastian van der Linden; Patrick Hostert
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    This dataset features three gridded population datasets of Germany on a 10m grid. The units are people per grid cell.

    Datasets

    DE_POP_VOLADJ16: This dataset was produced by disaggregating national census counts to 10m grid cells based on a weighted dasymetric mapping approach. A building density, building height and building type dataset were used as underlying covariates, with an adjusted volume for multi-family residential buildings.

    DE_POP_TDBP: This dataset is considered the best product, based on a dasymetric mapping approach that disaggregated municipal census counts to 10m grid cells using the same three underlying covariate layers.

    DE_POP_BU: This dataset is based on a bottom-up gridded population estimate. A building density, building height and building type layer were used to compute a living floor area dataset on a 10m grid. The bottom-up estimate was then derived using federal statistics on the average living floor area per capita.

    Please refer to the related publication for details.

    Temporal extent

    The building density layer is based on Sentinel-2 time series data from 2018 and Sentinel-1 time series data from 2017 (doi: 10.1594/PANGAEA.920894).

    The building height layer is representative of ca. 2015 (doi: 10.5281/zenodo.4066295).

    The building types layer is based on Sentinel-2 time series data from 2018 and Sentinel-1 time series data from 2017 (doi: 10.5281/zenodo.4601219).

    The underlying census data is from 2018.

    Data format

    The data come in tiles of 30x30km (see shapefile). The projection is EPSG:3035. The images are compressed GeoTiff files (*.tif). There is a mosaic in GDAL Virtual format (*.vrt), which can readily be opened in most Geographic Information Systems.
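    A minimal sketch of reading the mosaic with rasterio (the VRT file name is an assumption):

```python
import rasterio

# Hypothetical VRT name; open the GDAL Virtual mosaic described above.
with rasterio.open("DE_POP_VOLADJ16.vrt") as src:
    print(src.crs)  # expected: EPSG:3035
    total = 0.0
    # Sum people per grid cell window by window to keep memory bounded.
    for _, window in src.block_windows(1):
        total += src.read(1, window=window, masked=True).sum()
print(f"total population: {total:,.0f}")
```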

    Further information

    For further information, please see the publication or contact Franz Schug (franz.schug@geo.hu-berlin.de).
    A web-visualization of this dataset is available here.

    Publication

    Schug, F., Frantz, D., van der Linden, S., & Hostert, P. (2021). Gridded population mapping for Germany based on building density, height and type from Earth Observation data using census disaggregation and bottom-up estimates. PLOS ONE. DOI: 10.1371/journal.pone.0249044

    Acknowledgements

    Census data were provided by the German Federal Statistical Offices.

    Funding
    This dataset was produced with funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (MAT_STOCKS, grant agreement No 741950).

  5. Data from: MIMII Dataset: Sound Dataset for Malfunctioning Industrial...

    • zenodo.org
    zip
    Updated Feb 29, 2020
    Cite
    Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido; Kaori Suefusa; Yohei Kawaguchi (2020). MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection [Dataset]. http://doi.org/10.5281/zenodo.3384388
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 29, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido; Kaori Suefusa; Yohei Kawaguchi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated by four types of industrial machines, i.e., valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contains normal sounds (from 5000 to 10000 seconds) and anomalous sounds (about 1000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). In addition, background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded with an eight-channel microphone array at a 16 kHz sampling rate and 16 bits per sample. The MIMII dataset serves as a benchmark for sound-based machine fault diagnosis. Users can test performance on specific tasks, e.g., unsupervised anomaly detection, transfer learning, and noise robustness. The details of the dataset are described in [1] and [2].

    This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

    A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/
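    A hedged sketch of reading one recording (the path below is an assumption; the eight-channel, 16 kHz, 16-bit format is as described above):

```python
import soundfile as sf

# Hypothetical path; recordings are 8-channel, 16 kHz, 16-bit WAV files.
audio, sr = sf.read("pump/id_00/normal/00000000.wav")
assert sr == 16000 and audio.shape[1] == 8

# Single-channel processing typically starts from one channel, e.g. channel 0.
mono = audio[:, 0]
```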

    *1: This version "public 1.0" contains four models (model ID 00, 02, 04, and 06). The remaining three models will be released in a future edition.

    [1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.

    [2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.

  6. 100,000 histological images of human colorectal cancer and healthy tissue

    • zenodo.org
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Jakob Nikolas Kather; Niels Halama; Alexander Marx (2020). 100,000 histological images of human colorectal cancer and healthy tissue [Dataset]. http://doi.org/10.5281/zenodo.1214456
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jakob Nikolas Kather; Niels Halama; Alexander Marx
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description "NCT-CRC-HE-100K"

    • This is a set of 100,000 non-overlapping image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and normal tissue.
    • All images are 224x224 pixels (px) at 0.5 microns per pixel (MPP). All images are color-normalized using Macenko's method (http://ieeexplore.ieee.org/abstract/document/5193250/, DOI 10.1109/ISBI.2009.5193250).
    • Tissue classes are: Adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), colorectal adenocarcinoma epithelium (TUM).
    • These images were manually extracted from N=86 H&E stained human cancer tissue slides from formalin-fixed paraffin-embedded (FFPE) samples from the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany). Tissue samples contained CRC primary tumor slides and tumor tissue from CRC liver metastases; normal tissue classes were augmented with non-tumorous regions from gastrectomy specimens to increase variability.
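    Assuming the archive unpacks into one folder per tissue class (an assumption; check the extracted layout), the patches can be loaded with torchvision's ImageFolder:

```python
from torchvision import datasets, transforms

# Hypothetical extraction path with one subfolder per tissue class.
tfm = transforms.ToTensor()
train_set = datasets.ImageFolder("NCT-CRC-HE-100K", transform=tfm)
print(train_set.classes)  # e.g. ['ADI', 'BACK', 'DEB', ...]
print(len(train_set))     # 100,000 patches of 224x224 px
```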

    Ethics statement "NCT-CRC-HE-100K"

    All experiments were conducted in accordance with the Declaration of Helsinki, the International Ethical Guidelines for Biomedical Research Involving Human Subjects (CIOMS), the Belmont Report and the U.S. Common Rule. Anonymized archival tissue samples were retrieved from the tissue bank of the National Center for Tumor Diseases (NCT, Heidelberg, Germany) in accordance with the regulations of the tissue bank and the approval of the ethics committee of Heidelberg University (tissue bank decision numbers 2152 and 2154, granted to Niels Halama and Jakob Nikolas Kather; informed consent was obtained from all patients as part of the NCT tissue bank protocol, ethics board approval S-207/2005, renewed on 20 Dec 2017). Another set of tissue samples was provided by the pathology archive at UMM (University Medical Center Mannheim, Heidelberg University, Mannheim, Germany) after approval by the institutional ethics board (Ethics Board II at University Medical Center Mannheim, decision number 2017-806R-MA, granted to Alexander Marx and waiving the need for informed consent for this retrospective and fully anonymized analysis of archival samples).

    Data set "CRC-VAL-HE-7K"

    This is a set of 7180 image patches from N=50 patients with colorectal adenocarcinoma (no overlap with patients in NCT-CRC-HE-100K). It can be used as a validation set for models trained on the larger data set. Like in the larger data set, images are 224x224 px at 0.5 MPP. All tissue samples were provided by the NCT tissue bank, see above for further details and ethics statement.

    Data set "NCT-CRC-HE-100K-NONORM"

    This is a slightly different version of the "NCT-CRC-HE-100K" image set: it contains 100,000 images in 9 tissue classes at 0.5 MPP and was created from the same raw data as "NCT-CRC-HE-100K". However, no color normalization was applied to these images. Consequently, staining intensity and color vary slightly between the images. Please note that although this image set was created from the same data as "NCT-CRC-HE-100K", the image regions are not completely identical because the selection of non-overlapping tiles from the raw images was a stochastic process.

    General comments

    Please note that the classes are only roughly balanced. Classifiers should never be evaluated based on accuracy in the full set alone. Also, if a high risk of training bias is expected, balancing the number of cases per class is recommended.
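    As a hedged sketch of class-aware evaluation with scikit-learn (the labels below are placeholders, not real predictions), balanced accuracy averages per-class recall and so is not dominated by the most frequent classes:

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

# Placeholder labels and predictions; substitute real model outputs.
y_true = ["TUM", "NORM", "TUM", "STR", "ADI", "TUM"]
y_pred = ["TUM", "NORM", "STR", "STR", "ADI", "NORM"]

# Balanced accuracy = mean of per-class recall.
print(balanced_accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```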

