17 datasets found
  1. PTB XL Dataset - Reformatted

    • kaggle.com
    zip
    Updated Feb 23, 2021
    Cite
    Kun Hao Yeh (2021). PTB XL Dataset - Reformatted [Dataset]. https://www.kaggle.com/khyeh0719/ptb-xl-dataset-reformatted
    Available download formats: zip (502936797 bytes)
    Dataset updated
    Feb 23, 2021
    Authors
    Kun Hao Yeh
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.

    The PTB-XL ECG dataset is a large dataset of 21837 clinical 12-lead ECGs from 18885 patients of 10 second length. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The in total 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements as well as annotated signal properties.

    Background

    The waveform data underlying the PTB-XL ECG dataset was collected with devices from Schiller AG over the course of nearly seven years between October 1989 and June 1996. With the acquisition of the original database from Schiller AG, the full usage rights were transferred to the PTB. The records were curated and converted into a structured database within a long-term project at the Physikalisch-Technische Bundesanstalt (PTB). The database was used in a number of publications, see e.g. [1,2], but the access remained restricted until now. The Institutional Ethics Committee approved the publication of the anonymous data in an open-access database (PTB-2020-1). During the public release process in 2019, the existing database was streamlined with particular regard to usability and accessibility for the machine learning community. Waveform and metadata were converted to open data formats that can easily be processed by standard software.

    Method

    This dataset is generated by processing the raw dataset with this notebook.

    Dataset Details

    Files

    • train_12_lead_ecgs.pkl- ECG signals of the train set, stored as a pickled NumPy array.
    • valid_12_lead_ecgs.pkl- ECG signals of the valid set, stored as a pickled NumPy array.
    • test_12_lead_ecgs.pkl- ECG signals of the test set, stored as a pickled NumPy array.
    • train_table.csv- patient metadata features and ECG diagnoses for the train set.
    • valid_table.csv- patient metadata features and ECG diagnoses for the valid set.
    • test_table.csv- patient metadata features and ECG diagnoses for the test set.

    How to work with pickle files:

    import pandas as pd
    
    train_ecgs = pd.read_pickle('train_12_lead_ecgs.pkl') 
    
    # train_ecgs is of shape (number of ECG records, 1000, 12)
    # 1000 is signal data points for each ECG record 
    # 12 stands for 12-channel from 12-lead
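
    As a rough illustration of the shape convention in the comments above, the sketch below uses a small synthetic array in place of the real file (no download is assumed) and derives the sampling rate implied by 1000 points per 10-second record. The helper name summarize_ecg_array and its duration_s argument are illustrative assumptions, not part of the dataset.

```python
import numpy as np


def summarize_ecg_array(ecgs, duration_s=10.0):
    """Return basic shape facts for an array laid out as (records, samples, leads)."""
    if ecgs.ndim != 3:
        raise ValueError("expected a 3-D array: (records, samples, leads)")
    n_records, n_samples, n_leads = ecgs.shape
    return {
        "n_records": n_records,
        "n_samples": n_samples,
        "n_leads": n_leads,
        # Samples per second, assuming every record spans duration_s seconds
        "sampling_rate_hz": n_samples / duration_s,
    }


# Synthetic stand-in for pd.read_pickle('train_12_lead_ecgs.pkl')
fake_ecgs = np.zeros((4, 1000, 12), dtype=np.float32)
summary = summarize_ecg_array(fake_ecgs)

# Slice out a single channel for every record: shape (records, samples)
lead_0 = fake_ecgs[:, :, 0]
```

    Indexing the last axis selects a channel; which physical lead sits at which index is not stated in this listing, so it should be checked against the generating notebook.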
    

    Columns

    • ecg_id- ID used in the raw data from: https://www.kaggle.com/khyeh0719/ptb-xl-dataset and paper
    • strat_fold- stratified fold as suggested from the paper
    • age, sex, height, weight, nurse, site, device- patient's information
    • NORM- Diagnosis for normal ECG.
    • MI- Diagnosis for Myocardial Infarction. A myocardial infarction (MI), commonly known as a heart attack, occurs when blood flow decreases or stops to a part of the heart, causing damage to the heart muscle.
    • STTC- Diagnosis for ST/T Change. ST and T wave changes may represent cardiac pathology or be a normal variant. Interpretation of the findings therefore depends on the clinical context and the presence of similar findings on prior electrocardiograms.
    • CD- Diagnosis for Conduction Disturbance. Your heart rhythm is the way your heart beats. Conduction is how electrical impulses travel through your heart, which causes it to beat. Some conduction disorders can cause arrhythmias or irregular heartbeats.
    • HYP- Diagnosis for Hypertrophy. Hypertrophic cardiomyopathy (HCM) is a disease in which the heart muscle becomes abnormally thick (hypertrophied). The thickened heart muscle can make it harder for the heart to pump blood.
    • sub_- Columns with the 'sub_' prefix hold more detailed ECG diagnoses.
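
    A minimal sketch of turning the five diagnostic columns into a multi-label target is given below. It assumes, without confirmation from this listing, that NORM, MI, STTC, CD and HYP hold 0/1 indicator values; the inline table is made up and only stands in for train_table.csv.

```python
import csv
import io

SUPERCLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]


def label_vector(row, classes=SUPERCLASSES):
    """Turn one table row into a multi-hot list (assumes 0/1 column values)."""
    return [int(float(row[name])) for name in classes]


# Tiny made-up table standing in for train_table.csv
sample_csv = io.StringIO(
    "ecg_id,strat_fold,NORM,MI,STTC,CD,HYP\n"
    "1,3,1,0,0,0,0\n"
    "2,2,0,1,1,0,0\n"
)

# Map each ecg_id to its multi-hot label vector
labels = {row["ecg_id"]: label_vector(row) for row in csv.DictReader(sample_csv)}
```

    A record may carry several diagnoses at once, which is why each target is a vector rather than a single class index.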
  2. PTB-XL ECG dataset

    • kaggle.com
    • opendatalab.com
    Updated Feb 3, 2021
    Cite
    khyeh (2021). PTB-XL ECG dataset [Dataset]. https://www.kaggle.com/khyeh0719/ptb-xl-dataset/code
    Dataset updated
    Feb 3, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    khyeh
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source: https://physionet.org/content/ptb-xl/1.0.1/

    Abstract

    Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.

    The PTB-XL ECG dataset is a large dataset of 21837 clinical 12-lead ECGs from 18885 patients of 10 second length. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The in total 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements as well as annotated signal properties.

    Background

    The waveform data underlying the PTB-XL ECG dataset was collected with devices from Schiller AG over the course of nearly seven years between October 1989 and June 1996. With the acquisition of the original database from Schiller AG, the full usage rights were transferred to the PTB. The records were curated and converted into a structured database within a long-term project at the Physikalisch-Technische Bundesanstalt (PTB). The database was used in a number of publications, see e.g. [1,2], but the access remained restricted until now. The Institutional Ethics Committee approved the publication of the anonymous data in an open-access database (PTB-2020-1). During the public release process in 2019, the existing database was streamlined with particular regard to usability and accessibility for the machine learning community. Waveform and metadata were converted to open data formats that can easily be processed by standard software.

    Methods

    Data Acquisition

    1. Raw signal data was recorded and stored in a proprietary compressed format. For all signals, we provide the standard set of 12 leads (I, II, III, AVL, AVR, AVF, V1, ..., V6) with reference electrodes on the right arm.
    2. The corresponding general metadata (such as age, sex, weight and height) was collected in a database.
    3. Each record was annotated with a report string (generated by cardiologist or automatic interpretation by ECG-device) which was converted into a standardized set of SCP-ECG statements (scp_codes). For most records also the heart’s axis (heart_axis) and infarction stadium (infarction_stadium1 and infarction_stadium2, if present) were extracted.
    4. A large fraction of the records was validated by a second cardiologist.
    5. All records were validated by a technical expert focusing mainly on signal characteristics.

    Data Preprocessing

    ECGs and patients are identified by unique identifiers (ecg_id and patient_id). Personal information in the metadata, such as the names of validating cardiologists and nurses and the recording site (hospital etc.), was pseudonymized. The date of birth is given only as the age at the time of the ECG recording, where ages of more than 89 years appear in the range of 300 years in compliance with HIPAA standards. Furthermore, all ECG recording dates were shifted by a random offset for each patient. The ECG statements used for annotating the records follow the SCP-ECG standard [3].
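
    The age convention above means that values beyond 89 are placeholders rather than real ages. A small guard such as the following sketch can keep them out of summary statistics; the function name decode_age and the choice to return None for masked ages are illustrative assumptions.

```python
def decode_age(age, threshold=89):
    """Return (usable_age, is_masked).

    Ages above the threshold are placeholders (reported around 300),
    so they are flagged instead of being used as numbers.
    """
    if age is None:
        return None, False
    if age > threshold:
        return None, True
    return age, False


ages = [62, 300, 89, None]
decoded = [decode_age(a) for a in ages]

# Keep only ages that are safe to average
usable = [a for a, masked in decoded if a is not None and not masked]
```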

    Data Description

    In general, the dataset is organized as follows:

        ptbxl
        ├── ptbxl_database.csv
        ├── scp_statements.csv
        ├── records100
        │   ├── 00000
        │   │   ├── 00001_lr.dat
        │   │   ├── 00001_lr.hea
        │   │   ├── ...
        │   │   ├── 00999_lr.dat
        │   │   └── 00999_lr.hea
        │   ├── ...
        │   └── 21000
        │       ├── 21001_lr.dat
        │       ├── 21001_lr.hea
        │       ├── ...
        │       ├── 21837_lr.dat
        │       └── 21837_lr.hea
        └── records500
            ├── 00000
            │   ├── 00001_hr.dat
            │   ├── 00001_hr.hea
            │   ├── ...
            │   ├── 00999_hr.dat
            │   └── 00999_hr.hea
            ├── ...
            └── 21000
                ├── 21001_hr.dat
                ├── 21001_hr.hea
                ├── ...
                ├── 21837_hr.dat
                └── 21837_hr.hea

    The dataset comprises 21837 clinical 12-lead ECG records of 10 seconds length from 18885 patients, where 52% are male and 48% are female with ages covering the whole range from 0 to 95 years (median 62 and interquartile range of 22). The value of the dataset results from the comprehensive collection of many different co-occurring path...
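
    The folder layout suggests that each record sits in a subfolder named after its ecg_id rounded down to the nearest thousand. The sketch below builds a record path on that basis. The function name record_path is hypothetical, and the pairing of records100/_lr with 100 Hz and records500/_hr with 500 Hz is an assumption not spelled out in the truncated text above.

```python
def record_path(ecg_id, sampling_rate=100):
    """Build the extension-less path of a record from its ecg_id."""
    if sampling_rate == 100:
        folder, suffix = "records100", "lr"
    elif sampling_rate == 500:
        folder, suffix = "records500", "hr"
    else:
        raise ValueError("sampling_rate must be 100 or 500")
    # Subfolder is the id rounded down to a multiple of 1000, zero-padded
    subfolder = (ecg_id // 1000) * 1000
    return f"{folder}/{subfolder:05d}/{ecg_id:05d}_{suffix}"


first = record_path(1)           # low-resolution variant of record 1
last = record_path(21837, 500)   # high-resolution variant of record 21837
```

    Appending .dat or .hea to the returned string gives the signal and header files shown in the tree.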

  3. PTB-XL, a large publicly available electrocardiography dataset

    • physionet.org
    • maplerate.net
    Updated Nov 9, 2022
    Cite
    Patrick Wagner; Nils Strodthoff; Ralf-Dieter Bousseljot; Wojciech Samek; Tobias Schaeffter (2022). PTB-XL, a large publicly available electrocardiography dataset [Dataset]. http://doi.org/10.13026/kfzx-aw45
    Dataset updated
    Nov 9, 2022
    Authors
    Patrick Wagner; Nils Strodthoff; Ralf-Dieter Bousseljot; Wojciech Samek; Tobias Schaeffter
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.

    The PTB-XL ECG dataset is a large dataset of 21799 clinical 12-lead ECGs from 18869 patients of 10 second length. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The in total 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements as well as annotated signal properties.

  4. MedalCare-XL Dataset

    • paperswithcode.com
    • zenodo.org
    Updated Nov 28, 2022
    Cite
    (2022). MedalCare-XL Dataset [Dataset]. https://paperswithcode.com/dataset/medalcare-xl
    Dataset updated
    Nov 28, 2022
    Description

    Mechanistic cardiac electrophysiology models allow for personalized simulations of the electrical activity in the heart and the ensuing electrocardiogram (ECG) on the body surface. As such, synthetic signals possess precisely known ground truth labels of the underlying disease (model parameterization) and can be employed for validation of machine learning ECG analysis tools in addition to clinical signals. Recently, synthetic ECG signals were used to enrich sparse clinical data for machine learning or even replace them completely during training leading to good performance on real-world clinical test data.

    We thus generated a large synthetic database comprising a total of 16,900 12-lead ECGs based on multi-scale electrophysiological simulations equally distributed into 1 normal healthy control and 7 pathology classes. The pathological case of myocardial infarction had 6 sub-classes. A comparison of extracted timing and amplitude features between the virtual cohort and a large publicly available clinical ECG database demonstrated that the synthetic signals represent clinical ECGs for healthy and pathological subpopulations with high fidelity. The novel dataset of simulated ECG signals is split into training, validation and test data folds for development of novel machine learning algorithms and their objective assessment.

    The folder WP2_largeDataset_Noise contains the 12-lead ECGs of 10 seconds length. Each ECG is stored in a separate CSV file with one row per lead (lead order: I, II, III, aVR, aVL, aVF, V1-V6) and one sample per column (sampling rate: 500 Hz). Data are split by pathologies (avblock = AV block, lbbb = left bundle branch block, rbbb = right bundle branch block, sinus = normal sinus rhythm, lae = left atrial enlargement, fam = fibrotic atrial cardiomyopathy, iab = interatrial conduction block, mi = myocardial infarction). MI data are further split into subclasses depending on the occlusion site (LAD, LCX, RCA) and transmurality (0.3 or 1.0). Each pathology subclass contains training, validation and testing data (~ 70/15/15 split). Training, validation and testing datasets were defined according to the model with which QRST complexes were simulated, i.e., ECGs calculated with the same anatomical model but different electrophysiological parameters are only present in one of the test, validation and training datasets but never in multiple. Each subfolder also contains a "siginfo.csv" file specifying the respective simulation run for the P wave and the QRST segment that was used to synthesize the 10 second ECG segment. Each signal is available in three variations:

    • run_*_raw.csv contains the synthesized ECG without added noise and without filtering.
    • run_*_noise.csv contains the synthesized ECG (unfiltered) with superimposed noise.
    • run_*_filtered.csv contains the filtered synthesized ECG (filter settings: highpass cutoff frequency 0.5 Hz, lowpass cutoff frequency 150 Hz, Butterworth filters of order 3).
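
    The one-row-per-lead layout can be read with the standard library alone. The sketch below assumes comma-separated values and no header row, neither of which is confirmed by the description; read_ecg_csv and duration_seconds are hypothetical helper names, and the in-memory file is synthetic.

```python
import csv
import io

# Lead order and sampling rate as stated in the dataset description
LEAD_ORDER = ["I", "II", "III", "aVR", "aVL", "aVF",
              "V1", "V2", "V3", "V4", "V5", "V6"]
SAMPLING_RATE_HZ = 500


def read_ecg_csv(handle):
    """Parse a file with one row per lead into {lead_name: [samples]}."""
    rows = [[float(v) for v in row] for row in csv.reader(handle) if row]
    if len(rows) != len(LEAD_ORDER):
        raise ValueError(f"expected {len(LEAD_ORDER)} rows, got {len(rows)}")
    return dict(zip(LEAD_ORDER, rows))


def duration_seconds(ecg):
    """Length of the recording implied by the number of columns."""
    return len(ecg["I"]) / SAMPLING_RATE_HZ


# Synthetic stand-in for one CSV file: 12 rows with 5 samples each
fake_file = io.StringIO(
    "\n".join(",".join(str(lead + s) for s in range(5)) for lead in range(12))
)
ecg = read_ecg_csv(fake_file)
```

    For a real 10-second file the same arithmetic gives 5000 columns per row.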

    The folder WP2_largeDataset_ParameterFiles contains the parameter files used to simulate the 12-lead ECGs. Parameters are split for atrial and ventricular simulations, which were run independently from one another. See Gillette, Gsell, Nagel* et al. "MedalCare-XL: 16,900 healthy and pathological electrocardiograms obtained through multi-scale electrophysiological models" for a description of the model parameters.

  5. daily-historical-stock-price-data-for-destination-xl-group-inc-19872025

    • huggingface.co
    Cite
    Khaled Ben Ali, daily-historical-stock-price-data-for-destination-xl-group-inc-19872025 [Dataset]. https://huggingface.co/datasets/khaledxbenali/daily-historical-stock-price-data-for-destination-xl-group-inc-19872025
    Authors
    Khaled Ben Ali
    Description

    📈 Daily Historical Stock Price Data for Destination XL Group, Inc. (1987–2025)

    A clean, ready-to-use dataset containing daily stock prices for Destination XL Group, Inc. from 1987-06-02 to 2025-05-28. This dataset is ideal for use in financial analysis, algorithmic trading, machine learning, and academic research.

      🗂️ Dataset Overview
    

    • Company: Destination XL Group, Inc.
    • Ticker Symbol: DXLG
    • Date Range: 1987-06-02 to 2025-05-28
    • Frequency: Daily
    • Total Records: 9571 rows…

    See the full description on the dataset page: https://huggingface.co/datasets/khaledxbenali/daily-historical-stock-price-data-for-destination-xl-group-inc-19872025.

  6. Prosit-XL Models Data

    • zenodo.org
    bin
    Updated Nov 8, 2024
    Cite
    Mostafa Kalhor; Mathias Wilhelm (2024). Prosit-XL Models Data [Dataset]. http://doi.org/10.5281/zenodo.14046213
    Available download formats: bin
    Dataset updated
    Nov 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mostafa Kalhor; Mathias Wilhelm
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training data used to train fragment ion intensity Prosit-XL models.

  7. Preprocessed CPSC and PTB-XL Data

    • figshare.com
    zip
    Updated Nov 6, 2024
    Cite
    Vanessa Borst (2024). Preprocessed CPSC and PTB-XL Data [Dataset]. http://doi.org/10.6084/m9.figshare.25532869.v3
    Available download formats: zip
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vanessa Borst
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CPSC 2018

    The first dataset is a preprocessed version of the CPSC 2018 dataset, which contains 6877 ECG recordings. We preprocessed the dataset by resampling the ECG signals to 250 Hz and equalizing the ECG signal length to 60 seconds, yielding a signal length of T=15,000 data points per recording.

    For the hyperparameter study, we employed a fixed train-valid-test split with ratio 60-20-20, while for the final evaluations, including the comparison with the state-of-the-art methods and ablation studies, we used a 10-fold cross-validation strategy.

    The raw CPSC 2018 dataset can be downloaded from the website of the PhysioNet/Computing in Cardiology Challenge 2020. (License: Creative Commons Attribution 4.0 International Public License).

    PTB-XL (Super-Diag.)

    The second dataset is a pre-processed version of PTB-XL, a large multi-label dataset of 21,799 clinical 12-lead ECG records of 10 seconds each. PTB-XL contains 71 ECG statements, categorized into 44 diagnostic, 19 form, and 12 rhythmic classes. In addition, the diagnostic category can be divided into 24 sub- and 5 coarse-grained super-classes. In our pre-processed version, we utilize the super-diagnostic labels for classification and the recommended train-valid-test splits, sampled at 100 Hz. We select only samples with at least one label in the super-diagnostic category, without applying any further preprocessing.

    The raw PTB-XL dataset can be downloaded from the PhysioNet/PTB-XL website. (License: Creative Commons Attribution 4.0 International Public License).
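
    The CPSC 2018 figure of T=15,000 follows from 250 Hz times 60 seconds. The description does not say how lengths were equalized, so the sketch below simply assumes cropping of longer signals and zero-padding of shorter ones at the end; fix_length is a hypothetical helper, not the authors' preprocessing code.

```python
def fix_length(signal, sampling_rate_hz=250, seconds=60, pad_value=0.0):
    """Crop or right-pad a 1-D signal to sampling_rate_hz * seconds samples.

    Assumption: crop at the end, pad with a constant. The dataset
    description does not specify the actual equalization strategy.
    """
    target = sampling_rate_hz * seconds
    if len(signal) >= target:
        return list(signal[:target])
    return list(signal) + [pad_value] * (target - len(signal))


short = fix_length([1.0] * 100)          # padded up to 15000 samples
long = fix_length(list(range(20000)))    # cropped down to 15000 samples
```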

  8. Data from: Prosit-XL: enhanced cross-linked peptide identification by...

    • ebi.ac.uk
    Updated May 18, 2025
    Cite
    Mostafa Kalhor (2025). Prosit-XL: enhanced cross-linked peptide identification by accurate fragment intensity prediction to study protein-protein interactions and protein structures [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD057705
    Dataset updated
    May 18, 2025
    Authors
    Mostafa Kalhor
    Variables measured
    Proteomics
    Description

    It has been shown that integrating peptide property predictions such as fragment intensity into the scoring process of peptide spectrum match can greatly increase the number of confidently identified peptides compared to using traditional scoring methods. Here, we introduce Prosit-XL, a robust and accurate fragment intensity predictor covering the cleavable (DSSO/DSBU) and non-cleavable cross-linkers (DSS/BS3), achieving high accuracy on various holdout sets with consistent performance on external datasets without fine-tuning. Due to the complex nature of false positives in XL-MS, a novel approach to data-driven rescoring was developed that benefits from Prosit-XL’s predictions while limiting the overestimation of the false discovery rate (FDR). We first evaluated this approach using two ground truth datasets (PXD029252, PXD042173) that demonstrate the accurate and precise FDR estimation. Second, we applied Prosit-XL on a proteome-scale dataset (JPST000845, PXD017711), demonstrating an up to ~3.4-fold improvement in PPI discovery compared to classic approaches. Finally, Prosit-XL was used to increase the coverage and depth of a spatially resolved interactome map of intact human cytomegalovirus virions (PXD031911), leading to the discovery of previously unobserved interactions between human and cytomegalovirus proteins.

  9. Data from: EyeFi: Fast Human Identification Through Vision and WiFi-based...

    • data.niaid.nih.gov
    Updated Dec 4, 2022
    Cite
    Shiwei Fang (2022). EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3882103
    Dataset updated
    Dec 4, 2022
    Dataset provided by
    Shahriar Nirjon
    Sirajum Munir
    Tamzeed Islam
    Shiwei Fang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EyeFi Dataset

    This dataset is collected as a part of the EyeFi project at Bosch Research and Technology Center, Pittsburgh, PA, USA. The dataset contains WiFi CSI values of human motion trajectories along with ground truth location information captured through a camera. This dataset is used in the following paper "EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching" that is published in the IEEE International Conference on Distributed Computing in Sensor Systems 2020 (DCOSS '20). We also published a dataset paper titled as "Dataset: Person Tracking and Identification using Cameras and Wi-Fi Channel State Information (CSI) from Smartphones" in Data: Acquisition to Analysis 2020 (DATA '20) workshop describing details of data collection. Please check it out for more information on the dataset.

    Clarification/Bug report: Please note that the order of antennas and subcarriers in .h5 files is not written clearly in the README.md file. The order of antennas and subcarriers are as follows for the 90 csi_real and csi_imag values : [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3,… subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. Please see the description below. The newer version of the dataset contains this information in README.md. We are sorry for the inconvenience.
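
    The ordering described in the clarification amounts to a subcarrier-major layout with three antennas per subcarrier. A minimal sketch of the index arithmetic follows; the helper names are illustrative and the functions operate on a plain list of 90 values rather than on the .h5 files themselves.

```python
N_SUBCARRIERS = 30
N_ANTENNAS = 3


def csi_index(subcarrier, antenna):
    """Flat 0-based position of a 1-based (subcarrier, antenna) pair."""
    return (subcarrier - 1) * N_ANTENNAS + (antenna - 1)


def csi_position(index):
    """Inverse of csi_index: 0-based position -> 1-based (subcarrier, antenna)."""
    subcarrier, antenna = divmod(index, N_ANTENNAS)
    return subcarrier + 1, antenna + 1


def group_by_subcarrier(values):
    """Split 90 flat values into 30 lists of 3 antenna readings each."""
    return [list(values[i:i + N_ANTENNAS])
            for i in range(0, len(values), N_ANTENNAS)]


# Dummy packet whose values equal their own flat index
grouped = group_by_subcarrier(list(range(N_SUBCARRIERS * N_ANTENNAS)))
```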

    Data Collection Setup

    In our experiments, we used an Intel 5300 WiFi Network Interface Card (NIC) installed in an Intel NUC and Linux CSI tools [1] to extract the WiFi CSI packets. The (x,y) coordinates of the subjects are collected from a Bosch Flexidome IP Panoramic 7000 panoramic camera mounted on the ceiling, and Angles of Arrival (AoAs) are derived from the (x,y) coordinates. Both the WiFi card and the camera are located at the same origin coordinates but at different heights: the camera is located around 2.85 m from the ground and the WiFi antennas are around 1.12 m above the ground.

    The data collection environment consists of two areas. The first is a rectangular space measuring 11.8 m x 8.74 m, and the second is an irregularly shaped kitchen area with maximum distances of 19.74 m and 14.24 m between two walls. The kitchen also has numerous obstacles and different materials that pose different RF reflection characteristics, including strong reflectors such as metal refrigerators and dishwashers.

    To collect the WiFi data, we used a Google Pixel 2 XL smartphone as an access point and connected the Intel 5300 NIC to it for WiFi communication. The transmission rate is about 20-25 packets per second. The same WiFi card and phone are used in both the lab and the kitchen area.

    List of Files

    Here is a list of files included in the dataset:

        |- 1_person
           |- 1_person_1.h5
           |- 1_person_2.h5
        |- 2_people
           |- 2_people_1.h5
           |- 2_people_2.h5
           |- 2_people_3.h5
        |- 3_people
           |- 3_people_1.h5
           |- 3_people_2.h5
           |- 3_people_3.h5
        |- 5_people
           |- 5_people_1.h5
           |- 5_people_2.h5
           |- 5_people_3.h5
           |- 5_people_4.h5
        |- 10_people
           |- 10_people_1.h5
           |- 10_people_2.h5
           |- 10_people_3.h5
        |- Kitchen
           |- 1_person
              |- kitchen_1_person_1.h5
              |- kitchen_1_person_2.h5
              |- kitchen_1_person_3.h5
           |- 3_people
              |- kitchen_3_people_1.h5
        |- training
           |- shuffuled_train.h5
           |- shuffuled_valid.h5
           |- shuffuled_test.h5
        View-Dataset-Example.ipynb
        README.md

    In this dataset, the folders 1_person/, 2_people/, 3_people/, 5_people/, and 10_people/ contain data collected from the lab area, whereas the Kitchen/ folder contains data collected from the kitchen area. To see how each file is structured, please see the section Access the data below.

    The training folder contains the training dataset we used to train the neural network discussed in our paper. They are generated by shuffling all the data from 1_person/ folder collected in the lab area (1_person_1.h5 and 1_person_2.h5).

    Why multiple files in one folder?

    Each folder contains multiple files. For example, the 1_person folder has two files: 1_person_1.h5 and 1_person_2.h5. Files in the same folder always have the same number of human subjects present simultaneously in the scene. However, the person who is holding the phone can be different. Also, the data could have been collected on different days, and/or the data collection system may have needed a reboot due to stability issues. As a result, we provide different files (like 1_person_1.h5, 1_person_2.h5) to distinguish between different people holding the phone and possible system reboots that introduce different phase offsets (see below) in the system.

    Special note:

    The file 1_person_1.h5 was generated with the same person holding the phone throughout, whereas 1_person_2.h5 contains different people holding the phone, with only one person present in the area at a time. The two files were also collected on different days.

    Access the data

    To access the data, the HDF5 library is needed to open the dataset. A free HDF5 viewer is available on the official website: https://www.hdfgroup.org/downloads/hdfview/. We also provide example Python code, View-Dataset-Example.ipynb, to demonstrate how to access the data.

    Each file (except the files under the "training/" folder) is structured as:

        |- csi_imag
        |- csi_real
        |- nPaths_1
           |- offset_00
              |- spotfi_aoa
           |- offset_11
              |- spotfi_aoa
           |- offset_12
              |- spotfi_aoa
           |- offset_21
              |- spotfi_aoa
           |- offset_22
              |- spotfi_aoa
        |- nPaths_2
           |- offset_00
              |- spotfi_aoa
           |- offset_11
              |- spotfi_aoa
           |- offset_12
              |- spotfi_aoa
           |- offset_21
              |- spotfi_aoa
           |- offset_22
              |- spotfi_aoa
        |- nPaths_3
           |- offset_00
              |- spotfi_aoa
           |- offset_11
              |- spotfi_aoa
           |- offset_12
              |- spotfi_aoa
           |- offset_21
              |- spotfi_aoa
           |- offset_22
              |- spotfi_aoa
        |- nPaths_4
           |- offset_00
              |- spotfi_aoa
           |- offset_11
              |- spotfi_aoa
           |- offset_12
              |- spotfi_aoa
           |- offset_21
              |- spotfi_aoa
           |- offset_22
              |- spotfi_aoa
        |- num_obj
        |- obj_0
           |- cam_aoa
           |- coordinates
        |- obj_1
           |- cam_aoa
           |- coordinates
        ...
        |- timestamp

    The csi_real and csi_imag fields are the real and imaginary parts of the CSI measurements. The order of antennas and subcarriers is as follows for the 90 csi_real and csi_imag values: [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3,… subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. The nPaths_x groups hold the WiFi Angle of Arrival (AoA) calculated by SpotFi [2], with x being the number of multiple paths specified during the calculation. Under each nPaths_x group are offset_xx subgroups, where xx stands for the offset combination used to correct the phase offset during the SpotFi calculation. We measured the offsets as:

    Antennas    Offset 1 (rad)    Offset 2 (rad)
    1 & 2       1.1899            -2.0071
    1 & 3       1.3883            -1.8129

    The measurement is based on the work [3], where the authors state that there are two possible offsets between two antennas, which we measured by booting the device multiple times. The combinations of the offsets are used for the offset_xx naming. For example, offset_12 means that offset 1 between antennas 1 & 2 and offset 2 between antennas 1 & 3 were used in the SpotFi calculation.
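
    The naming rule can be expressed as a small lookup, sketched below with the measured values from the table. Treating a digit 0 (as in offset_00) as "no correction applied" is an assumption, since the description does not define it; offsets_for is a hypothetical helper name.

```python
# Measured phase offsets in radians, keyed by antenna pair and offset number
OFFSETS_RAD = {
    "1&2": {1: 1.1899, 2: -2.0071},
    "1&3": {1: 1.3883, 2: -1.8129},
}


def offsets_for(group_name):
    """Map a group name such as 'offset_12' to (offset for 1&2, offset for 1&3)."""
    digits = group_name.split("_")[1]
    first, second = int(digits[0]), int(digits[1])

    def pick(pair, which):
        # Assumption: digit 0 means no phase correction for that pair
        return 0.0 if which == 0 else OFFSETS_RAD[pair][which]

    return pick("1&2", first), pick("1&3", second)


example = offsets_for("offset_12")
```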

    The num_obj field stores the number of human subjects present in the scene. The obj_0 group is always the subject who is holding the phone. In each file, there are num_obj groups named obj_x. For each obj_x, we have the coordinates reported from the camera and cam_aoa, which is the AoA estimated from the camera-reported coordinates. The (x,y) coordinates and AoA listed here are chronologically ordered (except in the files in the training folder). They reflect the way the person carrying the phone moved in the space (for obj_0) and the way everyone else walked (for other obj_y, where y > 0).

    The timestamp is provided as a time reference for each WiFi packet.

    To access the data (Python):

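    import h5py

    data = h5py.File('3_people_3.h5', 'r')

    # Real and imaginary parts of the CSI measurements
    csi_real = data['csi_real'][()]
    csi_imag = data['csi_imag'][()]

    # Camera-derived AoA and (x,y) coordinates of obj_0 (the phone holder)
    cam_aoa = data['obj_0/cam_aoa'][()]
    cam_loc = data['obj_0/coordinates'][()]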

    For files inside the training/ folder:

    Files inside the training folder have a different data structure:

        |- nPath-1
           |- aoa
           |- csi_imag
           |- csi_real
           |- spotfi
        |- nPath-2
           |- aoa
           |- csi_imag
           |- csi_real
           |- spotfi
        |- nPath-3
           |- aoa
           |- csi_imag
           |- csi_real
           |- spotfi
        |- nPath-4
           |- aoa
           |- csi_imag
           |- csi_real
           |- spotfi

    The group nPath-x indicates the number of multiple paths specified during the SpotFi calculation. aoa is the camera-generated angle of arrival (AoA), which can be considered ground truth; csi_imag and csi_real are the imaginary and real components of the CSI value. spotfi holds the AoA values calculated by SpotFi. The SpotFi values are chosen based on the lowest median and mean error across 1_person_1.h5 and 1_person_2.h5. All the rows under the same nPath-x group are aligned (i.e., the first row of aoa corresponds to the first row of csi_imag, csi_real, and spotfi). There is no timestamp recorded, and the sequence of the data is not chronological because the rows are randomly shuffled from the 1_person_1.h5 and 1_person_2.h5 files.

    Citation

    If you use the dataset, please cite our paper:

    @inproceedings{eyefi2020, title={EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching}, author={Fang, Shiwei and Islam, Tamzeed and Munir, Sirajum and Nirjon, Shahriar}, booktitle={2020 IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS)},

  10. Data from: Visibility Region Spatial Distribution Dataset for XL-MIMO Arrays...

    • scidb.cn
    Updated Dec 31, 2024
    Cite
    dian zi yu xin xi xue bao (2024). Visibility Region Spatial Distribution Dataset for XL-MIMO Arrays [Dataset]. http://doi.org/10.57760/sciencedb.18626
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Science Data Bank
    Authors
    dian zi yu xin xi xue bao
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Abstract: Visibility Region (VR) information can be used to reduce the complexity of transmission design in EXtremely Large-scale massive Multiple-Input Multiple-Output (XL-MIMO) systems. Existing theoretical analysis and transmission design are mostly based on simplified VR models. In order to evaluate and analyze the performance of XL-MIMO in realistic propagation scenarios, this paper discloses a VR spatial distribution dataset for XL-MIMO systems, which is constructed in steps including environmental parameter setting, ray tracing simulation, field strength data preprocessing and VR determination. For typical urban scenarios, the dataset establishes the connections between user locations, field strength data, and VR data, with a total of hundreds of millions of data entries. Furthermore, the VR distribution is visualized and analyzed, and a VR-based XL-MIMO user access protocol is taken as an example use case, with its performance evaluated on the proposed VR dataset.

  11. PhysioNet Challenge 2020 Dataset

    • paperswithcode.com
    Updated Dec 30, 2020
    Cite
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna (2020). PhysioNet Challenge 2020 Dataset [Dataset]. https://paperswithcode.com/dataset/physionet-challenge-2020
    Dataset updated
    Dec 30, 2020
    Authors
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna
    Description

    Data

    The data for this Challenge are from multiple sources:

    • CPSC Database and CPSC-Extra Database
    • INCART Database
    • PTB and PTB-XL Database
    • The Georgia 12-lead ECG Challenge (G12EC) Database
    • Undisclosed Database

    The first source is the public (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from the CPSC2018 is NOT the test data from the CPSC2018. The test data of the CPSC2018 is included in the final private database that has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) 12-lead ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.

    The second source set is the public dataset from St Petersburg INCART 12-lead Arrhythmia Database. This database consists of 74 annotated recordings extracted from 32 Holter records. Each record is 30 minutes long and contains 12 standard leads, each sampled at 257 Hz.

    The third source from the Physikalisch Technische Bundesanstalt (PTB) comprises two public databases: the PTB Diagnostic ECG Database and the PTB-XL, a large publicly available electrocardiography dataset. The first PTB database contains 516 records (male: 377, female: 139). Each recording was sampled at 1000 Hz. The PTB-XL contains 21,837 clinical 12-lead ECGs (male: 11,379 and female: 10,458) of 10 second length with a sampling frequency of 500 Hz.

    The fourth source is a Georgia database which represents a unique demographic of the Southeastern United States. This training set contains 10,344 12-lead ECGs (male: 5,551, female: 4,793) of 10 second length with a sampling frequency of 500 Hz.

    The fifth source is an undisclosed American database that is geographically distinct from the Georgia database. This source contains 10,000 ECGs (all retained as test data).

    All data is provided in WFDB format. Each ECG recording has a binary MATLAB v4 file (see page 27) for the ECG signal data and a text file in WFDB header format describing the recording and patient attributes, including the diagnosis (the labels for the recording). The binary files can be read using the load function in MATLAB and the scipy.io.loadmat function in Python; please see our baseline models for examples of loading the data. The first line of the header provides information about the total number of leads and the total number of samples or points per lead. The following lines describe how each lead was saved, and the last lines provide information on demographics and diagnosis. Below is an example header file A0001.hea:

    A0001 12 500 7500 05-Feb-2020 11:39:16
    A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
    A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
    A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
    A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
    A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
    A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
    A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
    A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
    A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
    A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
    A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
    A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
    
    Age: 74
    Sex: Male
    Dx: 426783006
    Rx: Unknown
    Hx: Unknown
    Sx: Unknown
    

    From the first line, we see that the record name is A0001. The recording has 12 leads, each recorded at a 500 Hz sampling frequency, and contains 7500 samples per lead. From the next 12 lines, we see that each signal is stored in the file A0001.mat in 16-bit format with an offset of 24 bytes, the ADC gain is 1000 units per mV, the resolution of the analog-to-digital converter (ADC) used to digitize the signal is 16 bits, and the baseline value corresponding to 0 physical units is 0. The first value of the signal, the checksum, and the lead name are included for each signal. From the final 6 lines, we see that the patient is a 74-year-old male with a diagnosis (Dx) of 426783006. The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.
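    The header layout walked through above can be read with a few lines of plain Python. This is a minimal sketch rather than the Challenge's own loader: the function name parse_challenge_header and the dictionary keys are hypothetical, and it only handles headers shaped like the A0001.hea example (one record line, one line per lead, then "Key: value" lines).

```python
def parse_challenge_header(text):
    """Parse a Challenge-style WFDB header into a plain dictionary."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Record line: name, number of leads, sampling frequency, samples per lead.
    record, n_leads, fs, n_samples = lines[0].split()[:4]
    n_leads = int(n_leads)
    header = {
        "record": record,
        "n_leads": n_leads,
        "fs": float(fs),
        "n_samples": int(n_samples),
        "leads": [],
        "info": {},
    }

    # One signal line per lead, e.g. "A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I".
    for ln in lines[1:1 + n_leads]:
        f = ln.split()
        fmt, _, byte_offset = f[1].partition("+")
        gain, _, units = f[2].partition("/")
        header["leads"].append({
            "file": f[0],
            "format": int(fmt),
            "byte_offset": int(byte_offset or 0),
            "gain": float(gain),
            "units": units,
            "adc_resolution": int(f[3]),
            "adc_zero": int(f[4]),
            "initial_value": int(f[5]),
            "checksum": int(f[6]),
            "name": f[-1],
        })

    # Remaining lines: "Key: value" pairs (a leading '#' is tolerated).
    for ln in lines[1 + n_leads:]:
        key, sep, value = ln.lstrip("# ").partition(":")
        if sep:
            header["info"][key.strip()] = value.strip()
    return header
```

    With the parsed gain and baseline, a raw sample value can be converted to millivolts by subtracting the baseline and dividing by the gain.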

    Each ECG recording has one or more labels, given as SNOMED-CT codes, for different types of abnormalities. The full list of diagnoses for the challenge has been posted here as a 3-column CSV file: long-form description, corresponding SNOMED-CT code, abbreviation. Although these descriptions apply to all training data, there may be fewer classes in the test data, and in different proportions. However, every class in the test data will be represented in the training data.

  12. Protein Cross-Linking Database

    • neuinfo.org
    • scicrunch.org
    Updated Jan 29, 2022
    Cite
    (2022). Protein Cross-Linking Database [Dataset]. http://identifiers.org/RRID:SCR_021027/resolver?q=&i=rrid
    Dataset updated
    Jan 29, 2022
    Description

    Web application and database designed for sharing, visualizing, and analyzing protein cross-linking mass spectrometry data with emphasis on structural analysis and quality control. Includes public and private data sharing capabilities, project based interface designed to ensure security and facilitate collaboration among multiple researchers. Used for private collaboration and public data dissemination.

  13. Telecommunication Industry in Indonesia Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated May 2, 2025
    Cite
    Market Report Analytics (2025). Telecommunication Industry in Indonesia Report [Dataset]. https://www.marketreportanalytics.com/reports/telecommunication-industry-in-indonesia-91415
    Explore at:
    pdf, ppt, doc (available download formats)
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Indonesia, Global
    Variables measured
    Market Size
    Description

    The Indonesian telecommunications market, valued at $17.13 billion in 2025, exhibits robust growth potential, driven by increasing smartphone penetration, rising internet usage, and the expanding adoption of digital services. A Compound Annual Growth Rate (CAGR) of 5.76% is projected from 2025 to 2033, indicating a significant market expansion. Key growth drivers include the increasing demand for high-speed mobile broadband, the proliferation of over-the-top (OTT) platforms and pay-TV services, and government initiatives promoting digital infrastructure development. The market is segmented into Voice Services (wired and wireless), Data services, and OTT/Pay TV services, with significant competition among major players such as Telkom Indonesia, Indosat Ooredoo, and XL Axiata. These companies are strategically investing in network infrastructure upgrades, 5G deployment, and the development of innovative digital solutions to cater to the evolving consumer needs. While challenges such as infrastructure limitations in remote areas and regulatory hurdles exist, the overall market outlook remains positive, fueled by Indonesia's burgeoning digital economy and its large and young population. The competitive landscape is intense, with both established players and new entrants vying for market share. Differentiation strategies involve offering bundled packages, competitive pricing, and improving network quality and coverage. The increasing adoption of cloud-based services and the growing demand for enhanced cybersecurity solutions will also shape market dynamics in the coming years. The strong focus on digital transformation across various sectors will continue to fuel demand for advanced telecommunication services. Geographic expansion within Indonesia, particularly reaching underserved areas, and strategic partnerships will be crucial for sustained growth in the sector. Market penetration of 5G technology will be a significant factor influencing future market growth. The expansion of e-commerce and the government's focus on digitalization are predicted to boost the demand for data services considerably throughout the forecast period.

    Recent developments include:

    • March 2024: NEC Indonesia announced that it had signed a Memorandum of Understanding (MoU) with Telkom Indonesia to collaborate on developing smart cities in the new capital city of Ibu Kota Nusantara (IKN) and other cities in Indonesia. Under the MoU, Telkom Indonesia and NEC agreed to formulate a strategy, create a roadmap, design the architecture, and develop an implementation plan for smart city projects in Nusantara.

    • January 2024: Aviat Networks Inc., a provider of wireless transport and access solutions, announced a strategic collaboration with PT Smartfren Telecom Tbk. This partnership was established to offer high-speed, ultra-reliable wireless connectivity, including private wireless networks for indoor and outdoor environments. Additionally, the collaboration aims to deliver industry digitalization and automation services to private network clients throughout Indonesia.

    Key drivers for this market are: Increased Pace of 5G Roll Out, Digital Transformation Boosting Telecom.

    Potential restraints include: Increased Pace of 5G Roll Out, Digital Transformation Boosting Telecom.

    Notable trends are: Increased Pace of 5G Roll-out Driving the Market.
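    To see what the stated growth rate implies, the 2025 value can be compounded forward. The report does not give a 2033 figure, so the snippet below is only an illustrative calculation; it assumes annual compounding over the eight years from 2025 to 2033, and the function name project_value is hypothetical.

```python
def project_value(base_value, cagr, years):
    """Compound base_value forward by `years` at the annual rate `cagr`."""
    return base_value * (1.0 + cagr) ** years


# Figures quoted in the description: $17.13 billion in 2025, CAGR of 5.76%.
value_2033 = project_value(17.13, 0.0576, 2033 - 2025)
print(f"Implied 2033 market size: ${value_2033:.2f} billion")
```

    Under these assumptions the implied 2033 market size is roughly $26.8 billion.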

  14. DXLG Destination XL Group Inc. Common Stock (Forecast)

    • kappasignal.com
    Updated Jan 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    KappaSignal (2023). DXLG Destination XL Group Inc. Common Stock (Forecast) [Dataset]. https://www.kappasignal.com/2022/12/dxlg-destination-xl-group-inc-common.html
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    DXLG Destination XL Group Inc. Common Stock

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data

  15. Most popular mobile internet provider to access the internet Indonesia 2019

    • statista.com
    Updated Jul 11, 2025
    Cite
    Statista (2025). Most popular mobile internet provider to access the internet Indonesia 2019 [Dataset]. https://www.statista.com/statistics/1038212/indonesia-smartphone-brands-for-internet-access/
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 9, 2019 - Apr 4, 2019
    Area covered
    Indonesia
    Description

    According to a survey conducted in Indonesia in April 2019, ** percent of respondents stated that they used Telkomsel as their mobile internet provider to browse the internet. Indosat and XL were also popular mobile internet providers in Indonesia among the respondents.

    Indonesia is one of the biggest online markets worldwide. As of March 2017, online penetration in the country stood at only slightly over ** percent. Popular online activities include mobile messaging and social media.

  16. S1 Data -

    • plos.figshare.com
    xlsx
    Updated Oct 12, 2023
    Cite
    Tamrat Anbesaw; Yosef Zenebe; Mogessie Necho; Moges Gebresellassie; Tesfaye Segon; Fasikaw Kebede; Tilahun Bete (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0288597.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Tamrat Anbesaw; Yosef Zenebe; Mogessie Necho; Moges Gebresellassie; Tesfaye Segon; Fasikaw Kebede; Tilahun Bete
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Depression is the most common cause of disability in the world and affects 350 million people. University students struggle to cope with stressors that are typical of higher education institutions as well as anxiety related to education. Although evidence indicates that they have a high prevalence of depression, no review has comprehensively determined the prevalence of depression among students at Ethiopian universities.

    Methods: PubMed, Scopus, and EMBASE were searched without date restrictions. A manual search of article reference lists was also conducted. The Meta XL software was used to extract relevant data, and the Stata-11 meta-prop package was used to analyze it. The Higgins I² test was used to test for heterogeneity.

    Results: A search of the electronic and manual systems resulted in 940 articles. Data were extracted from ten studies included in this review, involving a total of 5207 university students. The pooled prevalence of depression was 28.13% (95% CI: 22.67, 33.59). In the sub-group analysis, the average prevalence was higher in studies with a smaller sample size (28.42%) than in studies with a larger sample size (27.70%), and higher in studies that used other instruments (PHQ-9, HADS; 30.67%) than in studies that used the BDI-II (26.07%). Being female (pooled AOR = 5.56; 95% CI: 1.51, 9.61), being a first-year student (pooled AOR = 4.78; 95% CI: 2.21, 7.36), chewing khat (pooled AOR = 2.83; 95% CI: 2.32, 3.33), alcohol use (pooled AOR = 3.12; 95% CI: 3.12, 4.01), and family history of mental illness (pooled AOR = 2.57; 95% CI: 2.00, 3.15) were factors significantly associated with depression.

    Conclusion: This systematic review and meta-analysis revealed that more than one-fourth of students at Ethiopian universities had depression. More effort is needed to provide better mental healthcare to university students in Ethiopia.

  17. PMcardio ECG Image Database (PM-ECG-ID): A Diverse ECG Database for...

    • zenodo.org
    zip
    Updated Aug 31, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrej Iring; Viera Krešňáková; Michal Hojcka; Vladimir Boza; Adam Rafajdus; Boris Vavrik (2024). PMcardio ECG Image Database (PM-ECG-ID): A Diverse ECG Database for Evaluating Digitization Solutions [Dataset]. http://doi.org/10.5281/zenodo.13617673
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrej Iring; Viera Krešňáková; Michal Hojcka; Vladimir Boza; Adam Rafajdus; Boris Vavrik
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    The dataset presents the collection of a diverse electrocardiogram (ECG) database for testing and evaluating ECG digitization solutions. The Powerful Medical ECG image database was curated using 100 ECG waveforms selected from the PTB-XL Digital Waveform Database and various images generated from the base waveforms with varying lead visibility and real-world paper deformations, including the use of different mobile phones, bends, crumbles, scans, and photos of computer screens with ECGs. The ECG waveforms were augmented using various techniques, including changes in contrast, brightness, perspective transformation, rotation, image blur, JPEG compression, and resolution change. This extensive approach yielded 6,000 unique entries, which provides a wide range of data variance and extreme cases to evaluate the limitations of ECG digitization solutions and improve their performance, and serves as a benchmark to evaluate ECG digitization solutions.

    PM-ECG-ID database contains electrocardiogram (ECG) images and their corresponding ECG information. The data records are organized in a hierarchical folder structure, which includes metadata, waveform data, and visual data folders. The contents of each folder are described below:

    • metadata.csv:
      This file serves as a key-to-key bridge between the image data and the corresponding ECG information. It contains the following columns:
      • Image name: image name with extension,
      • ECG ID: this ID corresponds to the specific ECG identifier from the original PTB-XL dataset. Under this ID you can find a cutout array in the leads.npz and rhythms.npz,
      • Image relative path: relative path to the image in question,
      • Image page: page number of the particular image (starting from 0),
      • ECG number of pages: number of pages in the whole ECG,
      • ECG number of columns per page: number of columns per page in the ECG,
      • ECG number of rows per page: number of rows per page in the ECG,
      • ECG number of rhythm leads: number of rhythm leads in the ECG,
      • ECG format: short version of the ECG format.
    • data folder:
      • leads.npz: NPZ file containing all underlying cutout lead signals; each signal is there under its ECG ID.
      • rhythms.npz: NPZ file containing all underlying rhythm signals; each signal is there under its ECG ID. If no rhythm lead is in the ECG, you will find an empty array in the NPZ.
    • visual_data folder:
      This folder contains subfolders for various image data, including augmented photos and visualization and different types of photos of ECG printouts. The subfolders are organized based on the specific augmentation or type of photograph. These folders contain images with various augmentation settings, such as different levels of blur, brightness, contrast, padding, perspective transformation, resolution scaling, and rotation. The database is organized in a way that allows for easy navigation and understanding of the different augmentations applied to the image data. Each of these subfolders contains images relevant to the specific augmentation or type of photograph. The metadata.csv file provides a direct link to each image and its associated ECG information.
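    The folder layout above suggests a simple lookup from an image to its underlying signals. The sketch below is an assumption-laden illustration, not code shipped with the database: the function name load_signals_for_image is hypothetical, the column headers "Image name" and "ECG ID" are assumed to appear in metadata.csv exactly as listed, and the NPZ keys are assumed to be the ECG ID written as a string.

```python
import csv

import numpy as np


def load_signals_for_image(metadata_csv, leads_npz, rhythms_npz, image_name):
    """Return (ecg_id, lead_signals, rhythm_signals) for one image name."""
    # Find the metadata row for the requested image.
    with open(metadata_csv, newline="", encoding="utf-8") as fh:
        row = next((r for r in csv.DictReader(fh) if r["Image name"] == image_name), None)
    if row is None:
        raise KeyError(f"{image_name!r} not found in {metadata_csv}")

    # Assumption: both NPZ archives are keyed by the ECG ID as a string.
    ecg_id = str(row["ECG ID"])
    with np.load(leads_npz) as leads, np.load(rhythms_npz) as rhythms:
        return ecg_id, leads[ecg_id], rhythms[ecg_id]
```

    An empty rhythm array in the result corresponds to an ECG without a rhythm lead, as described for rhythms.npz.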

Kun Hao Yeh (2021). PTB XL Dataset - Reformatted [Dataset]. https://www.kaggle.com/khyeh0719/ptb-xl-dataset-reformatted

PTB XL Dataset - Reformatted

PTB-XL, a large publicly available electrocardiography dataset

Explore at:
zip, 502936797 bytes (available download formats)
Dataset updated
Feb 23, 2021
Authors
Kun Hao Yeh
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Abstract

Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.

The PTB-XL ECG dataset is a large dataset of 21837 clinical 12-lead ECGs from 18885 patients of 10 second length. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The in total 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements as well as annotated signal properties.

Background

The waveform data underlying the PTB-XL ECG dataset was collected with devices from Schiller AG over the course of nearly seven years between October 1989 and June 1996. With the acquisition of the original database from Schiller AG, the full usage rights were transferred to the PTB. The records were curated and converted into a structured database within a long-term project at the Physikalisch-Technische Bundesanstalt (PTB). The database was used in a number of publications, see e.g. [1,2], but the access remained restricted until now. The Institutional Ethics Committee approved the publication of the anonymous data in an open-access database (PTB-2020-1). During the public release process in 2019, the existing database was streamlined with particular regard to usability and accessibility for the machine learning community. Waveform and metadata were converted to open data formats that can easily be processed by standard software.

Method

This dataset is generated by processing the raw dataset with this notebook.

Dataset Details

Files

  • train_12_lead_ecgs.pkl - ECG signals of the train set as a pickled NumPy array.
  • valid_12_lead_ecgs.pkl - ECG signals of the valid set as a pickled NumPy array.
  • test_12_lead_ecgs.pkl - ECG signals of the test set as a pickled NumPy array.
  • train_table.csv - patients' meta features and ECG diagnoses for the train set.
  • valid_table.csv - patients' meta features and ECG diagnoses for the valid set.
  • test_table.csv - patients' meta features and ECG diagnoses for the test set.

How to work with pickle files:

import pandas as pd

train_ecgs = pd.read_pickle('train_12_lead_ecgs.pkl') 

# train_ecgs is of shape (number of ECG records, 1000, 12)
# 1000 is signal data points for each ECG record 
# 12 stands for 12-channel from 12-lead

Columns

  • ecg_id - ID used in the raw data from https://www.kaggle.com/khyeh0719/ptb-xl-dataset and in the paper.
  • strat_fold - stratified fold as suggested in the paper.
  • age, sex, height, weight, nurse, site, device - patient's information.
  • NORM - Diagnosis for normal ECG.
  • MI - Diagnosis for Myocardial Infarction. A myocardial infarction (MI), commonly known as a heart attack, occurs when blood flow decreases or stops to a part of the heart, causing damage to the heart muscle.
  • STTC - Diagnosis for ST/T Change. ST and T wave changes may represent cardiac pathology or be a normal variant. Interpretation of the findings therefore depends on the clinical context and the presence of similar findings on prior electrocardiograms.
  • CD - Diagnosis for Conduction Disturbance. Your heart rhythm is the way your heart beats. Conduction is how electrical impulses travel through your heart, which causes it to beat. Some conduction disorders can cause arrhythmias or irregular heartbeats.
  • HYP - Diagnosis for Hypertrophy. Hypertrophic cardiomyopathy (HCM) is a disease in which the heart muscle becomes abnormally thick (hypertrophied). The thickened heart muscle can make it harder for the heart to pump blood.
  • sub_ - Columns with the 'sub_' prefix give more detailed ECG diagnoses.
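
The table files and the pickled arrays can be used together to pick out recordings with a given diagnosis. The sketch below rests on two assumptions that the description does not confirm: that row i of a table file describes ECG i of the matching array, and that diagnosis columns such as MI hold numeric 0/1 flags. The function name select_by_label is hypothetical.

```python
import csv

import numpy as np


def select_by_label(table_csv, ecgs, label="MI"):
    """Return the ECGs whose `label` column is positive in the table file."""
    with open(table_csv, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    # Assumption: the table rows and the ECG array share the same order.
    if len(rows) != len(ecgs):
        raise ValueError("table and ECG array have different lengths")

    # Assumption: the diagnosis column holds numeric flags (0 or 1).
    mask = np.array([float(r[label] or 0) > 0 for r in rows], dtype=bool)
    return np.asarray(ecgs)[mask]


# Possible usage with the files listed above:
#   import pandas as pd
#   train_ecgs = pd.read_pickle('train_12_lead_ecgs.pkl')
#   mi_ecgs = select_by_label('train_table.csv', train_ecgs, label='MI')
```

The result keeps the (1000, 12) shape of each selected record, so it can be passed on unchanged to a model or plotting routine.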