100+ datasets found
  1. News Category Dataset

    • kaggle.com
    zip
    Updated Sep 24, 2022
    Cite
    Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/
    Explore at:
    zip (27829769 bytes)
    Dataset updated
    Sep 24, 2022
    Authors
    Rishabh Misra
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Please cite the dataset using the BibTeX provided in one of the following sections if you are using it in your research. Thank you!

    This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

    Content

    Each record in the dataset consists of the following attributes:

    • category: category in which the article was published.
    • headline: the headline of the news article.
    • authors: list of authors who contributed to the article.
    • link: link to the original news article.
    • short_description: abstract of the news article.
    • date: publication date of the article.
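    The released file stores one JSON object per line with the attributes listed above. A minimal sketch of reading that format; the inline sample record and its values are illustrative, not taken from the download:

```python
import json
import io

# One JSON record per line, with the fields described above.
# (Inline sample; replace the StringIO object with the downloaded file.)
sample = io.StringIO(
    '{"category": "POLITICS", "headline": "Example headline", '
    '"authors": "Jane Doe", "link": "https://example.com/a", '
    '"short_description": "An example abstract.", "date": "2022-09-23"}\n'
)

records = [json.loads(line) for line in sample]
print(records[0]["category"])  # POLITICS
```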

    There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

    • POLITICS: 35602

    • WELLNESS: 17945

    • ENTERTAINMENT: 17362

    • TRAVEL: 9900

    • STYLE & BEAUTY: 9814

    • PARENTING: 8791

    • HEALTHY LIVING: 6694

    • QUEER VOICES: 6347

    • FOOD & DRINK: 6340

    • BUSINESS: 5992

    • COMEDY: 5400

    • SPORTS: 5077

    • BLACK VOICES: 4583

    • HOME & LIVING: 4320

    • PARENTS: 3955

    Citation

    If you're using this dataset for your work, please cite the following articles:

    Citation in text format:

    1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
    2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

    Citation in BibTeX format:

    @article{misra2022news,
      title={News Category Dataset},
      author={Misra, Rishabh},
      journal={arXiv preprint arXiv:2209.11429},
      year={2022}
    }

    @book{misra2021sculpting,
      author = {Misra, Rishabh and Grover, Jigyasa},
      year = {2021},
      month = {01},
      pages = {},
      title = {Sculpting Data for ML: The first act of Machine Learning},
      isbn = {9798585463570}
    }

    Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

    Acknowledgements

    This dataset was collected from HuffPost.

    Inspiration

    • Can you categorize news articles based on their headlines and short descriptions?

    • Do news articles from different categories have different writing styles?

    • A classifier trained on this dataset could be used on a free text to identify the type of language being used.

    Want to contribute your own datasets?

    If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

    Other datasets

    Please also check out the following datasets collected by me:

  2. Dataset for Machine Learning Assisted Citation Screening for Systematic...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    • +1more
    Updated Dec 22, 2023
    Cite
    Dhrangadhariya, Anjani (2023). Dataset for Machine Learning Assisted Citation Screening for Systematic Reviews [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10423426
    Explore at:
    Dataset updated
    Dec 22, 2023
    Dataset provided by
    Hilfiker, Roger
    Dhrangadhariya, Anjani
    Müller, Henning
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The work "Machine Learning Assisted Citation Screening for Systematic Reviews" explored the automation of citation screening using machine learning (ML), with the aim of accelerating the generation of systematic reviews. In the manual process, two reviewers screen the retrieved studies against predefined inclusion criteria: a study that meets the criteria is included for further analysis, otherwise it is excluded. As is apparent from the manual screening process, the work treated citation screening as a binary classification problem, whereby any ML classifier could be trained to separate the retrieved studies into these two classes (include and exclude).

    A physiotherapy citation screening dataset was used to test the automation approaches. It comprises the studies identified for citation screening in an update to the systematic review by Hilfiker et al.: titles and abstracts (citations) from 31,279 studies (25,540 after deduplication) identified during the search phase of this SR. These studies had already been manually assessed for relevance and assigned one of two mutually exclusive labels by two reviewers. The uploaded file consists of 25,540 data samples, one per line, in tab-separated format, structured as shown below. The include/exclude labels were assigned manually by Hilfiker et al.

    Title PMID Abstract Class MeSH terms (separated by a pipe)

    Structured exercise improves physical functioning in women with stages I and II breast cancer: results of a randomized controlled trial.
    11157015 Abstract PURPOSE: Self-directed and supervised exercise were compared with usual care in a clinical trial designed to evaluate the effect of structured exercise on physical functioning and other dimensions of health-related quality of life in women with stages I and II breast cancer. PATIENTS AND METHODS: One hundred twenty-three women with stages I and II breast cancer completed baseline evaluations of generic and disease- and site-specific health-related quality of life, aerobic capacity, and body weight. Participants were randomly allocated to one of three intervention groups: usual care (control group), self-directed exercise, or supervised exercise. Quality of life, aerobic capacity, and body weight measures were repeated at 26 weeks... include or exclude Clinical Trial | Comparative Study | Randomized Controlled Trial | Research Support, Non-U.S. Gov't | Antineoplastic Combined Chemotherapy Protocols | Breast Neoplasms | Breast Neoplasms | Breast Neoplasms | Chemotherapy, Adjuvant | Exercise | Female | Humans | Middle Aged | Neoplasm Staging | Quality of Life | Radiotherapy, Adjuvant
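    Given the tab-separated layout described above (Title, PMID, Abstract, Class, then pipe-separated MeSH terms), a row can be parsed with the standard csv module. The inline row below is a shortened, illustrative sample, not a verbatim record from the file:

```python
import csv
import io

# One tab-separated row per study: Title, PMID, Abstract, Class, MeSH terms
# (the MeSH terms are pipe-separated within the final field).
row_text = (
    "Structured exercise improves physical functioning\t11157015\t"
    "PURPOSE: ...\tinclude\tClinical Trial | Exercise | Humans\n"
)

reader = csv.reader(io.StringIO(row_text), delimiter="\t")
for title, pmid, abstract, label, mesh in reader:
    mesh_terms = [t.strip() for t in mesh.split("|")]
    print(label, mesh_terms)
```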

    If you use this dataset in your research, please cite our papers.

  3. PhishingWebsites

    • openml.org
    Updated Feb 16, 2016
    Cite
    Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae) (2016). PhishingWebsites [Dataset]. https://www.openml.org/d/4534
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2016
    Authors
    Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae)
    Description

    Author: Rami Mustafa A Mohammad (University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield, t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai, fadi '@' cud.ac.ae)
    Source: UCI
    Please cite: Please refer to the Machine Learning Repository's citation policy

    Source:

    Rami Mustafa A Mohammad (University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield, t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai, fadi '@' cud.ac.ae)

    Data Set Information:

    One of the challenges faced by our research was the unavailability of reliable training datasets; in fact, this challenge faces any researcher in the field. Although plenty of articles about predicting phishing websites have been disseminated, no reliable training dataset has been published publicly, perhaps because there is no agreement in the literature on the definitive features that characterize phishing webpages, which makes it difficult to shape a dataset covering all possible features. In this dataset, we shed light on the important features that have proved sound and effective in predicting phishing websites. In addition, we propose some new features.

    Attribute Information:

    For Further information about the features see the features file in the data folder of UCI.

    Relevant Papers:

    Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions. ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0

    Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643

    Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709

    Citation Request:

    Please refer to the Machine Learning Repository's citation policy

  4. UCI ML Parkinsons dataset

    • kaggle.com
    zip
    Updated Jul 8, 2025
    Cite
    Elnaz Alikarami (2025). UCI ML Parkinsons dataset [Dataset]. https://www.kaggle.com/datasets/elnazalikarami/uci-ml-parkinsons-dataset
    Explore at:
    zip (316796 bytes)
    Dataset updated
    Jul 8, 2025
    Authors
    Elnaz Alikarami
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Oxford Parkinson's Disease Detection Dataset, UCI Machine Learning Repository

    Dataset's original link: https://archive.ics.uci.edu/dataset/174/parkinsons

    Dataset Characteristics: Multivariate

    Subject Area: Health and Medicine

    Associated Tasks: Classification

    Feature Type: Real

    Instances: 197

    Features: 22

    Dataset Information

    This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

    The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).
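    A minimal sketch of splitting recordings by the "status" column (0 = healthy, 1 = PD). The inline sample rows are illustrative and carry only one of the 22 voice-measure columns; substitute the downloaded CSV for the StringIO object:

```python
import csv
import io

# Toy stand-in for the real CSV: name column, one voice measure, status.
sample = io.StringIO(
    "name,MDVP:Fo(Hz),status\n"
    "phon_R01_S01_1,119.992,1\n"
    "phon_R01_S10_1,110.739,0\n"
)

rows = list(csv.DictReader(sample))
pd_rows = [r for r in rows if r["status"] == "1"]   # Parkinson's recordings
healthy = [r for r in rows if r["status"] == "0"]   # healthy recordings
print(len(pd_rows), len(healthy))  # 1 1
```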

    Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

    Has Missing Values?

    No

  5. JARVIS ML Training Data

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Kamal Choudhary; Brian DeCost; Francesca Tavazza; Hacking Materials (2023). JARVIS ML Training Data [Dataset]. http://doi.org/10.6084/m9.figshare.7261598.v1
    Explore at:
    txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Kamal Choudhary; Brian DeCost; Francesca Tavazza; Hacking Materials
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals, taken from the JARVIS DFT database. This dataset was modified from the JARVIS ML training set developed by NIST (1-2). The custom descriptors have been removed, the column naming scheme revised, and a composition column created. This leaves the training set as a dataset of composition and structure descriptors mapped to a diverse set of materials properties.

    Available as Monty Encoder encoded JSON and as the source Monty Encoder encoded JSON file. The recommended access method is the matminer Python package, using its datasets module.

    Note on citations: if you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than, or in addition to, this page.

    Dataset discussed in: "Machine learning with force-field-inspired descriptors for materials: Fast screening and mapping energy landscape", Kamal Choudhary, Brian DeCost, and Francesca Tavazza, Phys. Rev. Materials 2, 083801.

    Original data file sourced from: Choudhary, Kamal (2018): JARVIS-ML-CFID-descriptors and material properties. figshare. Dataset.

  6. mushroom

    • openml.org
    Updated Apr 6, 2014
    Cite
    Jeff Schlimmer (2014). mushroom [Dataset]. https://www.openml.org/d/24
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2014
    Authors
    Jeff Schlimmer
    Description

    Author: Jeff Schlimmer
    Source: UCI - 1981
    Please cite: The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

    Description

    This dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.

    Source

    (a) Origin: 
    Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf 
    
    (b) Donor: 
    Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)
    

    Dataset description

    This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

    Attributes Information

    1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
    2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
    3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
    4. bruises?: bruises=t,no=f 
    5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
    6. gill-attachment: attached=a,descending=d,free=f,notched=n 
    7. gill-spacing: close=c,crowded=w,distant=d 
    8. gill-size: broad=b,narrow=n 
    9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
    10. stalk-shape: enlarging=e,tapering=t 
    11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
    14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
    15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
    16. veil-type: partial=p,universal=u 
    17. veil-color: brown=n,orange=o,white=w,yellow=y 
    18. ring-number: none=n,one=o,two=t 
    19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
    20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
    21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
    22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
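    Working with the single-letter codes above usually means mapping them back to their names. A minimal sketch for two of the attributes, extendable to the remaining columns (the maps below transcribe the code tables above):

```python
# Decode tables for two attributes, transcribed from the listing above.
CAP_SHAPE = {"b": "bell", "c": "conical", "x": "convex",
             "f": "flat", "k": "knobbed", "s": "sunken"}
ODOR = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy",
        "f": "foul", "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}

def decode(cap_shape_code, odor_code):
    """Map single-letter codes to their attribute names."""
    return CAP_SHAPE[cap_shape_code], ODOR[odor_code]

print(decode("x", "n"))  # ('convex', 'none')
```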
    

    Relevant papers

    Schlimmer, J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine.

    Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann.

    Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30, [Web Link]

    Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997.

  7. 3M+ Academic Papers: Titles & Abstracts

    • kaggle.com
    zip
    Updated Sep 18, 2025
    + more versions
    Cite
    David Arias (2025). 3M+ Academic Papers: Titles & Abstracts [Dataset]. https://www.kaggle.com/datasets/beta3logic/3m-academic-papers-titles-and-abstracts
    Explore at:
    zip (1478156333 bytes)
    Dataset updated
    Sep 18, 2025
    Authors
    David Arias
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Comprehensive Academic Papers Dataset: 3M+ Research Paper Titles and Abstracts

    📋 Overview

    This dataset is a comprehensive collection of over 3 million research paper titles and abstracts, curated and consolidated from multiple high-quality academic sources. The dataset provides a unified, clean, and standardized format for researchers, data scientists, and machine learning practitioners working on natural language processing, academic research analysis, and knowledge discovery tasks.

    🎯 Key Features

    • 3.6+ million scientific papers with titles and abstracts
    • Multi-domain coverage: Physics, Mathematics, Computer Science, Biology, Medicine, and more
    • Standardized format: Consistent title and abstract columns
    • Quality assured: Validated using Pydantic models and cleaned of duplicates/null values
    • Ready-to-use: Pre-processed and formatted for immediate analysis
    • Format: CSV
    • Language: English

    📊 Dataset Statistics

    Metric            Value
    Total Records     ~3,000,000+
    Columns           2 (title, abstract)
    File Size         4.15 GB
    Format            CSV
    Duplicates        Removed
    Missing Values    Removed

    🗂️ Dataset Structure

    cleaned_papers.csv
    ├── title (string): Scientific paper title
    └── abstract (string): Scientific paper abstract
    

    🔄 Data Processing Pipeline

    The dataset underwent a rigorous cleaning and standardization process:

    1. Data Import: Automated import from multiple sources (Kaggle API, Hugging Face)
    2. Column Standardization: Mapping various column names to consistent title and abstract format
    3. Data Validation: Pydantic model validation ensuring data quality
    4. Duplicate Removal: Advanced deduplication based on title and abstract similarity
    5. Null Value Handling: Removal of records with missing titles or abstracts
    6. Quality Assurance: Final validation and statistics generation
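    A toy sketch of the deduplication and null-handling steps (4 and 5) above, simplified to exact-match deduplication on plain Python records; the published pipeline uses Pydantic validation and similarity-based deduplication, which this does not reproduce:

```python
# Illustrative records only; the real pipeline streams a multi-GB CSV.
papers = [
    {"title": "A", "abstract": "First abstract."},
    {"title": "A", "abstract": "First abstract."},  # exact duplicate
    {"title": "B", "abstract": None},               # missing abstract
    {"title": "C", "abstract": "Third abstract."},
]

seen, cleaned = set(), []
for p in papers:
    key = (p["title"], p["abstract"])
    # Keep a record only if both fields are present and unseen so far.
    if p["title"] and p["abstract"] and key not in seen:
        seen.add(key)
        cleaned.append(p)

print(len(cleaned))  # 2
```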

    💡 Use Cases

    This dataset is ideal for:

    • Natural Language Processing: Text classification, sentiment analysis, topic modeling
    • Scientific Literature Analysis: Trend analysis, domain classification, citation prediction
    • Machine Learning Research: Training language models, text summarization, information extraction
    • Academic Research: Bibliometric analysis, research trend identification
    • Educational Applications: Building search engines, recommendation systems

    🔗 Data Sources and Attribution

    This dataset consolidates academic papers from the following sources:

    Kaggle Datasets:

    1. ArXiv Scientific Research Papers Dataset by @sumitm004
    2. Cornell University ArXiv Dataset by @Cornell-University

    Hugging Face Datasets:

    1. ML-ArXiv-Papers by @CShorten
    2. ArXiv Biology by @zeroshot
    3. ArXiv Data Extended by @wrapper228
    4. Stroke PubMed Abstracts by @Gaborandi
    5. PubMed ArXiv Abstracts Data by @brainchalov
    6. Abstracts Cleaned by @Eitanli

    🔄 Update Schedule

    This dataset represents a point-in-time consolidation. Future versions may include:

    • Additional academic sources
    • Extended fields (authors, publication dates, venues)
    • Domain-specific subsets
    • Enhanced metadata

    📄 License and Usage

    Please respect the individual licenses of the source datasets. This consolidated version is provided for research and educational purposes. When using this dataset:

    1. Citation: Please cite this dataset and acknowledge the original data sources
    2. Attribution: Credit the original dataset creators listed above
    3. Compliance: Ensure compliance with individual dataset licenses
    4. Academic Use: Primarily intended for non-commercial, academic, and research purposes

    🙏 Acknowledgments

    Special thanks to all the original dataset creators and the academic communities that make their research data publicly available. This work builds upon their valuable contributions to open science and knowledge sharing.

    Keywords: academic papers, research abstracts, NLP, machine learning, text mining, scientific literature, ArXiv, PubMed, natural language processing, research dataset

  8. Data sets and machine learning models for: Predicting critical properties of...

    • zenodo.org
    bin, zip
    Updated Oct 1, 2023
    Cite
    Sayandeep Biswas; Yunsie Chung; Yunsie Chung; Josephine Ramirez; Haoyang Wu; Haoyang Wu; William Green; William Green; Sayandeep Biswas; Josephine Ramirez (2023). Data sets and machine learning models for: Predicting critical properties of fluids using machine learning [Dataset]. http://doi.org/10.5281/zenodo.7804143
    Explore at:
    zip, bin
    Dataset updated
    Oct 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sayandeep Biswas; Yunsie Chung; Yunsie Chung; Josephine Ramirez; Haoyang Wu; Haoyang Wu; William Green; William Green; Sayandeep Biswas; Josephine Ramirez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The experimental data sets, data splits, additional features, QM calculations, model predictions, and final machine learning models for the manuscript "Predicting critical properties of fluids using multi-task machine learning". Citation should refer directly to the manuscript. (citation will be added soon)

    To use the machine learning models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop.

    Detailed information can be found in README.md file.

    Details on the properties considered

    The data set includes the following 8 properties:

    • Tc: critical temperature, in K
    • Pc: critical pressure, in bar
    • rhoc: critical density, in mol/L
    • omega: acentric factor, unitless
    • Tb: boiling point, in K
    • Tm: melting point, in K
    • dHvap: enthalpy of vaporization at boiling point, in kJ/mol
    • dHfus: enthalpy of fusion at melting point, in kJ/mol

    Details on the files

    1. Data sets under CritProp_v1.0.0:

    • all_data: includes the data sets used in this work. All data points are listed for each chemical compound as well as its corresponding data source. The details of the data sources can be found in the README.md file. The distribution of the data set is included in each folder.
      • estimated_data_for_pretraining: contains the estimated data from Yaws' handbook that are used to pre-train our machine learning (ML) model.
      • experimental_data: contains the experimental data used to fine-tune our ML model.
    • additional_features: includes the additional features tested for the ML model.
      • abraham: Abraham solute parameters (E, S, A, B, L). Molecular features.
      • acsf: ACSF (atom-centered symmetry functions). Atomic features that are converted from the 3D coordinates of the compound.
      • qm_atom: QM (quantum chemical) atomic feature.
      • qm_mol: QM molecular feature.
      • rdkit: Selected RDKit 2D molecular features.
    • data_splits_and_model_predictions: contains the training set and test set used for random and scaffold splits. It also contains the predicted values from our final ML model for each test set.

    2. Machine learning (ML) model files:

    • CritProp_ML_model_fiiles_with_abraham_feat.zip: contains the Chemprop ML model files that are trained using Abraham features as additional molecular features. This gives the best results.
    • CritProp_ML_model_fiiles_without_additional_feat.zip: contains the Chemprop ML model files that are trained without any additional features. This gives the second best results.

    To use these ML models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop

    3. QM (quantum chemical) calculations:

    • QM_calculations.zip: contains the results of the QM calculations that are performed to compute QM features.

  9. PMOA-CITE dataset

    • figshare.com
    zip
    Updated Jun 4, 2023
    Cite
    TONG ZENG (2023). PMOA-CITE dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12547574.v1
    Explore at:
    zip
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    TONG ZENG
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in the experiments for the paper "Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models". There are one million sentences in total, further split into training, validation, and testing sets (60%, 20%, and 20%, respectively). For the pre-processing of the dataset, please refer to the paper.

    The data are stored in JSONL format (each row is a JSON object). A couple of rows as an example:

    {"sec_name":"introduction","cur_sent_id":"12213838@0#3$0","next_sent_id":"12213838@0#3$1","cur_sent":"All three spectrin subunits are essential for normal development.","next_sent":"βH, encoded by the karst locus, is an essential protein that is required for epithelial morphogenesis .","cur_scaled_len_features":{"type":1,"values":[0.17716535433070865,0.13513513513513514]},"next_scaled_len_features":{"type":1,"values":[0.32677165354330706,0.35135135135135137]},"cur_has_citation":0,"next_has_citation":1}

    {"sec_name":"results","prev_sent_id":"12230634@1@1#0$2","cur_sent_id":"12230634@1@1#0$3","next_sent_id":"12230634@1@1#0$4","prev_sent":"μIU/ml at the 2.0-h postprandial time point.","cur_sent":"Statistically significant differences between the mean plasma insulin levels of dogs treated with 50 mg/kg of GSNO, and those treated with 50 mg/kg GSNO and vitamin C (50 mg/kg) were observed at the 1.0-h and 1.5-h time points (P < 0.05).","next_sent":"The mean plasma insulin concentrations in the dogs treated with 50 mg/kg of vitamin C and 50 mg/kg of GSNO, or 50 mg/kg of GSNO was significantly altered compared to those of controls or captopril-treated dogs (P < 0.05).","prev_scaled_len_features":{"type":1,"values":[0.09448818897637795,0.08108108108108109]},"cur_scaled_len_features":{"type":1,"values":[0.8582677165354331,1.0]},"next_scaled_len_features":{"type":1,"values":[0.7913385826771654,0.9459459459459459]},"prev_has_citation":0,"cur_has_citation":0,"next_has_citation":0}

    {"sec_name":"results","prev_sent_id":"12213837@1@0#3$3","cur_sent_id":"12213837@1@0#3$4","next_sent_id":"12213837@1@0#3$5","prev_sent":"Cleavage of VAMP2 by BoNT/D releases the NH2-terminal 59 amino acids from the protein and eliminates exocytosis.","cur_sent":"However, in this case, exocytosis cannot be recovered by addition of the cleaved fragment .","next_sent":"Peptides that exactly correspond to the BoNT/D cleavage site (VAMP2 aa 25–59 and 60–94-cys) were equally efficient at mediating liposome fusion (unpublished data).","prev_scaled_len_features":{"type":1,"values":[0.36220472440944884,0.35135135135135137]},"cur_scaled_len_features":{"type":1,"values":[0.2795275590551181,0.2972972972972973]},"next_scaled_len_features":{"type":1,"values":[0.562992125984252,0.5135135135135135]},"prev_has_citation":0,"cur_has_citation":1,"next_has_citation":0}

    For the code using this dataset to model citation worthiness, please refer to https://github.com/sciosci/cite-worthiness
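    Each row of the released files is one JSON object, so the standard json module is enough to recover the sentence pair and citation labels. The inline line below is a shortened, illustrative record using the field names shown in the example rows:

```python
import json

# Shortened illustrative JSONL row; the real records carry more fields.
line = (
    '{"sec_name": "introduction", "cur_sent_id": "12213838@0#3$0", '
    '"cur_sent": "All three spectrin subunits are essential.", '
    '"cur_has_citation": 0, "next_has_citation": 1}'
)

record = json.loads(line)
print(record["sec_name"], record["cur_has_citation"])  # introduction 0
```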

  10. arrhythmia

    • openml.org
    Updated Apr 6, 2014
    Cite
    H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu (2014). arrhythmia [Dataset]. https://www.openml.org/d/5
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2014
    Authors
    H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu
    Description

    Author: H. Altay Guvenir, Burak Acar, Haldun Muderrisoglu
    Source: UCI
    Please cite: UCI

    Cardiac Arrhythmia Database
    The aim is to determine the type of arrhythmia from the ECG recordings. This database contains 279 attributes, 206 of which are linear valued and the rest are nominal.

    Concerning the study of H. Altay Guvenir: "The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01 refers to 'normal' ECG, classes 02 to 15 refer to different classes of arrhythmia, and class 16 refers to the rest of the unclassified records. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard, we aim to minimize this difference by means of machine learning tools."

    The names and id numbers of the patients were recently removed from the database.

    Attribute Information

      1 Age: Age in years, linear
      2 Sex: Sex (0 = male; 1 = female), nominal
      3 Height: Height in centimeters, linear
      4 Weight: Weight in kilograms, linear
      5 QRS duration: Average of QRS duration in msec., linear
      6 P-R interval: Average duration between onset of P and Q waves in msec., linear
      7 Q-T interval: Average duration between onset of Q and offset of T waves in msec., linear
      8 T interval: Average duration of T wave in msec., linear
      9 P interval: Average duration of P wave in msec., linear
     Vector angles in degrees on front plane of:, linear
     10 QRS
     11 T
     12 P
     13 QRST
     14 J
     15 Heart rate: Number of heart beats per minute, linear
     Of channel DI:
      Average width, in msec., of: linear
      16 Q wave
      17 R wave
      18 S wave
      19 R' wave, small peak just after R
      20 S' wave
      21 Number of intrinsic deflections, linear
      22 Existence of ragged R wave, nominal
      23 Existence of diphasic derivation of R wave, nominal
      24 Existence of ragged P wave, nominal
      25 Existence of diphasic derivation of P wave, nominal
      26 Existence of ragged T wave, nominal
      27 Existence of diphasic derivation of T wave, nominal
     Of channel DII: 
      28 .. 39 (similar to 16 .. 27 of channel DI)
     Of channels DIII:
      40 .. 51
     Of channel AVR:
      52 .. 63
     Of channel AVL:
      64 .. 75
     Of channel AVF:
      76 .. 87
     Of channel V1:
      88 .. 99
     Of channel V2:
      100 .. 111
     Of channel V3:
      112 .. 123
     Of channel V4:
      124 .. 135
     Of channel V5:
      136 .. 147
     Of channel V6:
      148 .. 159
     Of channel DI:
      Amplitude, in units of 0.1 millivolt, of:
      160 JJ wave, linear
      161 Q wave, linear
      162 R wave, linear
      163 S wave, linear
      164 R' wave, linear
      165 S' wave, linear
      166 P wave, linear
      167 T wave, linear
      168 QRSA , Sum of areas of all segments divided by 10,
        ( Area= width * height / 2 ), linear
      169 QRSTA = QRSA + 0.5 * width of T wave * 0.1 * height of T
        wave. (If T is diphasic then the bigger segment is
        considered), linear
     Of channel DII:
      170 .. 179
     Of channel DIII:
      180 .. 189
     Of channel AVR:
      190 .. 199
     Of channel AVL:
      200 .. 209
     Of channel AVF:
      210 .. 219
     Of channel V1:
      220 .. 229
     Of channel V2:
      230 .. 239
     Of channel V3:
      240 .. 249
     Of channel V4:
      250 .. 259
     Of channel V5:
      260 .. 269
     Of channel V6:
      270 .. 279
    

    Class code - class - number of instances:

      01       Normal        245
      02       Ischemic changes (Coronary Artery Disease)  44
      03       Old Anterior Myocardial Infarction      15
      04       Old Inferior Myocardial Infarction      15
      05       Sinus tachycardia    13
      06       Sinus bradycardia    25
      07       Ventricular Premature Contraction (PVC)    3
      08       Supraventricular Premature Contraction    2
      09       Left bundle branch block     9 
      10       Right bundle branch block    50
      11       1. degree AtrioVentricular block    0 
      12       2. degree AV block        0
      13       3. degree AV block        0
      14       Left ventricular hypertrophy        4
      15       Atrial Fibrillation or Flutter        5
      16       Others         22
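
    As a quick sanity check on the table above, the per-class counts can be tallied in a few lines of Python; they sum to the database's 452 instances:

```python
# Per-class instance counts from the table above (keys are class codes).
class_counts = {
    1: 245,   # Normal
    2: 44,    # Ischemic changes (Coronary Artery Disease)
    3: 15,    # Old Anterior Myocardial Infarction
    4: 15,    # Old Inferior Myocardial Infarction
    5: 13,    # Sinus tachycardia
    6: 25,    # Sinus bradycardia
    7: 3,     # Ventricular Premature Contraction (PVC)
    8: 2,     # Supraventricular Premature Contraction
    9: 9,     # Left bundle branch block
    10: 50,   # Right bundle branch block
    11: 0, 12: 0, 13: 0,  # 1st/2nd/3rd degree AV block
    14: 4,    # Left ventricular hypertrophy
    15: 5,    # Atrial Fibrillation or Flutter
    16: 22,   # Others
}
total_instances = sum(class_counts.values())  # 452
</imports>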
    
  11. Galilee subregion groundwater usage estimates dataset v01

    • data.gov.au
    • researchdata.edu.au
    zip
    Updated Apr 13, 2022
    Cite
    Bioregional Assessment Program (2022). Galilee subregion groundwater usage estimates dataset v01 [Dataset]. https://data.gov.au/data/dataset/068065ad-b7ac-4197-8837-2d362770017a
    Explore at:
    zip(14200605)Available download formats
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Galilee
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    Do not use or publish: some source data used to create this version of the Galilee subregion groundwater usage estimates cannot be used and/or published by the Bioregional Assessments team due to licensing permissions. The restricted source data are not saved within the BA repository and are therefore not linked in the lineage. This version of the dataset has been replaced with the version 2 dataset, which has had the restricted source data removed: Galilee subregion groundwater usage estimates dataset v02 (GUID: 339532fb-2ba6-424a-87fb-70d35df12abf)

    This dataset was created to provide an estimate of yearly groundwater use from active bores in the Galilee subregion.

    Dataset History

    The majority of bores in the Galilee subregion do not operate under a groundwater licensing arrangement. Thus, summarising groundwater licence allocations can underestimate the annual groundwater withdrawals from aquifer systems. A dataset for estimating yearly groundwater use from all bores, in ML/year, was compiled using the following steps:

    1. compile a list of all bores in the Galilee subregion using data available from the Queensland groundwater database

    2. for each bore, where data are available incorporate interpreted stratigraphic picks for screened intervals into the dataset

    3. for each bore, where data are available, incorporate the most recent standing water level and bore maximum discharge data into the dataset. Maximum discharge will need to be re-calculated from L/second to ML/year

    4. from the water licence dataset, incorporate the licensed water allocation volume, bore use and GMA information

    5. incorporate relevant Great Artesian Basin Sustainability Initiative (GABSI) data into the dataset and ensure that information in GABSI dataset is accurately reflected by the bore facility status records

    6. investigate the bore facility status records. Bore facility status categories include: existing; abandoned but usable; abandoned and destroyed; and proposed. Only those classed as 'existing' or 'abandoned but usable' were kept in the dataset. It is assumed that bores in other categories are not functional

    7. interrogate bore use records. Remove any bore from the dataset that is tagged as a monitoring bore. It is assumed that monitoring bores are not being used for any purpose other than groundwater monitoring

    8. insert two new blank columns, 'BA groundwater usage' and 'groundwater use source' in the dataset. The 'BA groundwater usage' column is where the estimate for annual groundwater usage is recorded for a bore in ML/year. The 'groundwater use source' column is where the decision on how yearly groundwater usage is assigned is recorded

    9. populate the 'BA groundwater usage' and 'groundwater use source' columns.

    Queensland Government (Queensland Government, 2014, pers. comm.) provides some information on the estimation of annual water usage for groundwater bores in Queensland. Some steps for determining an estimate of groundwater usage for each bore are as follows:

    1. populate the BA groundwater usage column with water licence allocations that are greater than 0 ML/year. While the full allocation may not actually be used, this will provide a maximum allowable water allocation that could be pumped from a particular area. This has the potential to conserve the unused allocations when estimating groundwater usage for an area

    2. sub-artesian bores - Queensland Government (Queensland Government, 2014, pers. comm.) suggests 5 ML/year or bore maximum flow rate, whichever is least

    3. controlled Artesian Bore - Queensland Government (Queensland Government, 2014, pers. comm.) suggests 30 ML/year or the maximum flow rate, whichever is least

    4. uncontrolled Artesian Bore - Queensland Government (Queensland Government, 2014, pers. comm.) suggests use the flow rate in ML/year

    5. uncontrolled artesian bores missing flow rate and standing water level information - the average flow rate for all uncontrolled artesian bores located within the Galilee subregion was calculated from existing data. The average was then assigned as nominal value for uncontrolled artesian bores with no flow rate data. For Galilee subregion this equated to 124 ML/year.
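
    The assignment rules above can be sketched in Python. The field names and the bore-record structure here are hypothetical; only the thresholds (5, 30, and 124 ML/year) and the L/second to ML/year conversion come from the text:

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000 s

def litres_per_sec_to_ml_per_year(flow_l_s):
    """Convert a bore flow rate from L/second to ML/year (1 ML = 1e6 L)."""
    return flow_l_s * SECONDS_PER_YEAR / 1e6

def estimate_usage(bore):
    """Return (usage in ML/year, 'groundwater use source' note) for one bore."""
    allocation = bore.get("licence_allocation_ml_year") or 0
    max_flow = bore.get("max_flow_l_s")
    max_flow_ml = litres_per_sec_to_ml_per_year(max_flow) if max_flow else None

    # Rule 1: a licence allocation > 0 ML/year is used directly.
    if allocation > 0:
        return allocation, "licence allocation"
    kind = bore.get("bore_type")
    # Rule 2: sub-artesian, 5 ML/year or max flow rate, whichever is least.
    if kind == "sub-artesian":
        return (min(5.0, max_flow_ml) if max_flow_ml else 5.0), "sub-artesian rule"
    # Rule 3: controlled artesian, 30 ML/year or max flow, whichever is least.
    if kind == "controlled artesian":
        return (min(30.0, max_flow_ml) if max_flow_ml else 30.0), "controlled artesian rule"
    # Rules 4-5: uncontrolled artesian uses the flow rate, or the 124 ML/year
    # subregion average when flow data are missing.
    if kind == "uncontrolled artesian":
        return (max_flow_ml if max_flow_ml else 124.0), "uncontrolled artesian rule"
    return 0.0, "no rule matched"
```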

    Dataset Citation

    Geoscience Australia (XXXX) Galilee subregion groundwater usage estimates dataset v01. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/068065ad-b7ac-4197-8837-2d362770017a.

    Dataset Ancestors

  12. Data from: Annotated dataset of simulated voiding sound for urine flow...

    • springernature.figshare.com
    • portaldelaciencia.uva.es
    application/x-rar
    Updated Jun 13, 2025
    Cite
    Marcos Lazaro Alvarez; Laura Arjona; Alfonso Bahillo; Ganeko Bernardo (2025). Annotated dataset of simulated voiding sound for urine flow estimation [Dataset]. http://doi.org/10.6084/m9.figshare.27606642.v1
    Explore at:
    application/x-rarAvailable download formats
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    figshare
    Authors
    Marcos Lazaro Alvarez; Laura Arjona; Alfonso Bahillo; Ganeko Bernardo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset of simulated voiding sound for urine flow estimation

    Overview

    This repository contains a dataset of synthetic urination audio signals generated under controlled conditions. The dataset is intended for research in sound-based uroflowmetry (SU) and the development of machine learning models for voiding flow estimation. The audio samples were generated using a high-precision peristaltic pump to simulate flow rates between 1 and 50 ml/s and were recorded with three different microphone devices.

    Experimental Setup

    The audio recordings were conducted in a bathroom setup where the synthetic urine stream generated by the peristaltic pump was directed into a standard ceramic toilet containing a fixed volume of water at the bottom, ensuring that the sound was produced by the interaction of the liquid stream with the water surface. This configuration mimics realistic voiding conditions and allows the captured audio to resemble actual urination events.

    Dataset Structure

    Each audio file is a 60-second segment labeled with its corresponding flow rate (in ml/s). The naming convention is:

    [device]_f_[flow]_[duration]s.wav

    • device: UM (Ultramic384k), Phone (Mi A1), or Watch (Oppo Smartwatch)
    • flow: flow rate from 1 to 50 ml/s
    • duration: fixed to 60 seconds for all files

    Example:

    um_f_20_60s.wav phone_f_45_60s.wav watch_f_10_60s.wav
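
    The naming convention is regular enough to parse mechanically; a minimal sketch (the metadata keys returned here are made up for illustration):

```python
import re

# Matches the convention [device]_f_[flow]_[duration]s.wav described above.
PATTERN = re.compile(r"(?i)^(um|phone|watch)_f_(\d+)_(\d+)s\.wav$")

def parse_filename(name):
    """Extract device, flow rate (ml/s), and duration (s) from a file name."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    device, flow, duration = m.groups()
    return {"device": device.lower(), "flow_ml_s": int(flow), "duration_s": int(duration)}
```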

    Silence Reference Recordings

    In addition to the voiding audio files, each device folder includes a 30-second silence recording. These recordings were captured in the same environment and setup, but without any synthetic flow, allowing for baseline noise analysis. They serve as a reference to evaluate the background noise characteristics of each device and to support preprocessing techniques such as noise reduction or signal enhancement.

    Filename format:

    [device]_f_0_30s.wav

    Example:

    • um_f_0s.wav
    • phone_f_0s.wav
    • oppo_f_0s.wav

    Purpose

    The goal of this dataset is to provide a standardized audio repository for the development, training and validation of machine learning algorithms for voiding flow prediction. This enables researchers to:

    • Benchmark different approaches on a common dataset
    • Develop flow estimation models using synthetic audio before transferring them to real-world applications
    • Explore the spectral and temporal structure of urination-related audio signals

    Flow Generation

    • Pump Used: L600-1F precision peristaltic pump
    • Flow Range: 1–50 ml/s (based on ICS-reported ranges for male uroflowmetry)
    • Calibration: Pump flows were validated using a graduated cylinder
    • Noise Isolation: The pump was placed in a separate room (via 15m silicone tubing) to eliminate pump noise from recordings

    Recording Devices

    Device | Sampling Rate | Frequency Range | Description
    UM | 192 kHz | 0–96 kHz | High-quality ultrasonic microphone
    Phone | 48 kHz | 0–24 kHz | Android smartphone (Mi A1)
    Watch | 44.1 kHz | 0–22.05 kHz | Oppo Smartwatch with built-in mic

    Each recording was carried out using a custom mobile or desktop app with preset parameters.

    Recording Environment

    • Recordings were made in a bathroom with a standard ceramic toilet containing water at the bottom.
    • The nozzle height varied between 73–86 cm depending on flow rate to ensure consistent water impact.
    • Microphone heights:
      • UM: 84 cm
      • Phone: 95 cm
      • Watch: 86 cm (simulating wrist height)

    Data Collection Protocol

    1. Pump activated with flow set from 1 to 50 ml/s.
    2. Audio recorded simultaneously with UM, Phone and Watch for 80 seconds.
    3. Initial 15 seconds and final 5 seconds trimmed to retain 60 seconds of steady-state urination sound.
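
    The trimming in step 3 amounts to dropping a fixed number of samples at each end of the recording; a minimal sketch (function and parameter names are my own):

```python
def trim_steady_state(samples, rate_hz, lead_s=15, tail_s=5):
    """Drop the first `lead_s` and last `tail_s` seconds of a recording,
    as in step 3 above (80 s raw -> 60 s of steady-state sound)."""
    return samples[int(lead_s * rate_hz): len(samples) - int(tail_s * rate_hz)]
```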

    Citation

    If you use this dataset in your work, please cite the associated paper:

    M. L. Alvarez et al., “Annotated dataset of simulated voiding sound for urine flow estimation”, 2025. (Pending publication)

    License

    This dataset is made available for research purposes under a CC BY license.

    Contact

    For questions, please contact:

    Marcos Lazaro Alvarez Faculty of Engineering, University of Deusto
    alvarez.marcoslazaro@deusto.es

  13. kr-vs-kp

    • openml.org
    Updated Apr 6, 2014
    Cite
    Alen Shapiro (2014). kr-vs-kp [Dataset]. https://www.openml.org/d/3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2014
    Authors
    Alen Shapiro
    Description

    Author: Alen Shapiro
    Source: UCI
    Please cite: UCI citation policy

    1. Title: Chess End-Game -- King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7). The pawn on a7 means it is one square away from queening. It is the King+Rook's side (white) to move.

    2. Sources: (a) Database originally generated and described by Alen Shapiro. (b) Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk). (c) Date: 1 August 1989

    3. Past Usage:

    • Alen D. Shapiro (1983, 1987), "Structured Induction in Expert Systems", Addison-Wesley. This book is based on Shapiro's Ph.D. thesis (1983) at the University of Edinburgh, entitled "The Role of Structured Induction in Expert Systems".

    • Stephen Muggleton (1987), "Structuring Knowledge by Asking Questions", pp. 218-229 in "Progress in Machine Learning", edited by I. Bratko and Nada Lavrac, Sigma Press, Wilmslow, England SK9 5BB.

    • Robert C. Holte, Liane Acker, and Bruce W. Porter (1989), "Concept Learning and the Problem of Small Disjuncts", Proceedings of IJCAI. Also available as technical report AI89-106, Computer Sciences Department, University of Texas at Austin, Austin, Texas 78712.

    4. Relevant Information: The dataset format is described below. Note: the format of this database was modified on 2/26/90 to conform with the format of all the other databases in the UCI repository of machine learning databases.

    5. Number of Instances: 3196 total

    6. Number of Attributes: 36

    7. Attribute Summaries: Classes (2): White-can-win ("won") and White-cannot-win ("nowin"). I believe that White is deemed to be unable to win if the Black pawn can safely advance. Attributes: see Shapiro's book.

    8. Missing Attributes: none

    9. Class Distribution: In 1669 of the positions (52%), White can win. In 1527 of the positions (48%), White cannot win.

    The format for instances in this database is a sequence of 37 attribute values. Each instance is a board-description for this chess endgame. The first 36 attributes describe the board. The last (37th) attribute is the classification: "win" or "nowin". There are 0 missing values. A typical board-description is

    f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won

    The names of the features do not appear in the board-descriptions. Instead, each feature corresponds to a particular position in the feature-value list. For example, the head of this list is the value for the feature "bkblk". The following is the list of features, in the order in which their values appear in the feature-value list:

    [bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd, hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr, skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg]

    In the file, there is one instance (board position) per line.

    Num Instances: 3196
    Num Attributes: 37
    Num Continuous: 0 (Int 0 / Real 0)
    Num Discrete: 37
    Missing values: 0 / 0.0%
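
    The fixed feature order makes parsing straightforward; a minimal sketch using the feature list given above:

```python
# Feature names in the order their values appear in each board-description.
FEATURES = [
    "bkblk", "bknwy", "bkon8", "bkona", "bkspr", "bkxbq", "bkxcr", "bkxwp",
    "blxwp", "bxqsq", "cntxt", "dsopp", "dwipd", "hdchk", "katri", "mulch",
    "qxmsq", "r2ar8", "reskd", "reskr", "rimmx", "rkxwp", "rxmsq", "simpl",
    "skach", "skewr", "skrxp", "spcop", "stlmt", "thrsk", "wkcti", "wkna8",
    "wknck", "wkovl", "wkpos", "wtoeg",
]

def parse_instance(line):
    """Split one board-description line into (feature dict, class label)."""
    values = line.strip().split(",")
    assert len(values) == 37, "expected 36 board attributes plus the class label"
    return dict(zip(FEATURES, values[:36])), values[36]
```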

  14. IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Aug 30, 2024
    Cite
    José Areia; José Areia; Ivo Afonso Bispo; Ivo Afonso Bispo; Leonel Santos; Leonel Santos; Rogério Luís Costa; Rogério Luís Costa (2024). IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT [Dataset]. http://doi.org/10.5281/zenodo.8116338
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    José Areia; José Areia; Ivo Afonso Bispo; Ivo Afonso Bispo; Leonel Santos; Leonel Santos; Rogério Luís Costa; Rogério Luís Costa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Article Information

    The work involved in developing the dataset and benchmarking its use of machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.

    Please do cite the aforementioned article when using this dataset.

    Abstract

    The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and fed it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.

    ZIP Folder Content

    The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.

    To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.

    This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
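
    The practical difference between the two formats can be seen in a short pandas sketch. The toy frame and file names below are made up; the real files follow the folder layout described above. CSV is portable and tool-agnostic, while pickle preserves pandas dtypes and nested structures exactly:

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for one of the flow tables (file names are illustrative only).
df = pd.DataFrame({"proto": ["tcp", "udp"], "orig_bytes": [120, 64]})

with tempfile.TemporaryDirectory() as d:
    csv_path = os.path.join(d, "flows.csv")
    pkl_path = os.path.join(d, "flows.pkl")
    df.to_csv(csv_path, index=False)  # portable, tool-agnostic text format
    df.to_pickle(pkl_path)            # preserves dtypes and structures exactly
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

roundtrip_ok = from_pkl.equals(df)
```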

    Datasets' Content

    Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.

    Identified Key Features Within Bluetooth Dataset

    Feature | Meaning
    btle.advertising_header | BLE Advertising Packet Header
    btle.advertising_header.ch_sel | BLE Advertising Channel Selection Algorithm
    btle.advertising_header.length | BLE Advertising Length
    btle.advertising_header.pdu_type | BLE Advertising PDU Type
    btle.advertising_header.randomized_rx | BLE Advertising Rx Address
    btle.advertising_header.randomized_tx | BLE Advertising Tx Address
    btle.advertising_header.rfu.1 | Reserved For Future 1
    btle.advertising_header.rfu.2 | Reserved For Future 2
    btle.advertising_header.rfu.3 | Reserved For Future 3
    btle.advertising_header.rfu.4 | Reserved For Future 4
    btle.control.instant | Instant Value Within a BLE Control Packet
    btle.crc.incorrect | Incorrect CRC
    btle.extended_advertising | Advertiser Data Information
    btle.extended_advertising.did | Advertiser Data Identifier
    btle.extended_advertising.sid | Advertiser Set Identifier
    btle.length | BLE Length
    frame.cap_len | Frame Length Stored Into the Capture File
    frame.interface_id | Interface ID
    frame.len | Frame Length on the Wire
    nordic_ble.board_id | Board ID
    nordic_ble.channel | Channel Index
    nordic_ble.crcok | Indicates if CRC is Correct
    nordic_ble.flags | Flags
    nordic_ble.packet_counter | Packet Counter
    nordic_ble.packet_time | Packet time (start to end)
    nordic_ble.phy | PHY
    nordic_ble.protover | Protocol Version

    Identified Key Features Within IP-Based Packets Dataset

    Feature | Meaning
    http.content_length | Length of content in an HTTP response
    http.request | HTTP request being made
    http.response.code | HTTP response status code
    http.response_number | Sequential number of an HTTP response
    http.time | Time taken for an HTTP transaction
    tcp.analysis.initial_rtt | Initial round-trip time for TCP connection
    tcp.connection.fin | TCP connection termination with a FIN flag
    tcp.connection.syn | TCP connection initiation with SYN flag
    tcp.connection.synack | TCP connection establishment with SYN-ACK flags
    tcp.flags.cwr | Congestion Window Reduced flag in TCP
    tcp.flags.ecn | Explicit Congestion Notification flag in TCP
    tcp.flags.fin | FIN flag in TCP
    tcp.flags.ns | Nonce Sum flag in TCP
    tcp.flags.res | Reserved flags in TCP
    tcp.flags.syn | SYN flag in TCP
    tcp.flags.urg | Urgent flag in TCP
    tcp.urgent_pointer | Pointer to urgent data in TCP
    ip.frag_offset | Fragment offset in IP packets
    eth.dst.ig | Ethernet destination is in the internal network group
    eth.src.ig | Ethernet source is in the internal network group
    eth.src.lg | Ethernet source is in the local network group
    eth.src_not_group | Ethernet source is not in any network group
    arp.isannouncement | Indicates if an ARP message is an announcement

    Identified Key Features Within IP-Based Flows Dataset

    Feature | Meaning
    proto | Transport layer protocol of the connection
    service | Identification of an application protocol
    orig_bytes | Originator payload bytes
    resp_bytes | Responder payload bytes
    history | Connection state history
    orig_pkts | Originator sent packets
    resp_pkts | Responder sent packets
    flow_duration | Length of the flow in seconds
    fwd_pkts_tot | Forward packets total
    bwd_pkts_tot | Backward packets total
    fwd_data_pkts_tot | Forward data packets total
    bwd_data_pkts_tot | Backward data packets total
    fwd_pkts_per_sec | Forward packets per second
    bwd_pkts_per_sec | Backward packets per second
    flow_pkts_per_sec | Flow packets per second
    fwd_header_size | Forward header bytes
    bwd_header_size | Backward header bytes
    fwd_pkts_payload | Forward payload bytes
    bwd_pkts_payload | Backward payload bytes
    flow_pkts_payload | Flow payload bytes
    fwd_iat | Forward inter-arrival time
    bwd_iat | Backward inter-arrival time
    flow_iat | Flow inter-arrival time
    active | Flow active duration
  15. Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT...

    • zenodo.org
    bin, text/x-python
    Updated Mar 5, 2025
    Cite
    Stefan Ganscha; Stefan Ganscha; Oliver T. Unke; Oliver T. Unke; Daniel Ahlin; Daniel Ahlin; Hartmut Maennel; Hartmut Maennel; Sergii Kashubin; Sergii Kashubin; Klaus-Robert Mueller; Klaus-Robert Mueller (2025). Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations [Dataset]. http://doi.org/10.5281/zenodo.14859804
    Explore at:
    text/x-python, binAvailable download formats
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Stefan Ganscha; Stefan Ganscha; Oliver T. Unke; Oliver T. Unke; Daniel Ahlin; Daniel Ahlin; Hartmut Maennel; Hartmut Maennel; Sergii Kashubin; Sergii Kashubin; Klaus-Robert Mueller; Klaus-Robert Mueller
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data that was used to train the model. In this work, we introduce the QCML dataset: A comprehensive dataset for training ML models for quantum chemistry. The QCML dataset systematically covers chemical space with small molecules consisting of up to 8 heavy atoms and includes elements from a large fraction of the periodic table, as well as different electronic states. Starting from chemical graphs, conformer search and normal mode sampling are used to generate both equilibrium and off-equilibrium 3D structures, for which various properties are calculated with semi-empirical methods (14.7 billion entries) and density functional theory (33.5 million entries). The covered properties include energies, forces, multipole moments, and other quantities, e.g. Kohn-Sham matrices. We provide a first demonstration of the utility of our dataset by training ML-based force fields on the data and applying them to run molecular dynamics simulations.

    The data is available as a TensorFlow dataset (TFDS) and can be accessed from the publicly available Google Cloud Storage bucket at gs://qcml-datasets/tfds/. (See "Directory structure" below.)

    For information on different access options (command-line tools, client libraries, etc), please see https://cloud.google.com/storage/docs/access-public-data.

    Directory structure

    • gs://qcml-datasets (GCS Bucket)
      • tfds (TFDS data directory)
        • qcml (TFDS dataset name)
          • dft_atomic_numbers (TFDS builder config name)
            • 1.0.0 (Current version)
              • dataset_info.json
              • features.json
              • qcml-full.tfrecord-X-of-Y (TFDS data shards, see below)
          • ...
          • dft_positions
          • xtb_all
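
    The layout above maps directly onto a TFDS builder directory. A small helper can assemble the path for a given builder config (the function name is my own; the bucket, layout, and version come from this page):

```python
# Assemble the builder directory: bucket / TFDS data dir / dataset name /
# builder config / version, following the directory structure above.
def qcml_builder_dir(config, version="1.0.0"):
    return f"gs://qcml-datasets/tfds/qcml/{config}/{version}"

# With the tensorflow_datasets package installed, the data could then be
# opened directly from GCS via tfds.builder_from_directory(
# qcml_builder_dir("dft_positions")) (assumption: standard TFDS layout;
# tfds is not imported here to keep the sketch dependency-free).
```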

    Builder configurations

    Format: Builder config name: number of shards (rounded total size)

    Semi-empirical calculations:

    • xtb_all: 85000 (69 TB)

    DFT calculations:

    • dft_atomic_numbers: 11 (3 GB)
    • dft_d4_atomic_charges: 11 (4 GB)
    • dft_d4_c6_coefficients: 11 (4 GB)
    • dft_d4_correction: 11 (8 GB)
    • dft_d4_energy: 11 (2 GB)
    • dft_d4_forces: 11 (7 GB)
    • dft_d4_polarizabilities: 11 (4 GB)
    • dft_force_field: 11 (18 GB)
    • dft_force_field_d4: 110 (24 GB)
    • dft_force_field_mbd: 110 (24 GB)
    • dft_gfn0_dipole: 11 (3 GB)
    • dft_gfn0_eeq_charges: 11 (4 GB)
    • dft_gfn0_energy: 11 (2 GB)
    • dft_gfn0_forces: 11 (7 GB)
    • dft_gfn0_formation_energy: 11 (3 GB)
    • dft_gfn0_orbital_energies_a: 11 (8 GB)
    • dft_gfn0_orbital_occupations_a: 11 (8 GB)
    • dft_gfn0_wiberg_bond_orders: 110 (29 GB)
    • dft_gfn2_dipole: 11 (3 GB)
    • dft_gfn2_energy: 11 (2 GB)
    • dft_gfn2_forces: 11 (7 GB)
    • dft_gfn2_formation_energy: 11 (3 GB)
    • dft_gfn2_mulliken_charges: 11 (4 GB)
    • dft_gfn2_orbital_energies_a: 11 (7 GB)
    • dft_gfn2_orbital_occupations_a: 11 (7 GB)
    • dft_gfn2_wiberg_bond_orders: 110 (29 GB)
    • dft_is_outlier: 11 (2 GB)
    • dft_mbd_c6_coefficients: 11 (4 GB)
    • dft_mbd_correction: 11 (8 GB)
    • dft_mbd_energy: 11 (2 GB)
    • dft_mbd_forces: 11 (7 GB)
    • dft_mbd_polarizabilities: 11 (4 GB)
    • dft_metadata: 11 (11 GB)
    • dft_multipole_moments: 11 (8 GB)
    • dft_pbe0_core_hamiltonian_matrix: 110000 (30 TB)
    • dft_pbe0_density_matrix_a: 110000 (30 TB)
    • dft_pbe0_density_matrix_b: 110000 (3 TB)
    • dft_pbe0_dipole: 11 (3 GB)
    • dft_pbe0_electronic_free_energy: 11 (3 GB)
    • dft_pbe0_energy: 11 (2 GB)
    • dft_pbe0_forces: 11 (7 GB)
    • dft_pbe0_formation_energy: 11 (3 GB)
    • dft_pbe0_grid_density_a: 110000 (27 TB)
    • dft_pbe0_grid_density_b: 110000 (3 TB)
    • dft_pbe0_grid_density_gradient_a: 110000 (81 TB)
    • dft_pbe0_grid_density_gradient_b: 110000 (10 TB)
    • dft_pbe0_grid_density_laplacian_a: 110000 (27 TB)
    • dft_pbe0_grid_density_laplacian_b: 110000 (3 TB)
    • dft_pbe0_grid_kinetic_energy_density_a: 110000 (27 TB)
    • dft_pbe0_grid_kinetic_energy_density_b: 110000 (3 TB)
    • dft_pbe0_grid_points: 110000 (81 TB)
    • dft_pbe0_grid_weight: 110000 (27 TB)
    • dft_pbe0_guid: 11 (3 GB)
    • dft_pbe0_hamiltonian_matrix_a: 110000 (30 TB)
    • dft_pbe0_hamiltonian_matrix_b: 110000 (3 TB)
    • dft_pbe0_has_equal_a_b_electrons: 11 (3 GB)
    • dft_pbe0_hexadecapole: 11 (3 GB)
    • dft_pbe0_hirshfeld_charges: 11 (4 GB)
    • dft_pbe0_hirshfeld_dipoles: 11 (8 GB)
    • dft_pbe0_hirshfeld_quadrupoles: 11 (11 GB)
    • dft_pbe0_hirshfeld_spins: 11 (3 GB)
    • dft_pbe0_hirshfeld_volume_ratios: 11 (4 GB)
    • dft_pbe0_hirshfeld_volumes: 11 (4 GB)
    • dft_pbe0_loewdin_charges: 11 (4 GB)
    • dft_pbe0_loewdin_spins: 11 (3 GB)
    • dft_pbe0_mulliken_charges: 11 (4 GB)
    • dft_pbe0_mulliken_spins: 11 (3 GB)
    • dft_pbe0_num_scf_iterations: 11 (3 GB)
    • dft_pbe0_octupole: 11 (3 GB)
    • dft_pbe0_orbital_coefficients_a: 110000 (30 TB)
    • dft_pbe0_orbital_coefficients_b: 110000 (3 TB)
    • dft_pbe0_orbital_energies_a: 110 (44 GB)
    • dft_pbe0_orbital_energies_b: 11 (8 GB)
    • dft_pbe0_orbital_occupations_a: 110 (44 GB)
    • dft_pbe0_orbital_occupations_b: 11 (8 GB)
    • dft_pbe0_overlap_matrix: 110000 (30 TB)
    • dft_pbe0_quadrupole: 11 (3 GB)
    • dft_pbe0_zero_broadening_corrected_energy: 11 (3 GB)
    • dft_population_analysis: 11 (19 GB)
    • dft_positions: 11 (7 GB)
  16. PineTime heart rate dataset

    • zenodo.org
    zip
    Updated Aug 8, 2023
    Piotr Sowiński; Monika Kobus; Anna Dąbrowska (2023). PineTime heart rate dataset [Dataset]. http://doi.org/10.5281/zenodo.8220127
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Piotr Sowiński; Monika Kobus; Anna Dąbrowska
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of heart rate measurements collected from the PineTime wristband, with a gold standard reference.

    Contents

    The repository contains both the raw and the "merged", clean data. The merged data is much easier to work with and should be used when building machine learning models. The raw data is provided for transparency, reproducibility, and to allow for studies that could use the other data collected from the Equivital device.

    • schedule.md – schedule of the study, indicating the start and end times of each exercise and break.
    • data_raw/ – raw data collected from the PineTime wristband and the Equivital device. Each subdirectory corresponds to one participant. The files are in the Feather format.
    • data_merged/ – merged data series that can be used for building ML models. The files are in JSON format and follow a nested structure, where each heart rate measurement is associated with a series of acceleration measurements that preceded it. Each file corresponds to one continuous measurement session – there are sometimes multiple sessions per participant due to intermittent hardware failures.
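    The nested layout of the merged series can be sketched as follows. Note that the key names used here ("heart_rate", "accel", "x"/"y"/"z") are assumptions for illustration only; check the actual files in data_merged/ for the real schema.

```python
import json

# A toy session in the nested layout described above: each heart rate
# measurement carries the acceleration samples that preceded it.
# NOTE: the key names are assumptions, not the dataset's documented schema.
sample_session = json.loads("""
[
  {"heart_rate": 72, "accel": [{"x": 0.01, "y": -0.02, "z": 0.98},
                               {"x": 0.02, "y": 0.00, "z": 0.97}]},
  {"heart_rate": 75, "accel": [{"x": 0.05, "y": 0.01, "z": 0.95}]}
]
""")

def mean_heart_rate(session):
    """Average heart rate over one continuous measurement session."""
    return sum(m["heart_rate"] for m in session) / len(session)

print(mean_heart_rate(sample_session))  # 73.5
```

    Because each file corresponds to one continuous session, per-session statistics like this can be computed file by file without further splitting.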

    Citation

    If you use this data in research works, please cite the following paper:

    Sowiński, P., Rachwał, K., Danilenka, A., Bogacka, K., Kobus, M., Dąbrowska, A., Paszkiewicz, A., et al. (2023). Frugal Heart Rate Correction Method for Scalable Health and Safety Monitoring in Construction Sites. Sensors, 23(14), 6464. MDPI AG. Retrieved from http://dx.doi.org/10.3390/s23146464

    BibTeX:

    @article{sowinski2023frugal,
     title={Frugal Heart Rate Correction Method for Scalable Health and Safety Monitoring in Construction Sites},
     author={Sowi{\'n}ski, Piotr and Rachwa{\l}, Kajetan and Danilenka, Anastasiya and Bogacka, Karolina and Kobus, Monika and D{\k{a}}browska, Anna and Paszkiewicz, Andrzej and Bolanowski, Marek and Ganzha, Maria and Paprzycki, Marcin},
     journal={Sensors},
     volume={23},
     number={14},
     pages={6464},
     year={2023},
     publisher={MDPI},
     url = {https://www.mdpi.com/1424-8220/23/14/6464},
     doi = {10.3390/s23146464}
    }

    Acknowledgements

    This work is part of the ASSIST-IoT project that has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreement No 957258.

    The Central Institute for Labour Protection – National Research Institute provided facilities and equipment for data collection.

    License

    The dataset is licensed under the Creative Commons Attribution 4.0 International License.

  17. E-Commerce Customer Behavior & Sales Analysis -TR

    • kaggle.com
    zip
    Updated Oct 29, 2025
    UmutUygurr (2025). E-Commerce Customer Behavior & Sales Analysis -TR [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/e-commerce-customer-behavior-and-sales-analysis-tr
    Explore at:
    zip(138245 bytes)Available download formats
    Dataset updated
    Oct 29, 2025
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🛒 E-Commerce Customer Behavior and Sales Dataset

    📊 Dataset Overview

    This comprehensive dataset contains 5,000 e-commerce transactions from a Turkish online retail platform, spanning January 2023 to March 2024. It provides detailed insights into customer demographics, purchasing behavior, product preferences, and engagement metrics.

    🎯 Use Cases

    This dataset is well suited for:

    • Customer Segmentation Analysis: identify distinct customer groups based on behavior
    • Sales Forecasting: predict future sales trends and patterns
    • Recommendation Systems: build product recommendation engines
    • Customer Lifetime Value (CLV) Prediction: estimate customer value
    • Churn Analysis: identify customers at risk of leaving
    • Marketing Campaign Optimization: target customers effectively
    • Price Optimization: analyze price sensitivity across categories
    • Delivery Performance Analysis: optimize logistics and shipping

    📁 Dataset Structure

    The dataset contains 18 columns with the following features:

    Order Information
    • Order_ID: unique identifier for each order (ORD_XXXXXX format)
    • Date: transaction date (2023-01-01 to 2024-03-26)

    Customer Demographics
    • Customer_ID: unique customer identifier (CUST_XXXXX format)
    • Age: customer age (18-75 years)
    • Gender: customer gender (Male, Female, Other)
    • City: customer city (10 major Turkish cities)

    Product Information
    • Product_Category: 8 categories (Electronics, Fashion, Home & Garden, Sports, Books, Beauty, Toys, Food)
    • Unit_Price: price per unit (in TRY/Turkish lira)
    • Quantity: number of units purchased (1-5)

    Transaction Details
    • Discount_Amount: discount applied (if any)
    • Total_Amount: final transaction amount after discount
    • Payment_Method: payment method used (5 types)

    Customer Behavior Metrics
    • Device_Type: device used for purchase (Mobile, Desktop, Tablet)
    • Session_Duration_Minutes: time spent on the website (1-120 minutes)
    • Pages_Viewed: number of pages viewed during the session (1-50)
    • Is_Returning_Customer: whether the customer has purchased before (True/False)

    Post-Purchase Metrics
    • Delivery_Time_Days: delivery duration (1-30 days)
    • Customer_Rating: customer satisfaction rating (1-5 stars)

    📈 Key Statistics

    • Total Records: 5,000 transactions
    • Date Range: January 2023 - March 2024 (15 months)
    • Average Transaction Value: ~450 TRY
    • Customer Satisfaction: 3.9/5.0 average rating
    • Returning Customer Rate: 60%
    • Mobile Usage: 55% of transactions

    🔍 Data Quality

    ✅ No missing values
    ✅ Consistent formatting across all fields
    ✅ Realistic data distributions
    ✅ Proper data types for all columns
    ✅ Logical relationships between features

    💡 Sample Analysis Ideas

    • Customer Segmentation with K-Means Clustering: segment customers based on spending, frequency, and recency
    • Sales Trend Analysis: identify seasonal patterns and peak shopping periods
    • Product Category Performance: compare revenue, ratings, and return rates across categories
    • Device-Based Behavior Analysis: understand how device choice affects purchasing patterns
    • Predictive Modeling: build models to predict customer ratings or purchase amounts
    • City-Level Market Analysis: compare market performance across different cities

    🛠️ Technical Details

    • File Format: CSV (comma-separated values)
    • Encoding: UTF-8
    • File Size: ~500 KB
    • Delimiter: comma (,)

    📚 Column Descriptions

    • Order_ID (String): unique order identifier, e.g. ORD_001337
    • Customer_ID (String): unique customer identifier, e.g. CUST_01337
    • Date (DateTime): transaction date, e.g. 2023-06-15
    • Age (Integer): customer age, e.g. 35
    • Gender (String): customer gender, e.g. Female
    • City (String): customer city, e.g. Istanbul
    • Product_Category (String): product category, e.g. Electronics
    • Unit_Price (Float): price per unit, e.g. 1299.99
    • Quantity (Integer): units purchased, e.g. 2
    • Discount_Amount (Float): discount applied, e.g. 129.99
    • Total_Amount (Float): final amount paid, e.g. 2469.99
    • Payment_Method (String): payment method, e.g. Credit Card
    • Device_Type (String): device used, e.g. Mobile
    • Session_Duration_Minutes (Integer): session time, e.g. 15
    • Pages_Viewed (Integer): pages viewed, e.g. 8
    • Is_Returning_Customer (Boolean): returning customer, e.g. True
    • Delivery_Time_Days (Integer): delivery duration, e.g. 3
    • Customer_Rating (Integer): satisfaction rating, e.g. 5

    🎓 Learning Outcomes

    By working with this dataset, you can learn:

    • Data cleaning and preprocessing techniques
    • Exploratory Data Analysis (EDA) with Python/R
    • Statistical analysis and hypothesis testing
    • Machine learning model development
    • Data visualization best practices
    • Business intelligence and reporting

    📝 Citation

    If you use this dataset in your research or project, please cite:

    E-Commerce Customer Behavior and Sales Dataset (2024). Turkish Online Retail Platform Data (2023-2024). Available on Kaggle.

    ⚖️ License

    This dataset is released under the CC0: Public Domain license. You are free to use it for any purpose.
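    As a minimal sketch of the product-category performance analysis suggested above, using the documented column names on toy rows (not the real file):

```python
import pandas as pd

# Toy rows mirroring the documented schema; values are illustrative only.
df = pd.DataFrame({
    "Product_Category": ["Electronics", "Fashion", "Electronics"],
    "Total_Amount": [2469.99, 350.00, 1299.99],
    "Customer_Rating": [5, 4, 3],
})

# Revenue and mean rating per category: a first step toward comparing
# performance across the 8 product categories.
summary = df.groupby("Product_Category").agg(
    revenue=("Total_Amount", "sum"),
    mean_rating=("Customer_Rating", "mean"),
)
print(summary)
```

    Reading the actual CSV with pd.read_csv and parsing the Date column would extend the same pattern to the sales-trend and seasonality analyses.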

    🤝 Contribution Found any issues or have suggestions? Feel free to provide feedback!

    📞 Contact For questions or collaborations, please reach out through Kaggle.

    Happy Analyzing! 🚀

    Keywords: e-c...

  18. 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Sep 25, 2023
    Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix (2023). 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning: Slices 2,001-3,000 (reference reconstructions and segmentations) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8017611
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    University of Manchester
    Leiden University
    Centrum Wiskunde & Informatica
    Authors
    Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains the reference reconstructions and segmentation of slices 2,001 – 3,000 from the data collection described in

    Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023)

    Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."

    The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels (74.8 μm² each). To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the components to be moved independently of one another.

    Please refer to the paper for all further technical details.

    The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD. The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.

    The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.

    Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you encounter a zipbomb error when unzipping the file on a Linux system, set the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable (for example, add “export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE” to your .bashrc) and rerun the command.

    For more information or guidance in using the data collection, please get in touch with

    Maximilian.Kiss [at] cwi.nl
    
    
    Felix.Lucka [at] cwi.nl
    
  19. 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...

    • data.niaid.nih.gov
    Updated Sep 25, 2023
    + more versions
    Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix (2023). 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning: Slices 3,001-4,000 (reference reconstructions and segmentations) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8017617
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    University of Manchester
    Leiden University
    Centrum Wiskunde & Informatica
    Authors
    Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains the reference reconstructions and segmentation of slices 3,001 – 4,000 from the data collection described in

    Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023)

    Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."

    The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels (74.8 μm² each). To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the components to be moved independently of one another.

    Please refer to the paper for all further technical details.

    The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD. The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.

    The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.

    Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you encounter a zipbomb error when unzipping the file on a Linux system, set the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable (for example, add “export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE” to your .bashrc) and rerun the command.

    For more information or guidance in using the data collection, please get in touch with

    Maximilian.Kiss [at] cwi.nl
    
    
    Felix.Lucka [at] cwi.nl
    
  20. Data from: Large-scale integration of single-cell transcriptomic data...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +2more
    zip
    Updated Dec 14, 2021
    David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove (2021). Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration [Dataset]. http://doi.org/10.5061/dryad.t4b8gtj34
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 14, 2021
    Dataset provided by
    Cornell University
    Authors
    David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. 
We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.

    Methods Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute of Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.

    Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).

    Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 10^6 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.

    Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL6/J mice were injected with 10µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). Optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using a Zeiss PALM MicroBeam laser capture microdissection system and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1=28bp, Read 2=120bp, Index 1=10bp, and Index 2=10bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loupe Browser v4.0.0 software (10x Genomics). Sequencing data was then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline to generate a feature-by-spot-barcode expression matrix (10x Genomics).

    Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).

    Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCounts and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset, individually. BCmvn optimization was used for PK parameterization. Estimated doublet rates were computed by fitting the total number of cells after quality filtering to a linear regression of the expected doublet rates published in the 10x Chromium handbook. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default PN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch-correction with three tools, independently- Harmony (github.com/immunogenomics/harmony) (v1.0), Scanorama (github.com/brianhie/scanorama) (v1.3), and BBKNN (github.com/Teichlab/bbknn) (v1.3.12). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.
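    The cell-level quality filter described above can be sketched as follows. The actual pipeline applies these thresholds in Seurat (R); this Python version only illustrates the same cutoffs on toy per-cell QC metrics.

```python
import numpy as np

# Toy per-cell QC metrics; in the real pipeline these come from the
# aligned count matrix after ambient-RNA removal with SoupX.
n_features = np.array([500, 1200, 3000, 800])   # unique genes per cell
n_counts = np.array([900, 2500, 8000, 1500])    # transcripts per cell
pct_mito = np.array([0.05, 0.35, 0.10, 0.20])   # mitochondrial fraction

# Keep cells with >= 750 features, >= 1000 transcripts, and at most 30%
# of unique transcripts from mitochondrial genes, as described above.
keep = (n_features >= 750) & (n_counts >= 1000) & (pct_mito <= 0.30)
print(keep)  # [False False  True  True]
```

    Cells failing any one of the three thresholds are dropped before doublet detection with DoubletFinder.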

    Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.

    Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).

    Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myocytes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the percentage of cells within the cluster which express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using

Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/

News Category Dataset

Identify the type of news based on headlines and short descriptions

Explore at:
44 scholarly articles cite this dataset (View in Google Scholar)
zip(27829769 bytes)Available download formats
Dataset updated
Sep 24, 2022
Authors
Rishabh Misra
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Context

** Please cite the dataset using the BibTex provided in one of the following sections if you are using it in your research, thank you! **

This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

Content

Each record in the dataset consists of the following attributes:

  • category: category in which the article was published.

  • headline: the headline of the news article.

  • authors: list of authors who contributed to the article.

  • link: link to the original news article.

  • short_description: abstract of the news article.

  • date: publication date of the article.
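
A record with these attributes can be illustrated with a short sketch. The dataset is commonly distributed as JSON Lines (one JSON object per line), which the sketch assumes; the field values below are invented:

```python
import json

# One line of a JSON Lines file, with invented values for illustration.
line = ('{"category": "POLITICS", '
        '"headline": "Example Headline", '
        '"authors": "Jane Doe", '
        '"link": "https://www.huffpost.com/entry/example", '
        '"short_description": "A short abstract of the article.", '
        '"date": "2018-05-26"}')

record = json.loads(line)
print(record["category"], record["date"])  # -> POLITICS 2018-05-26
```

To load the full dataset, iterate over the file line by line and call json.loads on each line.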

There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

  • POLITICS: 35602

  • WELLNESS: 17945

  • ENTERTAINMENT: 17362

  • TRAVEL: 9900

  • STYLE & BEAUTY: 9814

  • PARENTING: 8791

  • HEALTHY LIVING: 6694

  • QUEER VOICES: 6347

  • FOOD & DRINK: 6340

  • BUSINESS: 5992

  • COMEDY: 5400

  • SPORTS: 5077

  • BLACK VOICES: 4583

  • HOME & LIVING: 4320

  • PARENTS: 3955
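
Counts like those above can be reproduced by tallying the category field across all records. A minimal sketch, using a tiny invented in-memory sample in place of the full JSON Lines file:

```python
import json
from collections import Counter

# Invented sample standing in for the full file, which is commonly
# distributed as one JSON object per line.
sample_lines = [
    '{"category": "POLITICS"}',
    '{"category": "POLITICS"}',
    '{"category": "WELLNESS"}',
]

counts = Counter(json.loads(line)["category"] for line in sample_lines)
print(counts.most_common())  # most frequent categories first
```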

Citation

If you're using this dataset for your work, please cite the following articles:

Citation in text format:

  1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
  2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

Citation in BibTeX format:

@article{misra2022news,
  title   = {News Category Dataset},
  author  = {Misra, Rishabh},
  journal = {arXiv preprint arXiv:2209.11429},
  year    = {2022}
}

@book{misra2021sculpting,
  author = {Misra, Rishabh and Grover, Jigyasa},
  title  = {Sculpting Data for ML: The first act of Machine Learning},
  year   = {2021},
  month  = {01},
  isbn   = {9798585463570}
}

Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

Acknowledgements

This dataset was collected from HuffPost.

Inspiration

  • Can you categorize news articles based on their headlines and short descriptions?

  • Do news articles from different categories have different writing styles?

  • A classifier trained on this dataset could be applied to free text to identify the type of language being used.
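
As a sketch of the first question, here is a toy multinomial Naive Bayes headline classifier. The training headlines are invented and the model is deliberately minimal; a real experiment would train on the full dataset with an established library such as scikit-learn:

```python
import math
from collections import Counter, defaultdict

# Invented toy training set: (headline, category) pairs.
train = [
    ("senate passes new budget bill", "POLITICS"),
    ("president signs trade agreement", "POLITICS"),
    ("ten tips for better sleep", "WELLNESS"),
    ("how meditation reduces stress", "WELLNESS"),
]

word_counts = defaultdict(Counter)   # per-category token counts
class_counts = Counter()             # per-category headline counts
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the category with the highest smoothed log-probability."""
    tokens = text.split()
    best, best_lp = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / sum(class_counts.values()))
        for t in tokens:  # Laplace smoothing over the shared vocabulary
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("new bill in the senate"))  # -> POLITICS
```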

Want to contribute your own datasets?

If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

Other datasets

Please also check out the following datasets collected by me:
