100+ datasets found

News Category Dataset
kaggle.com
zip
Updated Sep 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/
Explore at:
zip(27829769 bytes)Available download formats
Dataset updated
Sep 24, 2022
Authors
Rishabh Misra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

** Please cite the dataset using the BibTex provided in one of the following sections if you are using it in your research, thank you! **

This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

Content

Each record in the dataset consists of the following attributes: - category: category in which the article was published. - headline: the headline of the news article. - authors: list of authors who contributed to the article. - link: link to the original news article. - short_description: Abstract of the news article. - date: publication date of the article.

There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

POLITICS: 35602

WELLNESS: 17945

ENTERTAINMENT: 17362

TRAVEL: 9900

STYLE & BEAUTY: 9814

PARENTING: 8791

HEALTHY LIVING: 6694

QUEER VOICES: 6347

FOOD & DRINK: 6340

BUSINESS: 5992

COMEDY: 5400

SPORTS: 5077

BLACK VOICES: 4583

HOME & LIVING: 4320

PARENTS: 3955

Citation

If you're using this dataset for your work, please cite the following articles:

Citation in text format: 1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022). 2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

Citation in BibTex format: @article{misra2022news, title={News Category Dataset}, author={Misra, Rishabh}, journal={arXiv preprint arXiv:2209.11429}, year={2022} } @book{misra2021sculpting, author = {Misra, Rishabh and Grover, Jigyasa}, year = {2021}, month = {01}, pages = {}, title = {Sculpting Data for ML: The first act of Machine Learning}, isbn = {9798585463570} }

Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

Acknowledgements

This dataset was collected from HuffPost.

Inspiration

Can you categorize news articles based on their headlines and short descriptions?

Do news articles from different categories have different writing styles?

A classifier trained on this dataset could be used on a free text to identify the type of language being used.

Want to contribute your own datasets?

If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

Other datasets

Please also checkout the following datasets collected by me:

News Headlines Dataset For Sarcasm Detection

Clothing Fit Dataset for Size Recommendation

IMDB Spoiler Dataset

Politifact Fact Check Dataset
Z
Dataset for Machine Learning Assisted Citation Screening for Systematic...
nde-dev.biothings.io
data.niaid.nih.gov
+1more
Updated Dec 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dhrangadhariya, Anjani (2023). Dataset for Machine Learning Assisted Citation Screening for Systematic Reviews [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10423426
Explore at:
Dataset updated
Dec 22, 2023
Dataset provided by
Hilfiker, Roger
Dhrangadhariya, Anjani
Müller, Henning
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The work "Machine Learning Assisted Citation Screening for Systematic Reviews" explored the problem of citation screening automation using machine-learning (ML) with an aim to accelerate the process of generating systematic reviews. Manual process of citation screening involve two reviewers manually screening the searched studies using a predefined inclusion criteria. If the study passes the "inclusion" criteria, it is included for further analysis or is excluded. As apparant through manual screening process, the work considered citation screening as a binary classification problem whereby any ML classifier could be trained to separate the searched studies into these two classes (include and exclude).

A physiotherapy citation screening dataset was used to test automation approaches and the dataset includes the studies identified for citation screening in an update to the systematic review by Hilfiker et al. The dataset included titles and abstracts (citations) from 31,279 (deduplicated: 25,540) studies identified during the search phase of this SR. These studies were already manually assessed for relevance and labelled by two reviewers into two mutually exclusive labels. The uploaded file consists of 25,540 data samples, with each data sample separated by a new line. It is a tab separated file and the data in it is structured as shown below. This dataset was manually labelled into include and exclude by Hilfiker et al.

Title PMID Abstract Class MeSH terms (separated by a pipe)

Structured exercise improves physical functioning in women with stages I and II breast cancer: results of a randomized controlled trial.
11157015 Abstract PURPOSE: Self-directed and supervised exercise were compared with usual care in a clinical trial designed to evaluate the effect of structured exercise on physical functioning and other dimensions of health-related quality of life in women with stages I and II breast cancer. PATIENTS AND METHODS: One hundred twenty-three women with stages I and II breast cancer completed baseline evaluations of generic and disease- and site-specific health-related quality of life, aerobic capacity, and body weight. Participants were randomly allocated to one of three intervention groups: usual care (control group), self-directed exercise, or supervised exercise. Quality of life, aerobic capacity, and body weight measures were repeated at 26 weeks... include or exclude Clinical Trial | Comparative Study | Randomized Controlled Trial | Research Support, Non-U.S. Gov't | Antineoplastic Combined Chemotherapy Protocols | Breast Neoplasms | Breast Neoplasms | Breast Neoplasms | Chemotherapy, Adjuvant | Exercise | Female | Humans | Middle Aged | Neoplasm Staging | Quality of Life | Radiotherapy, Adjuvant

If you use this dataset in your research, please cite our papers.
o
PhishingWebsites
openml.org
Updated Feb 16, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae) (2016). PhishingWebsites [Dataset]. https://www.openml.org/d/4534
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 16, 2016
Authors
Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae)
Description
Author: Rami Mustafa A Mohammad ( University of Huddersfield","rami.mohammad '@' hud.ac.uk","rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield","t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai","fadi '@' cud.ac.ae)
Source: UCI
Please cite: Please refer to the Machine Learning Repository's citation policy

Source:

Rami Mustafa A Mohammad ( University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield,t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai,fadi '@' cud.ac.ae)

Data Set Information:

One of the challenges faced by our research was the unavailability of reliable training datasets. In fact this challenge faces any researcher in the field. However, although plenty of articles about predicting phishing websites have been disseminated these days, no reliable training dataset has been published publically, may be because there is no agreement in literature on the definitive features that characterize phishing webpages, hence it is difficult to shape a dataset that covers all possible features. In this dataset, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features.

Attribute Information:

For Further information about the features see the features file in the data folder of UCI.

Relevant Papers:

Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conferece For Internet Technology And Secured Transactions. ICITST 2012 . IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0

Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643

Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709

Citation Request:

Please refer to the Machine Learning Repository's citation policy
UCI ML Parkinsons dataset
kaggle.com
zip
Updated Jul 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elnaz Alikarami (2025). UCI ML Parkinsons dataset [Dataset]. https://www.kaggle.com/datasets/elnazalikarami/uci-ml-parkinsons-dataset
Explore at:
zip(316796 bytes)Available download formats
Dataset updated
Jul 8, 2025
Authors
Elnaz Alikarami
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Oxford Parkinson's Disease Detection Dataset UCI Machine Learning Repository

dataset's original link : https://archive.ics.uci.edu/dataset/174/parkinsons

Dataset Characteristics Multivariate

Subject Area Health and Medicine

Associated Tasks Classification

Feature Type Real

Instances

197

Features

22

Dataset Information Additional Information

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to "status" column which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around six recordings per patient, the name of the patient is identified in the first column.For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

Has Missing Values?

No
JARVIS ML Training Data
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kamal Choudhary; Brian DeCost; Francesca Tavazza; Hacking Materials (2023). JARVIS ML Training Data [Dataset]. http://doi.org/10.6084/m9.figshare.7261598.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7261598.v1
Dataset updated
May 30, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Kamal Choudhary; Brian DeCost; Francesca Tavazza; Hacking Materials
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database. This dataset was modified from the JARVIS ML training set developed by NIST (1-2). The custom descriptors have been removed, the column naming scheme revised, and a composition column created. This leaves the training set as a dataset of composition and structure descriptors mapped to a diverse set of materials properties.Available as Monty Encoder encoded JSON and as the source Monty Encoder encoded JSON file. Recommended access method is with the matminer Python package using the datasets module.Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page.Dataset discussed in: Machine learning with force-field-inspired descriptors for materials: Fast screening and mapping energy landscape Kamal Choudhary, Brian DeCost, and Francesca Tavazza Phys. Rev. Materials 2, 083801Original Data file sourced from:choudhary, kamal (2018): JARVIS-ML-CFID-descriptors and material properties. figshare. Dataset.
o
mushroom
openml.org
Updated Apr 6, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeff Schlimmer (2014). mushroom [Dataset]. https://www.openml.org/d/24
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 6, 2014
Authors
Jeff Schlimmer
Description
Author: Jeff Schlimmer
Source: UCI - 1981
Please cite: The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

Description

This dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.

Source

(a) Origin: Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)

Dataset description

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.

Attributes Information

1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 4. bruises?: bruises=t,no=f 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 6. gill-attachment: attached=a,descending=d,free=f,notched=n 7. gill-spacing: close=c,crowded=w,distant=d 8. gill-size: broad=b,narrow=n 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 10. stalk-shape: enlarging=e,tapering=t 11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 16. veil-type: partial=p,universal=u 17. veil-color: brown=n,orange=o,white=w,yellow=y 18. ring-number: none=n,one=o,two=t 19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

Relevant papers

Schlimmer,J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral disseration, Department of Information and Computer Science, University of California, Irvine.

Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann.

Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30, [Web Link]

Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997.
3M+ Academic Papers: Titles & Abstracts
kaggle.com
zip
Updated Sep 18, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Arias (2025). 3M+ Academic Papers: Titles & Abstracts [Dataset]. https://www.kaggle.com/datasets/beta3logic/3m-academic-papers-titles-and-abstracts
Explore at:
zip(1478156333 bytes)Available download formats
Dataset updated
Sep 18, 2025
Authors
David Arias
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Comprehensive Academic Papers Dataset: 3M+ Research Paper Titles and Abstracts

📋 Overview

This dataset is a comprehensive collection of over 3 million research paper titles and abstracts, curated and consolidated from multiple high-quality academic sources. The dataset provides a unified, clean, and standardized format for researchers, data scientists, and machine learning practitioners working on natural language processing, academic research analysis, and knowledge discovery tasks.

🎯 Key Features

3.6+ million scientific papers with titles and abstracts

Multi-domain coverage: Physics, Mathematics, Computer Science, Biology, Medicine, and more

Standardized format: Consistent title and abstract columns

Quality assured: Validated using Pydantic models and cleaned of duplicates/null values

Ready-to-use: Pre-processed and formatted for immediate analysis

Format: CSV

Language: English

📊 Dataset Statistics

Metric Value
Total Records ~3,000,000+
Columns 2 (title, abstract)
File Size 4.15 GB
Format CSV
Duplicates Removed
Missing Values Removed

🗂️ Dataset Structure

cleaned_papers.csv ├── title (string): Scientific paper title └── abstract (string): Scientific paper abstract

🔄 Data Processing Pipeline

The dataset underwent a rigorous cleaning and standardization process:

Data Import: Automated import from multiple sources (Kaggle API, Hugging Face)

Column Standardization: Mapping various column names to consistent title and abstract format

Data Validation: Pydantic model validation ensuring data quality

Duplicate Removal: Advanced deduplication based on title and abstract similarity

Null Value Handling: Removal of records with missing titles or abstracts

Quality Assurance: Final validation and statistics generation

💡 Use Cases

This dataset is ideal for:

Natural Language Processing: Text classification, sentiment analysis, topic modeling

Scientific Literature Analysis: Trend analysis, domain classification, citation prediction

Machine Learning Research: Training language models, text summarization, information extraction

Academic Research: Bibliometric analysis, research trend identification

Educational Applications: Building search engines, recommendation systems

🔗 Data Sources and Attribution

This dataset consolidates academic papers from the following sources:

Kaggle Datasets:

ArXiv Scientific Research Papers Dataset by @sumitm004

Cornell University ArXiv Dataset by @Cornell-University

Hugging Face Datasets:

ML-ArXiv-Papers by @CShorten

ArXiv Biology by @zeroshot

ArXiv Data Extended by @wrapper228

Stroke PubMed Abstracts by @Gaborandi

PubMed ArXiv Abstracts Data by @brainchalov

Abstracts Cleaned by @Eitanli

🔄 Update Schedule

This dataset represents a point-in-time consolidation. Future versions may include: - Additional academic sources - Extended fields (authors, publication dates, venues) - Domain-specific subsets - Enhanced metadata

📄 License and Usage

Please respect the individual licenses of the source datasets. This consolidated version is provided for research and educational purposes. When using this dataset:

Citation: Please cite this dataset and acknowledge the original data sources

Attribution: Credit the original dataset creators listed above

Compliance: Ensure compliance with individual dataset licenses

Academic Use: Primarily intended for non-commercial, academic, and research purposes

🙏 Acknowledgments

Special thanks to all the original dataset creators and the academic communities that make their research data publicly available. This work builds upon their valuable contributions to open science and knowledge sharing.

Keywords: academic papers, research abstracts, NLP, machine learning, text mining, scientific literature, ArXiv, PubMed, natural language processing, research dataset
Data sets and machine learning models for: Predicting critical properties of...
zenodo.org
bin, zip
Updated Oct 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sayandeep Biswas; Yunsie Chung; Yunsie Chung; Josephine Ramirez; Haoyang Wu; Haoyang Wu; William Green; William Green; Sayandeep Biswas; Josephine Ramirez (2023). Data sets and machine learning models for: Predicting critical properties of fluids using machine learning [Dataset]. http://doi.org/10.5281/zenodo.7804143
Explore at:
zip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7804143
Dataset updated
Oct 1, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sayandeep Biswas; Yunsie Chung; Yunsie Chung; Josephine Ramirez; Haoyang Wu; Haoyang Wu; William Green; William Green; Sayandeep Biswas; Josephine Ramirez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The experimental data sets, data splits, additional features, QM calculations, model predictions, and final machine learning models for the manuscript "Predicting critical properties of fluids using multi-task machine learning". Citation should refer directly to the manuscript. (citation will be added soon)

To use the machine learning models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop.

Detailed information can be found in README.md file.

Details on the properties considered

The data set includes the following 8 properties:

Tc: critical temperature, in K

Pc: critical pressure, in bar

rhoc: critical density, in mol/L

omega: acentric factor, unitless

Tb: boiling point, in K

Tm: melting point, in K

dHvap: enthalpy of vaporization at boiling point, in kJ/mol

dHfus: enthalpy of fusion at melting point, in kJ/mol

Details on the files

1. Data sets under CritProp_v1.0.0:

all_data: includes the data sets used in this work. All data points are listed for each chemical compound as well as its corresponding data source. The details of the data sources can be found in the README.md file. The distribution of the data set is included in each folder.

estimated_data_for_pretraining: contains the estimated data from Yaws' handbook that are used to pre-train our machine learning (ML) model.

experimental_data: contains the experimental data used to fine-tune our ML model.

additional_features: includes the additional features tested for the ML model.

abraham: Abraham solute parameters (E, S, A, B, L). Molecular features.

acsf: ACSF (atom-centered symmetry functions). Atomic features that are coverted from the 3D coordinates of the compound

qm_atom: QM (quantum chemical) atomic feature.

qm_mol: QM molecular feature.

rdkit: Selected RDKit 2D molecular features.

data_splits_and_model_predictions: contains the training set and test set used to for random and scaffold splits. It also contains the predicted values from our final ML model for each test set.

2. Machine learning (ML) model files:

CritProp_ML_model_fiiles_with_abraham_feat.zip: contains the Chemprop ML model files that are trained using Abraham features as additional molecular features. This gives the best results.

CritProp_ML_model_fiiles_without_additional_feat.zip: contains the Chemprop ML model files that are trained without any additional features. This gives the second best results.

To use these ML models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop

3. QM (quantum chemical) calculations:

QM_calculations.zip: contains the results of the QM calculations that are performed to compute QM features.
PMOA-CITE dataset
figshare.com
zip
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TONG ZENG (2023). PMOA-CITE dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12547574.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12547574.v1
Dataset updated
Jun 4, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
TONG ZENG
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset used in the experiments on the paper "Modeling citation worthiness by using attention‑based bidirectional long short‑term memory networks and interpretable models"There are one million sentences in total, and further splitted into trainning, validation and testing by 60%, 20% and 20%, respectively.For the pre-processing of the dataset, please refer to the paper.The data are stored in jsonl format (each row is an json object), we list a couple of rows as example:{"sec_name":"introduction","cur_sent_id":"12213838@0#3$0","next_sent_id":"12213838@0#3$1","cur_sent":"All three spectrin subunits are essential for normal development.","next_sent":"βH, encoded by the karst locus, is an essential protein that is required for epithelial morphogenesis .","cur_scaled_len_features":{"type":1,"values":[0.17716535433070865,0.13513513513513514]},"next_scaled_len_features":{"type":1,"values":[0.32677165354330706,0.35135135135135137]},"cur_has_citation":0,"next_has_citation":1}{"sec_name":"results","prev_sent_id":"12230634@1@1#0$2","cur_sent_id":"12230634@1@1#0$3","next_sent_id":"12230634@1@1#0$4","prev_sent":"μIU/ml at the 2.0-h postprandial time point.","cur_sent":"Statistically significant differences between the mean plasma insulin levels of dogs treated with 50 mg/kg of GSNO, and those treated with 50 mg/kg GSNO and vitamin C (50 mg/kg) were observed at the 1.0-h and 1.5-h time points (P < 0.05).","next_sent":"The mean plasma insulin concentrations in the dogs treated with 50 mg/kg of vitamin C and 50 mg/kg of GSNO, or 50 mg/kg of GSNO was significantly altered compared to those of controls or captopril-treated dogs (P < 0.05).","prev_scaled_len_features":{"type":1,"values":[0.09448818897637795,0.08108108108108109]},"cur_scaled_len_features":{"type":1,"values":[0.8582677165354331,1.0]},"next_scaled_len_features":{"type":1,"values":[0.7913385826771654,0.9459459459459459]},"prev_has_citation":0,"cur_has_citation":0,"next_has_citation":0}{"sec_name":"results","prev_sent_id":"12213837@1@0#3$3","cur_sent_id":"12213837@1@0#3$4","next_sent_id":"12213837@1@0#3$5","prev_sent":"Cleavage of VAMP2 by BoNT/D releases the NH2-terminal 59 amino acids from the protein and eliminates exocytosis.","cur_sent":"However, in this case, exocytosis cannot be recovered by addition of the cleaved fragment .","next_sent":"Peptides that exactly correspond to the BoNT/D cleavage site (VAMP2 aa 25–59 and 60–94-cys) were equally efficient at mediating liposome fusion (unpublished data).","prev_scaled_len_features":{"type":1,"values":[0.36220472440944884,0.35135135135135137]},"cur_scaled_len_features":{"type":1,"values":[0.2795275590551181,0.2972972972972973]},"next_scaled_len_features":{"type":1,"values":[0.562992125984252,0.5135135135135135]},"prev_has_citation":0,"cur_has_citation":1,"next_has_citation":0}For the code using this dataset to modeling citation worthiness, please refer to https://github.com/sciosci/cite-worthiness

Metric	Value
Total Records	~3,000,000+
Columns	2 (`title`, `abstract`)
File Size	4.15 GB
Format	CSV
Duplicates	Removed
Missing Values	Removed

arrhythmia

openml.org

Updated Apr 6, 2014

Facebook

Twitter

Click to copy link

Link copied

Cite

H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu (2014). arrhythmia [Dataset]. https://www.openml.org/d/5

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Apr 6, 2014

Authors

H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu

Description

Author: H. Altay Guvenir, Burak Acar, Haldun Muderrisoglu
Source: UCI
Please cite: UCI

Cardiac Arrhythmia Database
The aim is to determine the type of arrhythmia from the ECG recordings. This database contains 279 attributes, 206 of which are linear valued and the rest are nominal.

Concerning the study of H. Altay Guvenir: "The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01 refers to 'normal' ECG classes, 02 to 15 refers to different classes of arrhythmia and class 16 refers to the rest of unclassified ones. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard we aim to minimize this difference by means of machine learning tools.

The names and id numbers of the patients were recently removed from the database.

Attribute Information

  1 Age: Age in years , linear
  2 Sex: Sex (0 = male; 1 = female) , nominal
  3 Height: Height in centimeters , linear
  4 Weight: Weight in kilograms , linear
  5 QRS duration: Average of QRS duration in msec., linear
  6 P-R interval: Average duration between onset of P and Q waves
   in msec., linear
  7 Q-T interval: Average duration between onset of Q and offset
   of T waves in msec., linear
  8 T interval: Average duration of T wave in msec., linear
  9 P interval: Average duration of P wave in msec., linear
 Vector angles in degrees on front plane of:, linear
 10 QRS
 11 T
 12 P
 13 QRST
 14 J
 15 Heart rate: Number of heart beats per minute ,linear
 Of channel DI:
  Average width, in msec., of: linear
  16 Q wave
  17 R wave
  18 S wave
  19 R' wave, small peak just after R
  20 S' wave
  21 Number of intrinsic deflections, linear
  22 Existence of ragged R wave, nominal
  23 Existence of diphasic derivation of R wave, nominal
  24 Existence of ragged P wave, nominal
  25 Existence of diphasic derivation of P wave, nominal
  26 Existence of ragged T wave, nominal
  27 Existence of diphasic derivation of T wave, nominal
 Of channel DII: 
  28 .. 39 (similar to 16 .. 27 of channel DI)
 Of channels DIII:
  40 .. 51
 Of channel AVR:
  52 .. 63
 Of channel AVL:
  64 .. 75
 Of channel AVF:
  76 .. 87
 Of channel V1:
  88 .. 99
 Of channel V2:
  100 .. 111
 Of channel V3:
  112 .. 123
 Of channel V4:
  124 .. 135
 Of channel V5:
  136 .. 147
 Of channel V6:
  148 .. 159
 Of channel DI:
  Amplitude , * 0.1 milivolt, of
  160 JJ wave, linear
  161 Q wave, linear
  162 R wave, linear
  163 S wave, linear
  164 R' wave, linear
  165 S' wave, linear
  166 P wave, linear
  167 T wave, linear
  168 QRSA , Sum of areas of all segments divided by 10,
    ( Area= width * height / 2 ), linear
  169 QRSTA = QRSA + 0.5 * width of T wave * 0.1 * height of T
    wave. (If T is diphasic then the bigger segment is
    considered), linear
 Of channel DII:
  170 .. 179
 Of channel DIII:
  180 .. 189
 Of channel AVR:
  190 .. 199
 Of channel AVL:
  200 .. 209
 Of channel AVF:
  210 .. 219
 Of channel V1:
  220 .. 229
 Of channel V2:
  230 .. 239
 Of channel V3:
  240 .. 249
 Of channel V4:
  250 .. 259
 Of channel V5:
  260 .. 269
 Of channel V6:
  270 .. 279

Class code - class - number of instances:

  01       Normal        245
  02       Ischemic changes (Coronary Artery Disease)  44
  03       Old Anterior Myocardial Infarction      15
  04       Old Inferior Myocardial Infarction      15
  05       Sinus tachycardy    13
  06       Sinus bradycardy    25
  07       Ventricular Premature Contraction (PVC)    3
  08       Supraventricular Premature Contraction    2
  09       Left bundle branch block     9 
  10       Right bundle branch block    50
  11       1. degree AtrioVentricular block    0 
  12       2. degree AV block        0
  13       3. degree AV block        0
  14       Left ventricule hypertrophy        4
  15       Atrial Fibrillation or Flutter        5
  16       Others         22

d
Galilee subregion groundwater usage estimates dataset v01
data.gov.au
researchdata.edu.au
zip
Updated Apr 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bioregional Assessment Program (2022). Galilee subregion groundwater usage estimates dataset v01 [Dataset]. https://data.gov.au/data/dataset/068065ad-b7ac-4197-8837-2d362770017a
Explore at:
zip(14200605)Available download formats
Dataset updated
Apr 13, 2022
Dataset authored and provided by
Bioregional Assessment Program
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Galilee
Description
Abstract

The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

Do not use or publish - some source data used to create this version of the Galilee subregion groundwater usage estimates cannot be used and/or published by the Bioregional Assessments team due to licencing permissions. The restricted source data is not saved within the BA repository and is therefore not linked in the lineage. This version of the dataset has been replaced with the version 2 dataset which has had the restricted source data removed: Galilee subregion groundwater usage estimates dataset v02 (GUID: 339532fb-2ba6-424a-87fb-70d35df12abf)

This dataset was created to provide an estimate of yearly groundwater use from active bores in the Galilee subregion.

Dataset History

The majority of bores in the Galilee subregion do not operate under a groundwater licensing arrangement. Thus, summarising groundwater licence allocations can underestimate the annual groundwater withdrawals from aquifer systems. A dataset for estimating yearly groundwater use from all bores, in ML/year, was compiled using the following steps:

compile a list of all bores in the Galilee subregion using data available from the Queensland groundwater database

for each bore, where data are available incorporate interpreted stratigraphic picks for screened intervals into the dataset

for each bore, where data are available, incorporate the most recent standing water level and bore maximum discharge data into the dataset. Maximum discharge will need to be re-calculated from L/second to ML/year

from the water licence dataset, incorporate the licensed water allocation volume, bore use and GMA information

incorporate relevant Great Artesian Basin Sustainability Initiative (GABSI) data into the dataset and ensure that information in GABSI dataset is accurately reflected by the bore facility status records

investigate the bore facility status records. Bore facility status categories include: existing; abandoned but usable; abandoned and destroyed and proposed. Only those classed as 'existing' or 'abandoned but usable' were kept in the dataset. It is assumed that bores in other categories are not functional

interrogate bore use records. Remove any bore from the dataset that is tagged as a monitoring bore. It is assumed that monitoring bores are not being used for any purpose other than groundwater monitoring

insert two new blank columns, 'BA groundwater usage' and 'groundwater use source' in the dataset. The 'BA groundwater usage' column is where the estimate for annual groundwater usage is recorded for a bore in ML/year. The 'groundwater use source' column is where the decision on how yearly groundwater usage is assigned is recorded

populate the 'BA groundwater usage' and 'groundwater use source' columns.

Queensland Government (Queensland Government, 2014, pers. comm.) provides some information on the estimation of annual water usage for groundwater bores in Queensland. Some steps to determining an estimate of groundwater usage for each bore are as follows:

populate the BA groundwater usage column with water licence allocations that are greater than 0 ML/year. While the full allocation may not actually be used, this will provide a maximum allowable water allocation that could be pumped from a particular area. This has the potential to conserve the unused allocations when estimating groundwater usage for an area

sub-artesian bores - Queensland Government (Queensland Government, 2014, pers. comm.) suggests 5 ML/year or bore maximum flow rate, whichever is least

controlled Artesian Bore - Queensland Government (Queensland Government, 2014, pers. comm.) suggests 30 ML/year or the maximum flow rate, whichever is least

uncontrolled Artesian Bore - Queensland Government (Queensland Government, 2014, pers. comm.) suggests use the flow rate in ML/year

uncontrolled artesian bores missing flow rate and standing water level information - the average flow rate for all uncontrolled artesian bores located within the Galilee subregion was calculated from existing data. The average was then assigned as nominal value for uncontrolled artesian bores with no flow rate data. For Galilee subregion this equated to 124 ML/year.

Dataset Citation

Geoscience Australia (XXXX) Galilee subregion groundwater usage estimates dataset v01. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/068065ad-b7ac-4197-8837-2d362770017a.

Dataset Ancestors

Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204

Derived From QLD Department of Natural Resources and Mines Groundwater Database Extract 20142808

Data from: Annotated dataset of simulated voiding sound for urine flow...

springernature.figshare.com
portaldelaciencia.uva.es

application/x-rar

Updated Jun 13, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Marcos Lazaro Alvarez; Laura Arjona; Alfonso Bahillo; Ganeko Bernardo (2025). Annotated dataset of simulated voiding sound for urine flow estimation [Dataset]. http://doi.org/10.6084/m9.figshare.27606642.v1

Explore at:

application/x-rarAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.27606642.v1

Dataset updated

Jun 13, 2025

Dataset provided by

figshare

Authors

Marcos Lazaro Alvarez; Laura Arjona; Alfonso Bahillo; Ganeko Bernardo

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Annotated dataset of simulated voiding sound for urine flow estimation

Overview

This repository contains a dataset of synthetic urination audio signals generated under controlled conditions. The dataset is intended for research in sound-based uroflowmetry (SU) and the development of machine learning models for voiding flow estimation. The audio samples were generated using a high-precision peristaltic pump to simulate flow rates between 1 and 50 ml/s and were recorded with three different microphone devices.

Experimental Setup

The audio recordings were conducted in a bathroom setup where the synthetic urine stream generated by the peristaltic pump was directed into a standard ceramic toilet containing a fixed volume of water at the bottom, ensuring that the sound was produced by the interaction of the liquid stream with the water surface. This configuration mimics realistic voiding conditions and allows the captured audio to resemble actual urination events.

Dataset Structure

Each audio file is a 60-second segment labeled with its corresponding flow rate (in ml/s). The naming convention is:

[device]_f_[flow]_[duration]s.wav

device: UM (Ultramic384k), Phone (Mi A1), or Watch (Oppo Smartwatch)
flow: flow rate from 1 to 50 ml/s
duration: fixed to 60 seconds for all files

Example:

um_f_20_60s.wav phone_f_45_60s.wav watch_f_10_60s.wav

Silence Reference Recordings

In addition to the voiding audio files, each device folder includes a 30-second silence recording. These recordings were captured in the same environment and setup, but without any synthetic flow, allowing for baseline noise analysis. They serve as a reference to evaluate the background noise characteristics of each device and to support preprocessing techniques such as noise reduction or signal enhancement.

Filename format:

[device]_f_0_30s.wav

Example:

um_f_0s.wav
phone_f_0s.wav
oppo_f_0s.wav

Purpose

The goal of this dataset is to provide a standardized audio repository for the development, training and validation of machine learning algorithms for voiding flow prediction. This enables researchers to: - Benchmark different approaches on a common dataset - Develop flow estimation models using synthetic audio before transferring them to real-world applications - Explore the spectral and temporal structure of urination-related audio signals

Flow Generation

Pump Used: L600-1F precision peristaltic pump
Flow Range: 1–50 ml/s (based on ICS-reported ranges for male uroflowmetry)
Calibration: Pump flows were validated using a graduated cylinder
Noise Isolation: The pump was placed in a separate room (via 15m silicone tubing) to eliminate pump noise from recordings

Recording Devices

Device	Sampling Rate	Frequency Range	Description
UM	192 kHz	0–96 kHz	High-quality ultrasonic microphone
Phone	48 kHz	0–24 kHz	Android smartphone (Mi A1)
Watch	44.1 kHz	0–22.05 kHz	Oppo Smartwatch with built-in mic

Each recording was carried out using a custom mobile or desktop app with preset parameters.

Recording Environment

Recordings were made in a bathroom with a standard ceramic toilet containing water at the bottom.
The nozzle height varied between 73–86 cm depending on flow rate to ensure consistent water impact.
Microphone heights:
- UM: 84 cm
- Phone: 95 cm
- Watch: 86 cm (simulating wrist height)

Data Collection Protocol

Pump activated with flow set from 1 to 50 ml/s.
Audio recorded simultaneously with UM, Phone and Watch for 80 seconds.
Initial 15 seconds and final 5 seconds trimmed to retain 60 seconds of steady-state urination sound.

Citation

If you use this dataset in your work, please cite the associated paper:

M. L. Alvarez et al., “Annotated dataset of simulated voiding sound for urine flow estimation”, 2025. (Pending publication)

License

This dataset is made available for research purposes under a CC BY license.

Contact

For questions, please contact:

Marcos Lazaro Alvarez Faculty of Engineering, University of Deusto
alvarez.marcoslazaro@deusto.es

o
kr-vs-kp
openml.org
Updated Apr 6, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alen Shapiro (2014). kr-vs-kp [Dataset]. https://www.openml.org/d/3
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 6, 2014
Authors
Alen Shapiro
Description
Author: Alen Shapiro Source: UCI Please cite: UCI citation policy

Title: Chess End-Game -- King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7). The pawn on a7 means it is one square away from queening. It is the King+Rook's side (white) to move.

Sources: (a) Database originally generated and described by Alen Shapiro. (b) Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk). (c) Date: 1 August 1989

Past Usage:

Alen D. Shapiro (1983,1987), "Structured Induction in Expert Systems", Addison-Wesley. This book is based on Shapiro's Ph.D. thesis (1983) at the University of Edinburgh entitled "The Role of Structured Induction in Expert Systems".

Stephen Muggleton (1987), "Structuring Knowledge by Asking Questions", pp.218-229 in "Progress in Machine Learning", edited by I. Bratko and Nada Lavrac, Sigma Press, Wilmslow, England SK9 5BB.

Robert C. Holte, Liane Acker, and Bruce W. Porter (1989), "Concept Learning and the Problem of Small Disjuncts", Proceedings of IJCAI. Also available as technical report AI89-106, Computer Sciences Department, University of Texas at Austin, Austin, Texas 78712.

Relevant Information: The dataset format is described below. Note: the format of this database was modified on 2/26/90 to conform with the format of all the other databases in the UCI repository of machine learning databases.

Number of Instances: 3196 total

Number of Attributes: 36

Attribute Summaries: Classes (2): -- White-can-win ("won") and White-cannot-win ("nowin"). I believe that White is deemed to be unable to win if the Black pawn can safely advance. Attributes: see Shapiro's book.

Missing Attributes: -- none

Class Distribution: In 1669 of the positions (52%), White can win. In 1527 of the positions (48%), White cannot win.

The format for instances in this database is a sequence of 37 attribute values. Each instance is a board-descriptions for this chess endgame. The first 36 attributes describe the board. The last (37th) attribute is the classification: "win" or "nowin". There are 0 missing values. A typical board-description is

f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won

The names of the features do not appear in the board-descriptions. Instead, each feature correponds to a particular position in the feature-value list. For example, the head of this list is the value for the feature "bkblk". The following is the list of features, in the order in which their values appear in the feature-value list:

[bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd, hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr, skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg]

In the file, there is one instance (board position) per line.

Num Instances: 3196 Num Attributes: 37 Num Continuous: 0 (Int 0 / Real 0) Num Discrete: 37 Missing values: 0 / 0.0%

IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

zenodo.org
data.niaid.nih.gov
+1more

Updated Aug 30, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

José Areia; José Areia; Ivo Afonso Bispo; Ivo Afonso Bispo; Leonel Santos; Leonel Santos; Rogério Luís Costa; Rogério Luís Costa (2024). IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT [Dataset]. http://doi.org/10.5281/zenodo.8116338

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.8116338

Dataset updated

Aug 30, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

José Areia; José Areia; Ivo Afonso Bispo; Ivo Afonso Bispo; Leonel Santos; Leonel Santos; Rogério Luís Costa; Rogério Luís Costa

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Article Information

The work involved in developing the dataset and benchmarking its use of machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.

Please do cite the aforementioned article when using this dataset.

Abstract

The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and provided it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.

ZIP Folder Content

The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.

To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.

This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.

Datasets' Content

Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.

Identified Key Features Within Bluetooth Dataset

Feature	Meaning
btle.advertising_header	BLE Advertising Packet Header
btle.advertising_header.ch_sel	BLE Advertising Channel Selection Algorithm
btle.advertising_header.length	BLE Advertising Length
btle.advertising_header.pdu_type	BLE Advertising PDU Type
btle.advertising_header.randomized_rx	BLE Advertising Rx Address
btle.advertising_header.randomized_tx	BLE Advertising Tx Address
btle.advertising_header.rfu.1	Reserved For Future 1
btle.advertising_header.rfu.2	Reserved For Future 2
btle.advertising_header.rfu.3	Reserved For Future 3
btle.advertising_header.rfu.4	Reserved For Future 4
btle.control.instant	Instant Value Within a BLE Control Packet
btle.crc.incorrect	Incorrect CRC
btle.extended_advertising	Advertiser Data Information
btle.extended_advertising.did	Advertiser Data Identifier
btle.extended_advertising.sid	Advertiser Set Identifier
btle.length	BLE Length
frame.cap_len	Frame Length Stored Into the Capture File
frame.interface_id	Interface ID
frame.len	Frame Length Wire
nordic_ble.board_id	Board ID
nordic_ble.channel	Channel Index
nordic_ble.crcok	Indicates if CRC is Correct
nordic_ble.flags	Flags
nordic_ble.packet_counter	Packet Counter
nordic_ble.packet_time	Packet time (start to end)
nordic_ble.phy	PHY
nordic_ble.protover	Protocol Version

Identified Key Features Within IP-Based Packets Dataset

Feature	Meaning
http.content_length	Length of content in an HTTP response
http.request	HTTP request being made
http.response.code	Sequential number of an HTTP response
http.response_number	Sequential number of an HTTP response
http.time	Time taken for an HTTP transaction
tcp.analysis.initial_rtt	Initial round-trip time for TCP connection
tcp.connection.fin	TCP connection termination with a FIN flag
tcp.connection.syn	TCP connection initiation with SYN flag
tcp.connection.synack	TCP connection establishment with SYN-ACK flags
tcp.flags.cwr	Congestion Window Reduced flag in TCP
tcp.flags.ecn	Explicit Congestion Notification flag in TCP
tcp.flags.fin	FIN flag in TCP
tcp.flags.ns	Nonce Sum flag in TCP
tcp.flags.res	Reserved flags in TCP
tcp.flags.syn	SYN flag in TCP
tcp.flags.urg	Urgent flag in TCP
tcp.urgent_pointer	Pointer to urgent data in TCP
ip.frag_offset	Fragment offset in IP packets
eth.dst.ig	Ethernet destination is in the internal network group
eth.src.ig	Ethernet source is in the internal network group
eth.src.lg	Ethernet source is in the local network group
eth.src_not_group	Ethernet source is not in any network group
arp.isannouncement	Indicates if an ARP message is an announcement

Identified Key Features Within IP-Based Flows Dataset

Feature	Meaning
proto	Transport layer protocol of the connection
service	Identification of an application protocol
orig_bytes	Originator payload bytes
resp_bytes	Responder payload bytes
history	Connection state history
orig_pkts	Originator sent packets
resp_pkts	Responder sent packets
flow_duration	Length of the flow in seconds
fwd_pkts_tot	Forward packets total
bwd_pkts_tot	Backward packets total
fwd_data_pkts_tot	Forward data packets total
bwd_data_pkts_tot	Backward data packets total
fwd_pkts_per_sec	Forward packets per second
bwd_pkts_per_sec	Backward packets per second
flow_pkts_per_sec	Flow packets per second
fwd_header_size	Forward header bytes
bwd_header_size	Backward header bytes
fwd_pkts_payload	Forward payload bytes
bwd_pkts_payload	Backward payload bytes
flow_pkts_payload	Flow payload bytes
fwd_iat	Forward inter-arrival time
bwd_iat	Backward inter-arrival time
flow_iat	Flow inter-arrival time
active	Flow active duration

Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT...
zenodo.org
bin, text/x-python
Updated Mar 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stefan Ganscha; Stefan Ganscha; Oliver T. Unke; Oliver T. Unke; Daniel Ahlin; Daniel Ahlin; Hartmut Maennel; Hartmut Maennel; Sergii Kashubin; Sergii Kashubin; Klaus-Robert Mueller; Klaus-Robert Mueller (2025). Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations [Dataset]. http://doi.org/10.5281/zenodo.14859804
Explore at:
text/x-python, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14859804
Dataset updated
Mar 5, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Stefan Ganscha; Stefan Ganscha; Oliver T. Unke; Oliver T. Unke; Daniel Ahlin; Daniel Ahlin; Hartmut Maennel; Hartmut Maennel; Sergii Kashubin; Sergii Kashubin; Klaus-Robert Mueller; Klaus-Robert Mueller
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Time period covered
2024
Description
Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data that was used to train the model. In this work, we introduce the QCML dataset: A comprehensive dataset for training ML models for quantum chemistry. The QCML dataset systematically covers chemical space with small molecules consisting of up to 8 heavy atoms and includes elements from a large fraction of the periodic table, as well as different electronic states. Starting from chemical graphs, conformer search and normal mode sampling are used to generate both equilibrium and off-equilibrium 3D structures, for which various properties are calculated with semi-empirical methods (14.7 billion entries) and density functional theory (33.5 million entries). The covered properties include energies, forces, multipole moments, and other quantities, e.g. Kohn-Sham matrices. We provide a first demonstration of the utility of our dataset by training ML-based force fields on the data and applying them to run molecular dynamics simulations.

The data is available as TensorFlow dataset (TFDS) and can be accessed from the publicly available Google Cloud Storage at gs://qcml-datasets/tfds/. (See "Directory structure" below.)

For information on different access options (command-line tools, client libraries, etc), please see https://cloud.google.com/storage/docs/access-public-data.

Storage API: Using the "directory structure" and "builder configurations" below, storage API links can be constructed, e.g.
https://storage.mtls.cloud.google.com/qcml-datasets/tfds/qcml/dft_metadata/1.0.0/qcml-full.tfrecord-00010-of-00011

gcloud: Our example_usage.py uses the gcloud command-line tool for download.

Web access via Google Cloud Console is possible for any authenticated Cloud user: https://console.cloud.google.com/storage/browser/qcml-datasets/.

Directory structure

gs://qcml-datasets (GCS Bucket)

tfds (TFDS data directory)

qcml (TFDS dataset name)

dft_atomic_numbers (TFDS builder config name)

1.0.0 (Current version)

dataset_info.json

features.json

qcml-full.tfrecord-X-of-Y (TFDS data shards, see below)

...

dft_positions

xtb_all

Builder configurations

Format: Builder config name: number of shards (rounded total size)

Semi-empirical calculations:

xtb_all: 85000 (69 TB)

DFT calculations:

dft_atomic_numbers: 11 (3 GB)

dft_d4_atomic_charges: 11 (4 GB)

dft_d4_c6_coefficients: 11 (4 GB)

dft_d4_correction: 11 (8 GB)

dft_d4_energy: 11 (2 GB)

dft_d4_forces: 11 (7 GB)

dft_d4_polarizabilities: 11 (4 GB)

dft_force_field: 11 (18 GB)

dft_force_field_d4: 110 (24 GB)

dft_force_field_mbd: 110 (24 GB)

dft_gfn0_dipole: 11 (3 GB)

dft_gfn0_eeq_charges: 11 (4 GB)

dft_gfn0_energy: 11 (2 GB)

dft_gfn0_forces: 11 (7 GB)

dft_gfn0_formation_energy: 11 (3 GB)

dft_gfn0_orbital_energies_a: 11 (8 GB)

dft_gfn0_orbital_occupations_a: 11 (8 GB)

dft_gfn0_wiberg_bond_orders: 110 (29 GB)

dft_gfn2_dipole: 11 (3 GB)

dft_gfn2_energy: 11 (2 GB)

dft_gfn2_forces: 11 (7 GB)

dft_gfn2_formation_energy: 11 (3 GB)

dft_gfn2_mulliken_charges: 11 (4 GB)

dft_gfn2_orbital_energies_a: 11 (7 GB)

dft_gfn2_orbital_occupations_a: 11 (7 GB)

dft_gfn2_wiberg_bond_orders: 110 (29 GB)

dft_is_outlier: 11 (2 GB)

dft_mbd_c6_coefficients: 11 (4 GB)

dft_mbd_correction: 11 (8 GB)

dft_mbd_energy: 11 (2 GB)

dft_mbd_forces: 11 (7 GB)

dft_mbd_polarizabilities: 11 (4 GB)

dft_metadata: 11 (11 GB)

dft_multipole_moments: 11 (8 GB)

dft_pbe0_core_hamiltonian_matrix: 110000 (30 TB)

dft_pbe0_density_matrix_a: 110000 (30 TB)

dft_pbe0_density_matrix_b: 110000 (3 TB)

dft_pbe0_dipole: 11 (3 GB)

dft_pbe0_electronic_free_energy: 11 (3 GB)

dft_pbe0_energy: 11 (2 GB)

dft_pbe0_forces: 11 (7 GB)

dft_pbe0_formation_energy: 11 (3 GB)

dft_pbe0_grid_density_a: 110000 (27 TB)

dft_pbe0_grid_density_b: 110000 (3 TB)

dft_pbe0_grid_density_gradient_a: 110000 (81 TB)

dft_pbe0_grid_density_gradient_b: 110000 (10 TB)

dft_pbe0_grid_density_laplacian_a: 110000 (27 TB)

dft_pbe0_grid_density_laplacian_b: 110000 (3 TB)

dft_pbe0_grid_kinetic_energy_density_a: 110000 (27 TB)

dft_pbe0_grid_kinetic_energy_density_b: 110000 (3 TB)

dft_pbe0_grid_points: 110000 (81 TB)

dft_pbe0_grid_weight: 110000 (27 TB)

dft_pbe0_guid: 11 (3 GB)

dft_pbe0_hamiltonian_matrix_a: 110000 (30 TB)

dft_pbe0_hamiltonian_matrix_b: 110000 (3 TB)

dft_pbe0_has_equal_a_b_electrons: 11 (3 GB)

dft_pbe0_hexadecapole: 11 (3 GB)

dft_pbe0_hirshfeld_charges: 11 (4 GB)

dft_pbe0_hirshfeld_dipoles: 11 (8 GB)

dft_pbe0_hirshfeld_quadrupoles: 11 (11 GB)

dft_pbe0_hirshfeld_spins: 11 (3 GB)

dft_pbe0_hirshfeld_volume_ratios: 11 (4 GB)

dft_pbe0_hirshfeld_volumes: 11 (4 GB)

dft_pbe0_loewdin_charges: 11 (4 GB)

dft_pbe0_loewdin_spins: 11 (3 GB)

dft_pbe0_mulliken_charges: 11 (4 GB)

dft_pbe0_mulliken_spins: 11 (3 GB)

dft_pbe0_num_scf_iterations: 11 (3 GB)

dft_pbe0_octupole: 11 (3 GB)

dft_pbe0_orbital_coefficients_a: 110000 (30 TB)

dft_pbe0_orbital_coefficients_b: 110000 (3 TB)

dft_pbe0_orbital_energies_a: 110 (44 GB)

dft_pbe0_orbital_energies_b: 11 (8 GB)

dft_pbe0_orbital_occupations_a: 110 (44 GB)

dft_pbe0_orbital_occupations_b: 11 (8 GB)

dft_pbe0_overlap_matrix: 110000 (30 TB)

dft_pbe0_quadrupole: 11 (3 GB)

dft_pbe0_zero_broadening_corrected_energy: 11 (3 GB)

dft_population_analysis: 11 (19 GB)

dft_positions: 11 (7 GB)
PineTime heart rate dataset
zenodo.org
zip
Updated Aug 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Piotr Sowiński; Piotr Sowiński; Monika Kobus; Monika Kobus; Anna Dąbrowska; Anna Dąbrowska (2023). PineTime heart rate dataset [Dataset]. http://doi.org/10.5281/zenodo.8220127
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8220127
Dataset updated
Aug 8, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Piotr Sowiński; Piotr Sowiński; Monika Kobus; Monika Kobus; Anna Dąbrowska; Anna Dąbrowska
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset of heart rate measurements collected from the PineTime wristband, with a gold standard reference.

Contents

The repository contains both the raw and the "merged", clean data. The merged data is much easier to work with and should be used when building machine learning models. The raw data is provided for transparency, reproducibility, and to allow for studies that could use the other data collected from the Equivital device.

schedule.md – schedule of the study, indicating the start and end times of each exercise and break.

data_raw/ – raw data collected from the PineTime wristband and the Equivital device. Each subdirectory corresponds to one participant. The files are in the Feather format.

data_merged/ – merged data series that can be used for building ML models. The files are in JSON format and follow a nested structure, where each heart rate measurement is associated with a series of acceleration measurements that preceded it. Each file corresponds to one continuous measurement session – there are sometimes multiple sessions per participant due to intermittent hardware failures.

Citation

If you use this data in research works, please cite the following paper:

Sowiński, P., Rachwał, K., Danilenka, A., Bogacka, K., Kobus, M., Dąbrowska, A., Paszkiewicz, A., et al. (2023). Frugal Heart Rate Correction Method for Scalable Health and Safety Monitoring in Construction Sites. Sensors, 23(14), 6464. MDPI AG. Retrieved from http://dx.doi.org/10.3390/s23146464

BibTeX:

@article{sowinski2023frugal, title={Frugal Heart Rate Correction Method for Scalable Health and Safety Monitoring in Construction Sites}, author={Sowi{\'n}ski, Piotr and Rachwa{\l}, Kajetan and Danilenka, Anastasiya and Bogacka, Karolina and Kobus, Monika and D{\k{a}}browska, Anna and Paszkiewicz, Andrzej and Bolanowski, Marek and Ganzha, Maria and Paprzycki, Marcin}, journal={Sensors}, volume={23}, number={14}, pages={6464}, year={2023}, publisher={MDPI}, url = {https://www.mdpi.com/1424-8220/23/14/6464}, doi = {10.3390/s23146464} }

Authors

Monika Kobus – data collection

Anna Dąbrowska – data collection, methodological supervision

Piotr Sowiński – data collection and processing

Acknowledgements

This work is part of the ASSIST-IoT project that has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreement No 957258.

The Central Institute for Labour Protection – National Research Institute provided facilities and equipment for data collection.

License

The dataset is licensed under the Creative Commons Attribution 4.0 International License.
E-Commerce Customer Behavior & Sales Analysis -TR
kaggle.com
zip
Updated Oct 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UmutUygurr (2025). E-Commerce Customer Behavior & Sales Analysis -TR [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/e-commerce-customer-behavior-and-sales-analysis-tr
Explore at:
zip(138245 bytes)Available download formats
Dataset updated
Oct 29, 2025
Authors
UmutUygurr
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
🛒 E-Commerce Customer Behavior and Sales Dataset 📊 Dataset Overview This comprehensive dataset contains 5,000 e-commerce transactions from a Turkish online retail platform, spanning from January 2023 to March 2024. The dataset provides detailed insights into customer demographics, purchasing behavior, product preferences, and engagement metrics.

🎯 Use Cases This dataset is perfect for:

Customer Segmentation Analysis: Identify distinct customer groups based on behavior Sales Forecasting: Predict future sales trends and patterns Recommendation Systems: Build product recommendation engines Customer Lifetime Value (CLV) Prediction: Estimate customer value Churn Analysis: Identify customers at risk of leaving Marketing Campaign Optimization: Target customers effectively Price Optimization: Analyze price sensitivity across categories Delivery Performance Analysis: Optimize logistics and shipping 📁 Dataset Structure The dataset contains 18 columns with the following features:

Order Information Order_ID: Unique identifier for each order (ORD_XXXXXX format) Date: Transaction date (2023-01-01 to 2024-03-26) Customer Demographics Customer_ID: Unique customer identifier (CUST_XXXXX format) Age: Customer age (18-75 years) Gender: Customer gender (Male, Female, Other) City: Customer city (10 major Turkish cities) Product Information Product_Category: 8 categories (Electronics, Fashion, Home & Garden, Sports, Books, Beauty, Toys, Food) Unit_Price: Price per unit (in TRY/Turkish Lira) Quantity: Number of units purchased (1-5) Transaction Details Discount_Amount: Discount applied (if any) Total_Amount: Final transaction amount after discount Payment_Method: Payment method used (5 types) Customer Behavior Metrics Device_Type: Device used for purchase (Mobile, Desktop, Tablet) Session_Duration_Minutes: Time spent on website (1-120 minutes) Pages_Viewed: Number of pages viewed during session (1-50) Is_Returning_Customer: Whether customer has purchased before (True/False) Post-Purchase Metrics Delivery_Time_Days: Delivery duration (1-30 days) Customer_Rating: Customer satisfaction rating (1-5 stars) 📈 Key Statistics Total Records: 5,000 transactions Date Range: January 2023 - March 2024 (15 months) Average Transaction Value: ~450 TRY Customer Satisfaction: 3.9/5.0 average rating Returning Customer Rate: 60% Mobile Usage: 55% of transactions 🔍 Data Quality ✅ No missing values ✅ Consistent formatting across all fields ✅ Realistic data distributions ✅ Proper data types for all columns ✅ Logical relationships between features 💡 Sample Analysis Ideas Customer Segmentation with K-Means Clustering

Segment customers based on spending, frequency, and recency Sales Trend Analysis

Identify seasonal patterns and peak shopping periods Product Category Performance

Compare revenue, ratings, and return rates across categories Device-Based Behavior Analysis

Understand how device choice affects purchasing patterns Predictive Modeling

Build models to predict customer ratings or purchase amounts City-Level Market Analysis

Compare market performance across different cities 🛠️ Technical Details File Format: CSV (Comma-Separated Values) Encoding: UTF-8 File Size: ~500 KB Delimiter: Comma (,) 📚 Column Descriptions Column Name Data Type Description Example Order_ID String Unique order identifier ORD_001337 Customer_ID String Unique customer identifier CUST_01337 Date DateTime Transaction date 2023-06-15 Age Integer Customer age 35 Gender String Customer gender Female City String Customer city Istanbul Product_Category String Product category Electronics Unit_Price Float Price per unit 1299.99 Quantity Integer Units purchased 2 Discount_Amount Float Discount applied 129.99 Total_Amount Float Final amount paid 2469.99 Payment_Method String Payment method Credit Card Device_Type String Device used Mobile Session_Duration_Minutes Integer Session time 15 Pages_Viewed Integer Pages viewed 8 Is_Returning_Customer Boolean Returning customer True Delivery_Time_Days Integer Delivery duration 3 Customer_Rating Integer Satisfaction rating 5 🎓 Learning Outcomes By working with this dataset, you can learn:

Data cleaning and preprocessing techniques Exploratory Data Analysis (EDA) with Python/R Statistical analysis and hypothesis testing Machine learning model development Data visualization best practices Business intelligence and reporting 📝 Citation If you use this dataset in your research or project, please cite:

E-Commerce Customer Behavior and Sales Dataset (2024) Turkish Online Retail Platform Data (2023-2024) Available on Kaggle ⚖️ License This dataset is released under the CC0: Public Domain license. You are free to use it for any purpose.

🤝 Contribution Found any issues or have suggestions? Feel free to provide feedback!

📞 Contact For questions or collaborations, please reach out through Kaggle.

Happy Analyzing! 🚀

Keywords: e-c...
Z
2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Sep 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix (2023). 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning: Slices 2,001-3,000 (reference reconstructions and segmentations) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8017611
Explore at:
Dataset updated
Sep 25, 2023
Dataset provided by
University of Manchester
Leiden University
Centrum Wiskunde & Informatica
Authors
Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This upload contains the reference reconstructions and segmentation of slices 2,001 – 3,000 from the data collection described in

Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka “"2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023)

Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."

The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, (74.8\mu m^2) each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the moving of the components independently from one another.

Please refer to the paper for all further technical details.

The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD. The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.

The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.

Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping the file on a Linux system rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable by setting in your .bashrc “export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE”.

For more information or guidance in using the data collection, please get in touch with

Maximilian.Kiss [at] cwi.nl Felix.Lucka [at] cwi.nl
Z
2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...
data.niaid.nih.gov
Updated Sep 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix (2023). 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning: Slices 3,001-4,000 (reference reconstructions and segmentations) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8017617
Explore at:
Dataset updated
Sep 25, 2023
Dataset provided by
University of Manchester
Leiden University
Centrum Wiskunde & Informatica
Authors
Kiss, Maximilian B.; Coban, Sophia Bethany; Batenburg, K. Joost; van Leeuwen, Tristan; Lucka, Felix
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This upload contains the reference reconstructions and segmentation of slices 3,001 – 4,000 from the data collection described in

Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka “"2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023)

Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."

The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, (74.8\mu m^2) each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the moving of the components independently from one another.

Please refer to the paper for all further technical details.

The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD. The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.

The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.

Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping the file on a Linux system rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable by setting in your .bashrc “export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE”.

For more information or guidance in using the data collection, please get in touch with

Maximilian.Kiss [at] cwi.nl Felix.Lucka [at] cwi.nl
n
Data from: Large-scale integration of single-cell transcriptomic data...
data.niaid.nih.gov
data-staging.niaid.nih.gov
+2more
zip
Updated Dec 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove (2021). Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration [Dataset]. http://doi.org/10.5061/dryad.t4b8gtj34
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.t4b8gtj34
Dataset updated
Dec 14, 2021
Dataset provided by
Cornell University
Authors
David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.

Methods Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute of Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.

Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).

Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 106 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced using on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.

Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL6/J mice were injected with 10µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). Optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using Zeiss PALM MicroBeam laser capture microdissection system and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1=28bp, Read 2=120bp, Index 1=10bp, and Index 2=10bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loop Browser v4.0.0 software (10x Genomics). Sequencing data was then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline to generate a feature-by-spot-barcode expression matrix (10x Genomics).

Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).

Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCounts and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset, individually. BCmvn optimization was used for PK parameterization. Estimated doublet rates were computed by fitting the total number of cells after quality filtering to a linear regression of the expected doublet rates published in the 10x Chromium handbook. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default PN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch-correction with three tools, independently- Harmony (github.com/immunogenomics/harmony) (v1.0), Scanorama (github.com/brianhie/scanorama) (v1.3), and BBKNN (github.com/Teichlab/bbknn) (v1.3.12). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.

Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.

Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).

Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myoctes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the percentage of cells within the cluster which express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using

Facebook

Twitter

Click to copy link

Link copied

Cite

Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/

News Category Dataset

Identify the type of news based on headlines and short descriptions

Explore at:

44 scholarly articles cite this dataset (View in Google Scholar)

zip(27829769 bytes)Available download formats

Dataset updated

Sep 24, 2022

Authors

Rishabh Misra

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Context

** Please cite the dataset using the BibTex provided in one of the following sections if you are using it in your research, thank you! **

This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

Content

Each record in the dataset consists of the following attributes: - category: category in which the article was published. - headline: the headline of the news article. - authors: list of authors who contributed to the article. - link: link to the original news article. - short_description: Abstract of the news article. - date: publication date of the article.

There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

POLITICS: 35602
WELLNESS: 17945
ENTERTAINMENT: 17362
TRAVEL: 9900
STYLE & BEAUTY: 9814
PARENTING: 8791
HEALTHY LIVING: 6694
QUEER VOICES: 6347
FOOD & DRINK: 6340
BUSINESS: 5992
COMEDY: 5400
SPORTS: 5077
BLACK VOICES: 4583
HOME & LIVING: 4320
PARENTS: 3955

Citation

If you're using this dataset for your work, please cite the following articles:

Citation in text format: 1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022). 2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

Citation in BibTex format: @article{misra2022news, title={News Category Dataset}, author={Misra, Rishabh}, journal={arXiv preprint arXiv:2209.11429}, year={2022} } @book{misra2021sculpting, author = {Misra, Rishabh and Grover, Jigyasa}, year = {2021}, month = {01}, pages = {}, title = {Sculpting Data for ML: The first act of Machine Learning}, isbn = {9798585463570} }

Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

Acknowledgements

This dataset was collected from HuffPost.

Inspiration

Can you categorize news articles based on their headlines and short descriptions?
Do news articles from different categories have different writing styles?
A classifier trained on this dataset could be used on a free text to identify the type of language being used.

Want to contribute your own datasets?

If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

Other datasets

Please also checkout the following datasets collected by me:

Clear search

Close search

Google apps

Main menu

News Category Dataset

Context

Content

Citation

Acknowledgements

Inspiration

Want to contribute your own datasets?

Other datasets

Dataset for Machine Learning Assisted Citation Screening for Systematic...

PhishingWebsites

UCI ML Parkinsons dataset

Instances

Features

JARVIS ML Training Data

mushroom

Description

Source

Dataset description

Attributes Information

Relevant papers

3M+ Academic Papers: Titles & Abstracts

Comprehensive Academic Papers Dataset: 3M+ Research Paper Titles and Abstracts

📋 Overview

🎯 Key Features

📊 Dataset Statistics

🗂️ Dataset Structure

🔄 Data Processing Pipeline

💡 Use Cases

🔗 Data Sources and Attribution

Kaggle Datasets:

Hugging Face Datasets:

🔄 Update Schedule

📄 License and Usage

Data sets and machine learning models for: Predicting critical properties of...

PMOA-CITE dataset

arrhythmia

Attribute Information

Galilee subregion groundwater usage estimates dataset v01

Abstract

Dataset History

Dataset Citation

Dataset Ancestors

Data from: Annotated dataset of simulated voiding sound for urine flow...

Annotated dataset of simulated voiding sound for urine flow estimation

Overview

Experimental Setup

Dataset Structure

Example:

Silence Reference Recordings

Example:

Purpose

Flow Generation

Recording Devices

Recording Environment

Data Collection Protocol

Citation

License

Contact

kr-vs-kp

IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

Article Information

Abstract

ZIP Folder Content

Datasets' Content

Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT...

PineTime heart rate dataset

E-Commerce Customer Behavior & Sales Analysis -TR

2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...

2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...

Data from: Large-scale integration of single-cell transcriptomic data...

News Category Dataset

Identify the type of news based on headlines and short descriptions

Context

Content

Citation

Acknowledgements

Inspiration

Want to contribute your own datasets?

Other datasets