Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for journal recommendation; it includes title, abstract, keywords, and journal.
We extracted the journals and additional metadata from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
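As an illustration of this per-journal split, the following is a minimal sketch (not the script used to produce the dataset), assuming the filtered articles are held in a pandas DataFrame with a 'journal' column:

```python
# Hypothetical sketch of a per-journal 60/20/20 split; `df` and its 'journal'
# column are assumptions about how the filtered articles are stored.
import pandas as pd

def split_per_journal(df: pd.DataFrame, seed: int = 42):
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("journal"):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within the journal
        n_train = int(0.6 * len(group))
        n_val = int(0.2 * len(group))
        train_parts.append(group.iloc[:n_train])
        val_parts.append(group.iloc[n_train:n_train + n_val])
        test_parts.append(group.iloc[n_train + n_val:])
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```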
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aerial Image Water Resources (AIWR) Dataset
According to the land use code standard of the fundamental geographic data set (FGDS), land use classification in Thailand requires analysis and transformation of satellite image data together with field survey data. In this article, the researchers studied only land use for water bodies, which are divided into 2 levels: natural bodies of water (W1) and artificial bodies of water (W2).
The aerial image data used in this research was at a scale of 1:50 meters, and every aerial image was 650x650 pixels. The images include water bodies of types W1 and W2. Ground truth for all aerial images was prepared and then analyzed and interpreted by remote sensing experts, which assured that the water body groupings were correct. The ground truth, checked by experts, was used to train the deep learning models and in further evaluation.
The aerial images used in the experiment consist of water bodies of types W1 and W2. The Aerial Image Water Resources (AIWR) dataset contains 800 images. The data were chosen at random and divided into 3 sets: training, validation, and test, with an 8:1:1 ratio. Therefore, 640 aerial images were used for learning and creating the model, 80 images were used for validation, and the remaining 80 images were used for testing.
The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat 8 OLI and TIRS, covering diverse biomes. This variety supports cloud detection and removal in complex environments. The dataset includes manually generated cloud masks with pixel-level annotations for cloud shadow, clear sky, thin clouds, and cloud areas. Each scene is cropped into 512×512 pixel patches and split into training, validation, and test sets (6:2:2 ratio). It is a valuable resource for training and evaluating fine-grained cloud segmentation models across various terrains.
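As a sketch of the patching step described above (the 512×512 window size comes from the description; everything else is an assumption), non-overlapping patches can be cut from a scene array as follows:

```python
# Hypothetical sketch: cut a Landsat scene (and, analogously, its cloud mask)
# into non-overlapping 512x512 patches. `scene` is assumed to be a NumPy array
# of shape (height, width, bands).
import numpy as np

def crop_patches(scene: np.ndarray, size: int = 512) -> np.ndarray:
    patches = []
    height, width = scene.shape[:2]
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append(scene[top:top + size, left:left + size])
    return np.stack(patches)
```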
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
HaluEval-SFT Dataset
The HaluEval-SFT dataset is derived from HaluEval (https://github.com/RUCAIBox/HaluEval) and focuses on enhancing model capabilities in recognizing hallucinations. The dataset comprises a total of 65,000 data points, partitioned into training, validation, and test sets with a ratio of 0.7/0.15/0.15, respectively.
Getting Started
from datasets import load_dataset
dataset = load_dataset('jzjiao/halueval-sft', split=["train"])
Dataset Description… See the full description on the dataset page: https://huggingface.co/datasets/jzjiao/halueval-sft.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Idiomatic expressions are built into all languages and are common in ordinary conversation. Idioms are difficult to understand because their meaning cannot be deduced directly from their constituent words. Previous studies reported that idiomatic expressions affect many natural language processing tasks in the Amharic language. However, most natural language processing models used with the Amharic language, such as machine translation, semantic analysis, sentiment analysis, information retrieval, question answering, and next-word prediction, do not consider idiomatic expressions. As a result, in this paper, we propose a convolutional neural network (CNN) with a FastText embedding model for detecting idioms in Amharic text. We collected 1700 idiomatic and 1600 non-idiomatic expressions from Amharic books to test the proposed model's performance, and the proposed model was then evaluated using this dataset. We employed an 80/10/10 split ratio to train, validate, and test the proposed idiom recognition model. The proposed model's learning accuracy on the training dataset is 98%, and the model achieves 80% accuracy on the testing dataset. We compared the proposed model to machine learning models such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest classifiers. According to the experimental results, the proposed model produces promising results.
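As a rough illustration of the approach described above (not the authors' implementation), a CNN over FastText token vectors could be sketched as follows; the embedding size, sequence length, and layer sizes are assumptions:

```python
# Hypothetical sketch of an idiom classifier combining FastText embeddings with a 1D CNN.
import numpy as np
from gensim.models import FastText
from tensorflow.keras import layers, models

# Placeholder tokenized expressions and labels (1 = idiomatic, 0 = non-idiomatic).
sentences = [["idiom_tok1", "idiom_tok2"], ["plain_tok1", "plain_tok2"]]
labels = np.array([1, 0])

ft = FastText(sentences=sentences, vector_size=100, window=3, min_count=1, epochs=10)

MAX_LEN = 10
def encode(tokens):
    vecs = [ft.wv[t] for t in tokens[:MAX_LEN]]
    vecs += [np.zeros(100)] * (MAX_LEN - len(vecs))  # pad to a fixed length
    return np.stack(vecs)

X = np.stack([encode(s) for s in sentences])

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, 100)),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=2, verbose=0)
```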
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: To develop and validate a machine learning (ML)-based model for predicting stroke-associated pneumonia (SAP) risk in older adult hemorrhagic stroke patients.
Methods: A retrospective collection of older adult hemorrhagic stroke patients from three tertiary hospitals in Guiyang, Guizhou Province (January 2019–December 2022) formed the modeling cohort, randomly split into training and internal validation sets (7:3 ratio). External validation utilized retrospective data from January–December 2023. After univariate and multivariate regression analyses, four ML models (Logistic Regression, XGBoost, Naive Bayes, and SVM) were constructed. Receiver operating characteristic (ROC) curves and area under the curve (AUC) were calculated for the training and internal validation sets. Model performance was compared using Delong's test or the Bootstrap test, while sensitivity, specificity, accuracy, precision, recall, and F1-score evaluated predictive efficacy. Calibration curves assessed model calibration. The optimal model underwent external validation using ROC and calibration curves.
Results: A total of 788 older adult hemorrhagic stroke patients were enrolled, divided into a training set (n = 462), an internal validation set (n = 196), and an external validation set (n = 130). The incidence of SAP in older adult patients with hemorrhagic stroke was 46.7% (368/788). Advanced age [OR = 1.064, 95% CI (1.024, 1.104)], smoking [OR = 2.488, 95% CI (1.460, 4.24)], low GCS score [OR = 0.675, 95% CI (0.553, 0.825)], low Braden score [OR = 0.741, 95% CI (0.640, 0.858)], and nasogastric tube [OR = 1.761, 95% CI (1.048, 2.960)] were identified as risk factors for SAP. Among the four machine learning algorithms evaluated [XGBoost, Logistic Regression (LR), Support Vector Machine (SVM), and Naive Bayes], the LR model demonstrated robust and consistent performance in predicting SAP among older adult patients with hemorrhagic stroke across multiple evaluation metrics. Furthermore, the model exhibited stable generalizability within the external validation cohort. Based on these findings, the LR framework was subsequently selected for external validation, accompanied by a nomogram visualization. The model achieved AUC values of 0.883 (training), 0.855 (internal validation), and 0.882 (external validation). The Hosmer-Lemeshow (H-L) test indicates that the calibration of the model is satisfactory in all three datasets, with P-values of 0.381, 0.142, and 0.066, respectively.
Conclusions: This study constructed and validated a risk prediction model for SAP in older adult patients with hemorrhagic stroke based on multi-center data. The results indicated that among the four machine learning algorithms (XGBoost, LR, SVM, and Naive Bayes), the LR model demonstrated the best and most stable predictive performance. Age, smoking, low GCS score, low Braden score, and nasogastric tube were identified as predictive factors for SAP in these patients. These indicators are easily obtainable in clinical practice and facilitate rapid bedside assessment. Through internal and external validation, the model was proven to have good generalization ability, and a nomogram was ultimately drawn to provide an objective and operational risk assessment tool for clinical nursing practice. It helps in the early identification of high-risk patients and guides targeted interventions, thereby reducing the incidence of SAP and improving patient prognosis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tomato Leaves Dataset
Overview
This dataset contains images of tomato leaves categorized into different classes based on the type of disease or health condition. The dataset is divided into training, validation, and test sets, with a ratio of 8:1:1. The classes include various diseases as well as healthy leaves. The dataset includes both augmented and non-augmented images.
Dataset Structure
The dataset is organized into three main splits:
train validation test… See the full description on the dataset page: https://huggingface.co/datasets/lorenzoxi/tomato-leaves-dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the human genome reference sequence, version GRCh38.p13, in order to have a reliable source of data on which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough; the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instance processing, negative instance generation, and data splitting by chromosome.
First, we need an interface to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to retrieve large amounts of data easily. It also lets us select a wide variety of relevant fields, including the transcription start and end sites. After filtering out instances with null values in any relevant field, the combination of each sequence and its flanks forms our raw dataset. Once the sequences are available, we locate the TSS position (given by Ensembl) and the 2 following bases and treat them as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated, giving the final sequence of 1003 nucleotides that is used in our models. These specific window values were used in (Bhandari et al., 2021), and we keep them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot obtain this kind of data in a straightforward manner, so we generate it synthetically. To obtain negative instances, i.e. sequences that do not represent a transcription start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected a specific position, we take the 700 bases before it and the 300 bases after it, as we did for the positive instances.
Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them
as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
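A minimal sketch of the windowing and negative sampling described above (the 700/3/300 window sizes and the 10:1 ratio come from the text; function and variable names are illustrative):

```python
# Hypothetical sketch: build one positive and ten negative 1003-nt windows from a
# transcript sequence. `sequence` (transcript plus flanks) and `tss_index` are
# assumptions about how the raw data is held in memory.
import random

UP, DOWN = 700, 300  # bases kept before the 3-base codon at the chosen position and after it

def window(sequence: str, pos: int):
    """Return the 700 + 3 + 300 = 1003-nt window around the codon starting at `pos`."""
    if pos - UP < 0 or pos + 3 + DOWN > len(sequence):
        return None  # not enough flanking sequence
    return sequence[pos - UP:pos + 3 + DOWN]

def make_instances(sequence: str, tss_index: int, n_negatives: int = 10, seed: int = 0):
    rng = random.Random(seed)
    positive = window(sequence, tss_index)
    negatives = []
    # Assumes the sequence is long enough to host non-TSS windows.
    while len(negatives) < n_negatives:
        pos = rng.randrange(UP, len(sequence) - 3 - DOWN)
        if pos != tss_index:
            negatives.append(window(sequence, pos))
    return positive, negatives
```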
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPID is a comprehensive dataset composed of synthetic particle image velocimetry (PIV) image pairs and their corresponding exact optical flow computations. It serves as a valuable resource for researchers and practitioners in the field. The dataset is organized into three subsets: training, validation, and test, distributed in a ratio of 70%, 15%, and 15%, respectively.
Each subset within SPID consists of an input denoted as "x", which comprises synthetic image pairs. These image pairs provide the necessary context for the optical flow computations. Additionally, an output termed "y" is provided, which represents the exact optical flow calculated for each image pair. Notably, the images within the dataset are single-channel, and the optical flow is decomposed into its u and v components.
The shape of the input subsets in SPID is given by (number of samples, number of frames, image width, image height, number of channels), representing the dimensions of the input data. On the other hand, the shape of the output subsets is given by (number of samples, velocity components, image width, image height), denoting the shape of the optical flow data.
It is important to mention that the SPID dataset is a preprocessed version of the Raw Synthetic Particle Image Dataset (RSPID), ensuring improved usability and reliability. Moreover, the dataset is packaged as NumPy compressed NPZ files, which conveniently store the inputs and outputs as separate files with the labels train, validation and test as access keys. This format simplifies data extraction and integration into machine learning frameworks and libraries, facilitating seamless usage of the dataset.
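For example, a minimal loading sketch with NumPy (the file names below are placeholders, not the official ones):

```python
# Hypothetical sketch of reading the SPID arrays; "spid_inputs.npz" and
# "spid_outputs.npz" are placeholder file names.
import numpy as np

inputs = np.load("spid_inputs.npz")    # access keys: "train", "validation", "test"
outputs = np.load("spid_outputs.npz")

x_train = inputs["train"]   # (samples, frames, width, height, channels)
y_train = outputs["train"]  # (samples, velocity components, width, height)
print(x_train.shape, y_train.shape)
```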
SPID incorporates various factors that impact PIV analysis to provide a comprehensive and realistic simulation. The dataset includes image pairs with an image width of 665 pixels and an image height of 630 pixels, ensuring a high level of detail and accuracy with an 8-bit depth. It incorporates different particle radii (1, 2, 3, and 4 pixels) and particle densities (15, 17, 20, 23, 25, and 32 particles) to capture diverse particle configurations.
To simulate real-world scenarios, SPID introduces displacement variations through the delta x factor, ranging from 0.05% to 0.25%. Noise levels (1, 5, 10, and 15) are also incorporated to mimic practical PIV measurements with varying degrees of noise. Furthermore, out-of-plane motion effects are considered with standard deviations of 0.01, 0.025, and 0.05 to assess their impact on optical flow accuracy.
The dataset covers a wide range of flow patterns encountered in fluid dynamics. It includes Rankine uniform, Rankine vortex, parabolic, stagnation, shear, and decaying vortex flows, allowing for comprehensive testing and evaluation of PIV algorithms across different scenarios.
By leveraging the SPID dataset, researchers can develop and validate PIV algorithms and techniques under various challenging conditions. Its realistic and diverse simulation of particle image velocimetry scenarios makes it an invaluable tool for advancing the field and improving the accuracy and reliability of optical flow computations.
The LIAR dataset has been widely used by fake news detection researchers since its release, and along with a great deal of research, the community has provided a variety of feedback to improve it. We adopted this feedback and released the LIAR2 dataset, a new benchmark dataset of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We used an 8:1:1 split ratio to divide the data into the training, test, and validation sets; details are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed at Huggingface and Github, and statistical information for LIAR and LIAR2 is provided in the table below:
Statistics | LIAR | LIAR2 |
---|---|---|
Training set size | 10,269 | 18,369 |
Validation set size | 1,284 | 2,297 |
Testing set size | 1,283 | 2,296 |
Avg. statement length (tokens) | 17.9 | 17.7 |
Avg. speaker description length (tokens) | \ | 39.4 |
Avg. justification length (tokens) | \ | 94.4 |
Labels | ||
Pants on fire | 1,050 | 3,031 |
False | 2,511 | 6,605 |
Barely-true | 2,108 | 3,603 |
Half-true | 2,638 | 3,709 |
Mostly-true | 2,466 | 3,429 |
True | 2,063 | 2,585 |
Ablation Experiment
The LIAR2 dataset is an upgrade of the LIAR dataset: it inherits the ideas of the LIAR dataset, refines the details and architecture, and expands the size of the dataset to make it more responsive to the needs of fake news detection tasks. We believe that, with the help of the LIAR2 dataset, it will be possible to perform fake news detection tasks better. The analysis and baseline information for the LIAR2 dataset is provided below.
Feature | Val. Accuracy | Val. F1-Macro | Val. F1-Micro | Test Accuracy | Test F1-Macro | Test F1-Micro | Mean |
---|---|---|---|---|---|---|---|
Statement | 0.3174 | 0.1957 | 0.3117 | 0.3197 | 0.2380 | 0.3197 | 0.2837 |
Date | 0.2912 | 0.1879 | 0.2912 | 0.3079 | 0.1775 | 0.3079 | 0.2606 |
Subject | 0.3243 | 0.2311 | 0.3183 | 0.3267 | 0.2271 | 0.3267 | 0.2924 |
Speaker | 0.3283 | 0.2250 | 0.3174 | 0.3310 | 0.2462 | 0.3310 | 0.2965 |
Speaker Description | 0.3322 | 0.2444 | 0.3250 | 0.3280 | 0.2444 | 0.3280 | 0.3003 |
State Info | 0.2930 | 0.1577 | 0.2950 | 0.2979 | 0.1521 | 0.2979 | 0.2489 |
Credibility History | 0.5007 | 0.4696 | 0.4985 | 0.5057 | 0.4656 | 0.5057 | 0.4910 |
Context | 0.2982 | 0.1817 | 0.2982 | 0.3132 | 0.1791 | 0.3132 | 0.2639 |
Justification | 0.5964 | 0.5657 | 0.5827 | 0.6115 | 0.5968 | 0.6115 | 0.5941 |
All without | |||||||
Statement | 0.7079 | 0.6734 | 0.6822 | 0.7182 | 0.7108 | 0.7182 | 0.7018 |
Date | 0.6931 | 0.6572 | 0.6680 | 0.7078 | 0.6993 | 0.7078 | 0.6889 |
Subject | 0.7000 | 0.6579 | 0.6681 | 0.7078 | 0.7013 | 0.7078 | 0.6905 |
Speaker | 0.6944 | 0.6648 | 0.6757 | 0.7043 | 0.6942 | 0.7043 | 0.6896 |
Speaker Description | 0.6892 | 0.6640 | 0.6739 | 0.7169 | 0.7073 | 0.7169 | 0.6947 |
State Info | 0.7074 | 0.6625 | 0.6729 | 0.7099 | 0.7016 | 0.7099 | 0.6940 |
Credibility History | 0.6025 | 0.5717 | 0.5900 | 0.6185 | 0.6046 | 0.6185 | 0.6010 |
Context | 0.7005 | 0.6622 | 0.6720 | 0.7043 | 0.6967 | 0.7043 | 0.6900 |
Justification | 0.5285 | 0.4898 | 0.5153 | 0.5340 | 0.5148 | 0.5340 | 0.5194 |
Statement + | |||||||
Date | 0.3431 | 0.2540 | 0.3343 | 0.3380 | 0.2514 | 0.3380 | 0.3098 |
Subject | 0.3548 | 0.2759 | 0.3513 | 0.3375 | 0.2580 | 0.3375 | 0.3192 |
Speaker | 0.3618 | 0.2862 | 0.3539 | 0.3476 | 0.2640 | 0.3476 | 0.3269 |
Speaker Description | 0.3583 | 0.2814 | 0.3531 | 0.3667 | 0.2886 | 0.3667 | 0.3358 |
State Info | 0.3317 | 0.2367 | 0.3294 | 0.3328 | 0.2362 | 0.3328 | 0.2999 |
Credibility History | 0.5067 | 0.4737 | 0.5084 | 0.5244 | 0.5000 | 0.5244 | 0.5063 |
Context | 0.3361 | 0.2682 | 0.3391 | 0.3458 | 0.2560 | 0.3458 | 0.3152 |
Justification | 0.6017 | 0.5578 | 0.5796 | 0.6176 | 0.6026 | 0.6176 | 0.5962 |
All | 0.6974 | 0.6570 | 0.6676 | 0.7021 | 0.6961 | 0.7021 | 0.6871 |
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
NeurIT Dataset is open-sourced for public research usage. It is collected using a customized robotic platform across three buildings. We collect the training, validation, and test-seen sets in Building A, and build the test-seen and test-unseen sets in Buildings B and C. During data collection, the robot moves at varying speeds up to a maximum of 1.5 m/s. The dataset contains 110 sequences, totaling around 15 hours of tracking data and corresponding to a travel distance of about 33.7 km. Each sequence lasts 6 to 10 minutes and contains both IMU data (accelerometer, gyroscope, magnetometer) and the ground-truth trajectory. The ratio of the training set, validation set, test-seen set, and test-unseen set is 15:3:3:4.
The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It includes 152,545 QA pairs from 21,793 TV show clips. The QA pairs are split into the ratio of 8:1:1 for training, validation, and test sets. The TVQA dataset provides the sequence of video frames extracted at 3 FPS, the corresponding subtitles with the video clips, and the query consisting of a question and four answer candidates. Among the four answer candidates, there is only one correct answer.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package classifies point clouds into tree and background classes. The model is optimized to work with New Zealand aerial LiDAR data. Classifying point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means; deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.
Licensing requirements
ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro
Using the model
The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning framework libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.
Input
The model is trained with classified LiDAR that follows the LINZ base specification, and the input data should be similar to this specification. Note: The model depends on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. The model is trained to work on classified and unclassified point clouds that are in a projected coordinate system in which the units of X, Y and Z are metric. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points; therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the class of interest versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives. The model was trained on airborne LiDAR datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block, and extra attributes should match those of the data originally used for training this model (see the Training data section below).
Output
The model classifies the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS): 0 Background; 5 Trees / High vegetation.
Applicable geographies
The model is expected to work well in New Zealand and has been seen to produce favorable results in many regions.
However, results can vary for datasets that are statistically dissimilar to the training data.

Dataset | City |
---|---|
Training | Wellington |
Testing | Tawa |
Validation/Evaluation | Christchurch |

Model architecture
This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.
Accuracy metrics
The table below summarizes the accuracy of the predictions on the validation dataset.

Class | Precision | Recall | F1-score |
---|---|---|---|
Never Classified | 0.991200 | 0.975404 | 0.983239 |
High Vegetation | 0.933569 | 0.975559 | 0.954102 |

Training data
This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. The train-test split percentage is {Train: 80%, Test: 20%}; this ratio was chosen based on analysis of previous epoch statistics, which showed a decent improvement. The training data used has the following characteristics:

Characteristic | Value |
---|---|
X, Y, and Z linear unit | Meter |
Z range | -121.69 m to 26.84 m |
Number of Returns | 1 to 5 |
Intensity | 16 to 65520 |
Point spacing | 0.2 ± 0.1 |
Scan angle | -15 to +15 |
Maximum points per block | 8192 |
Block size | 20 meters |
Class structure | [0, 5] |

Sample results
The model was used to classify the Christchurch city dataset, which has a density of 5 pts/m. The model's performance is directly proportional to the dataset's point density and to excluding noise from the point clouds. To learn how to use this model, see this story.
This is a machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS train set. This dataset is split into training, validation, and test folders which contain 2500, 270, and 500 fundus images in each class respectively. Each training set has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG).
Three versions of the same dataset are available with different standardization strategies:
RAW - Resizing the source image to 256x256 pixels.
PAD - Padding the source image to a square image and then resizing it to 256x256 pixels. This method preserves the aspect ratio but the resultant image contains less usable information.
CROP - Cropping the black background in the fundus image, padding the resultant image to create a square image, and then resizing to 256x256 pixels. This method preserves the aspect ratio and the resultant image contains the most usable information.
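The PAD and CROP strategies could be reproduced roughly as follows (a hedged sketch, not the dataset authors' code; the background threshold is an assumption):

```python
# Hypothetical sketch of the PAD and CROP standardization strategies with Pillow/NumPy.
import numpy as np
from PIL import Image

def pad_to_square(img: Image.Image) -> Image.Image:
    side = max(img.size)
    canvas = Image.new("RGB", (side, side))  # black canvas
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

def standardize_pad(img: Image.Image, size: int = 256) -> Image.Image:
    return pad_to_square(img).resize((size, size))

def standardize_crop(img: Image.Image, size: int = 256, threshold: int = 10) -> Image.Image:
    gray = np.asarray(img.convert("L"))
    rows = np.where(gray.max(axis=1) > threshold)[0]  # rows containing fundus pixels
    cols = np.where(gray.max(axis=0) > threshold)[0]  # columns containing fundus pixels
    cropped = img.crop((cols[0], rows[0], cols[-1] + 1, rows[-1] + 1))
    return pad_to_square(cropped).resize((size, size))
```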
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for First Impressions V2
The First Impressions dataset comprises 10,000 clips (average duration 15 s) extracted from more than 3,000 different YouTube high-definition (HD) videos of people facing and speaking in English to a camera. The videos are split into training, validation and test sets with a 3:1:1 ratio. People in the videos differ in gender, age, nationality, and ethnicity. Videos are labeled with personality trait variables. Amazon Mechanical Turk (AMT) was… See the full description on the dataset page: https://huggingface.co/datasets/yeray142/first-impressions-v2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set comprises time series data of gaze, head pose, hand pose, and object positions for object approaches with a given intention. The data was captured in the context of the following publication:
A Microsoft Hololens 2 was used for recording the data at 60 fps under the modalities explained in detail in the above-mentioned paper.
The file names are structured as follows:
Each data set contains the following columns. In each approach, 5 objects numbered from i=0 to i=4 are present.
Acknowledgment:
This work was supported by the ROBDEKON project of the German Federal Ministry of Education and Research.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of network flow features generated by the tool CICFlowMeter on network captures collected at the University of Belgrade, School of Electrical Engineering. The samples include scanning sessions of 4 DAST tools (Nikto, Vega, OWASP ZAP and Arachni) targeted at the OWASP WebGoat application. The DAST tools were installed on one virtual machine, while the target was placed on another, with all traffic being routed through a third machine which captured it using the tcpdump utility. For each of the scanners one session was captured, except Arachni, whose scanning phase was divided into 3 sessions. After processing the .pcap files with CICFlowMeter, the output for each of the sessions was split randomly into training, validation and test sets in a 60:20:20 ratio, respectively. In addition to the scanning, OWASP ZAP and Vega offer built-in proxy servers for HTTP traffic examination. Interactions of these utilities with the WebGoat application were also captured and are present in the dast proxies folder. Finally, in the shortened flows folder, for each of the scanning sessions a subset of flows was pruned to 10, 15, 20, 25 and 50 packets. Features were extracted using CICFlowMeter once again, to allow for analysis of flow statistics at different points in time.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for ASCEND
Dataset Summary
ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong. ASCEND consists of 10.62 hours of spontaneous speech with a total of ~12.3K utterances. The corpus is split into 3 sets: training, validation, and test with a ratio of 8:1:1 while maintaining a balanced gender proportion on each set.… See the full description on the dataset page: https://huggingface.co/datasets/CAiRE/ASCEND.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, the wikikg dataset was created from OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate a relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% compared to the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time, the more complex the task is. The Wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files:
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is supposed to be evaluated in the inference-only regime, after pre-training solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
QWQ-LongCOT-AIMO is a derived dataset created by processing the amphora/QwQ-LongCoT-130K dataset. It filters the original dataset to focus specifically on question-answering pairs where the final answer is a numerical value between 0 and 999, explicitly marked using the \boxed{...} format within the original chain-of-thought answer.
Dataset Structure
Data Splits
The dataset is split into training, validation, and test sets with an 80/10/10 ratio based on the filtered… See the full description on the dataset page: https://huggingface.co/datasets/Floppanacci/QWQ-LongCOT-AIMO.
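A hedged sketch of how such a filter and split could be reproduced with the datasets library (the 'response' column name is an assumption about the source dataset's schema):

```python
# Hypothetical sketch: keep only examples whose chain-of-thought answer contains
# \boxed{N} with 0 <= N <= 999, then make an 80/10/10 split.
import re
from datasets import load_dataset

BOXED = re.compile(r"\\boxed\{(\d+)\}")

def keep(example):
    match = BOXED.search(example["response"])  # "response" column name is an assumption
    return match is not None and 0 <= int(match.group(1)) <= 999

source = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
filtered = source.filter(keep)
splits = filtered.train_test_split(test_size=0.2, seed=42)            # 80% train
held_out = splits["test"].train_test_split(test_size=0.5, seed=42)    # 10% validation / 10% test
```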