Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for journal recommendation; it includes title, abstract, keywords, and journal.
We extracted the journals and additional metadata from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
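As an illustration of this per-journal split, the following is a minimal sketch (not the script used to produce the dataset), assuming the filtered articles are held in a pandas DataFrame with a 'journal' column:

```python
# Hypothetical sketch of a per-journal 60/20/20 split; `df` and its 'journal'
# column are assumptions about how the filtered articles are stored.
import pandas as pd

def split_per_journal(df: pd.DataFrame, seed: int = 42):
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("journal"):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within the journal
        n_train = int(0.6 * len(group))
        n_val = int(0.2 * len(group))
        train_parts.append(group.iloc[:n_train])
        val_parts.append(group.iloc[n_train:n_train + n_val])
        test_parts.append(group.iloc[n_train + n_val:])
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```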
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aerial Image Water Resources (AIWR) Dataset
According to the land use code standard of the fundamental geographic data set (FGDS), land use classification in Thailand requires analysis and transformation of satellite image data together with field survey data. In this article, the researchers studied only land use for water bodies, which are divided into 2 levels: natural bodies of water (W1) and artificial bodies of water (W2).
The aerial image data used in this research was at a scale of 1:50 meters, and every aerial image was 650x650 pixels. The images include water bodies of types W1 and W2. Ground truth for all aerial images was prepared and then analyzed and interpreted by remote sensing experts, which assured that the water body groupings were correct. The ground truth, checked by experts, was used to train the deep learning models and in further evaluation.
The aerial images used in the experiment consist of water bodies of types W1 and W2. The Aerial Image Water Resources (AIWR) dataset contains 800 images. The data were chosen at random and divided into 3 sets: training, validation, and test, with an 8:1:1 ratio. Therefore, 640 aerial images were used for learning and creating the model, 80 images were used for validation, and the remaining 80 images were used for testing.
The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat 8 OLI and TIRS, covering diverse biomes. This variety supports cloud detection and removal in complex environments. The dataset includes manually generated cloud masks with pixel-level annotations for cloud shadow, clear sky, thin clouds, and cloud areas. Each scene is cropped into 512×512 pixel patches and split into training, validation, and test sets (6:2:2 ratio). It is a valuable resource for training and evaluating fine-grained cloud segmentation models across various terrains.
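As a sketch of the patching step described above (the 512×512 window size comes from the description; everything else is an assumption), non-overlapping patches can be cut from a scene array as follows:

```python
# Hypothetical sketch: cut a Landsat scene (and, analogously, its cloud mask)
# into non-overlapping 512x512 patches. `scene` is assumed to be a NumPy array
# of shape (height, width, bands).
import numpy as np

def crop_patches(scene: np.ndarray, size: int = 512) -> np.ndarray:
    patches = []
    height, width = scene.shape[:2]
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append(scene[top:top + size, left:left + size])
    return np.stack(patches)
```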
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
HaluEval-SFT Dataset
The HaluEval-SFT dataset is derived from HaluEval (https://github.com/RUCAIBox/HaluEval) and focuses on enhancing model capabilities in recognizing hallucinations. The dataset comprises a total of 65,000 data points, partitioned into training, validation, and test sets with a ratio of 0.7/0.15/0.15, respectively.
Getting Started
from datasets import load_dataset
dataset = load_dataset('jzjiao/halueval-sft', split=["train"])
Dataset Description… See the full description on the dataset page: https://huggingface.co/datasets/jzjiao/halueval-sft.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Idiomatic expressions are built into all languages and are common in ordinary conversation. Idioms are difficult to understand because their meaning cannot be deduced directly from their constituent words. Previous studies reported that idiomatic expressions affect many natural language processing tasks in the Amharic language. However, most natural language processing models used with the Amharic language, such as machine translation, semantic analysis, sentiment analysis, information retrieval, question answering, and next-word prediction, do not consider idiomatic expressions. As a result, in this paper, we propose a convolutional neural network (CNN) with a FastText embedding model for detecting idioms in Amharic text. We collected 1700 idiomatic and 1600 non-idiomatic expressions from Amharic books to test the proposed model's performance, and the proposed model was then evaluated using this dataset. We employed an 80/10/10 split ratio to train, validate, and test the proposed idiom recognition model. The proposed model's learning accuracy on the training dataset is 98%, and the model achieves 80% accuracy on the testing dataset. We compared the proposed model to machine learning models such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest classifiers. According to the experimental results, the proposed model produces promising results.
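As a rough illustration of the approach described above (not the authors' implementation), a CNN over FastText token vectors could be sketched as follows; the embedding size, sequence length, and layer sizes are assumptions:

```python
# Hypothetical sketch of an idiom classifier combining FastText embeddings with a 1D CNN.
import numpy as np
from gensim.models import FastText
from tensorflow.keras import layers, models

# Placeholder tokenized expressions and labels (1 = idiomatic, 0 = non-idiomatic).
sentences = [["idiom_tok1", "idiom_tok2"], ["plain_tok1", "plain_tok2"]]
labels = np.array([1, 0])

ft = FastText(sentences=sentences, vector_size=100, window=3, min_count=1, epochs=10)

MAX_LEN = 10
def encode(tokens):
    vecs = [ft.wv[t] for t in tokens[:MAX_LEN]]
    vecs += [np.zeros(100)] * (MAX_LEN - len(vecs))  # pad to a fixed length
    return np.stack(vecs)

X = np.stack([encode(s) for s in sentences])

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, 100)),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=2, verbose=0)
```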
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: To develop and validate a machine learning (ML)-based model for predicting stroke-associated pneumonia (SAP) risk in older adult hemorrhagic stroke patients.
Methods: A retrospective collection of older adult hemorrhagic stroke patients from three tertiary hospitals in Guiyang, Guizhou Province (January 2019–December 2022) formed the modeling cohort, randomly split into training and internal validation sets (7:3 ratio). External validation utilized retrospective data from January–December 2023. After univariate and multivariate regression analyses, four ML models (Logistic Regression, XGBoost, Naive Bayes, and SVM) were constructed. Receiver operating characteristic (ROC) curves and area under the curve (AUC) were calculated for the training and internal validation sets. Model performance was compared using Delong's test or the Bootstrap test, while sensitivity, specificity, accuracy, precision, recall, and F1-score evaluated predictive efficacy. Calibration curves assessed model calibration. The optimal model underwent external validation using ROC and calibration curves.
Results: A total of 788 older adult hemorrhagic stroke patients were enrolled, divided into a training set (n = 462), an internal validation set (n = 196), and an external validation set (n = 130). The incidence of SAP in older adult patients with hemorrhagic stroke was 46.7% (368/788). Advanced age [OR = 1.064, 95% CI (1.024, 1.104)], smoking [OR = 2.488, 95% CI (1.460, 4.24)], low GCS score [OR = 0.675, 95% CI (0.553, 0.825)], low Braden score [OR = 0.741, 95% CI (0.640, 0.858)], and nasogastric tube [OR = 1.761, 95% CI (1.048, 2.960)] were identified as risk factors for SAP. Among the four machine learning algorithms evaluated [XGBoost, Logistic Regression (LR), Support Vector Machine (SVM), and Naive Bayes], the LR model demonstrated robust and consistent performance in predicting SAP among older adult patients with hemorrhagic stroke across multiple evaluation metrics. Furthermore, the model exhibited stable generalizability within the external validation cohort. Based on these findings, the LR framework was subsequently selected for external validation, accompanied by a nomogram visualization. The model achieved AUC values of 0.883 (training), 0.855 (internal validation), and 0.882 (external validation). The Hosmer-Lemeshow (H-L) test indicates that the calibration of the model is satisfactory in all three datasets, with P-values of 0.381, 0.142, and 0.066, respectively.
Conclusions: This study constructed and validated a risk prediction model for SAP in older adult patients with hemorrhagic stroke based on multi-center data. The results indicated that among the four machine learning algorithms (XGBoost, LR, SVM, and Naive Bayes), the LR model demonstrated the best and most stable predictive performance. Age, smoking, low GCS score, low Braden score, and nasogastric tube were identified as predictive factors for SAP in these patients. These indicators are easily obtainable in clinical practice and facilitate rapid bedside assessment. Through internal and external validation, the model was proven to have good generalization ability, and a nomogram was ultimately drawn to provide an objective and operational risk assessment tool for clinical nursing practice. It helps in the early identification of high-risk patients and guides targeted interventions, thereby reducing the incidence of SAP and improving patient prognosis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tomato Leaves Dataset
Overview
This dataset contains images of tomato leaves categorized into different classes based on the type of disease or health condition. The dataset is divided into training, validation, and test sets, with a ratio of 8:1:1. The classes include various diseases as well as healthy leaves. The dataset includes both augmented and non-augmented images.
Dataset Structure
The dataset is organized into three main splits:
train validation test… See the full description on the dataset page: https://huggingface.co/datasets/lorenzoxi/tomato-leaves-dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the human genome reference sequence, version GRCh38.p13, in order to have a reliable source of data on which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough; the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instance processing, negative instance generation, and data splitting by chromosome.
First, we need an interface to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to retrieve large amounts of data easily. It also lets us select a wide variety of relevant fields, including the transcription start and end sites. After filtering out instances with null values in any relevant field, the combination of each sequence and its flanks forms our raw dataset. Once the sequences are available, we locate the TSS position (given by Ensembl) and the 2 following bases and treat them as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated, giving the final sequence of 1003 nucleotides that is used in our models. These specific window values were used in (Bhandari et al., 2021), and we keep them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot obtain this kind of data in a straightforward manner, so we generate it synthetically. To obtain negative instances, i.e. sequences that do not represent a transcription start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected a specific position, we take the 700 bases before it and the 300 bases after it, as we did for the positive instances.
Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them
as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
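A minimal sketch of the windowing and negative sampling described above (the 700/3/300 window sizes and the 10:1 ratio come from the text; function and variable names are illustrative):

```python
# Hypothetical sketch: build one positive and ten negative 1003-nt windows from a
# transcript sequence. `sequence` (transcript plus flanks) and `tss_index` are
# assumptions about how the raw data is held in memory.
import random

UP, DOWN = 700, 300  # bases kept before the 3-base codon at the chosen position and after it

def window(sequence: str, pos: int):
    """Return the 700 + 3 + 300 = 1003-nt window around the codon starting at `pos`."""
    if pos - UP < 0 or pos + 3 + DOWN > len(sequence):
        return None  # not enough flanking sequence
    return sequence[pos - UP:pos + 3 + DOWN]

def make_instances(sequence: str, tss_index: int, n_negatives: int = 10, seed: int = 0):
    rng = random.Random(seed)
    positive = window(sequence, tss_index)
    negatives = []
    # Assumes the sequence is long enough to host non-TSS windows.
    while len(negatives) < n_negatives:
        pos = rng.randrange(UP, len(sequence) - 3 - DOWN)
        if pos != tss_index:
            negatives.append(window(sequence, pos))
    return positive, negatives
```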
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPID is a comprehensive dataset composed of synthetic particle image velocimetry (PIV) image pairs and their corresponding exact optical flow computations. It serves as a valuable resource for researchers and practitioners in the field. The dataset is organized into three subsets: training, validation, and test, distributed in a ratio of 70%, 15%, and 15%, respectively.
Each subset within SPID consists of an input denoted as "x", which comprises synthetic image pairs. These image pairs provide the necessary context for the optical flow computations. Additionally, an output termed "y" is provided, which represents the exact optical flow calculated for each image pair. Notably, the images within the dataset are single-channel, and the optical flow is decomposed into its u and v components.
The shape of the input subsets in SPID is given by (number of samples, number of frames, image width, image height, number of channels), representing the dimensions of the input data. On the other hand, the shape of the output subsets is given by (number of samples, velocity components, image width, image height), denoting the shape of the optical flow data.
It is important to mention that the SPID dataset is a preprocessed version of the Raw Synthetic Particle Image Dataset (RSPID), ensuring improved usability and reliability. Moreover, the dataset is packaged as NumPy compressed NPZ files, which conveniently store the inputs and outputs as separate files with the labels train, validation and test as access keys. This format simplifies data extraction and integration into machine learning frameworks and libraries, facilitating seamless usage of the dataset.
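For example, a minimal loading sketch with NumPy (the file names below are placeholders, not the official ones):

```python
# Hypothetical sketch of reading the SPID arrays; "spid_inputs.npz" and
# "spid_outputs.npz" are placeholder file names.
import numpy as np

inputs = np.load("spid_inputs.npz")    # access keys: "train", "validation", "test"
outputs = np.load("spid_outputs.npz")

x_train = inputs["train"]   # (samples, frames, width, height, channels)
y_train = outputs["train"]  # (samples, velocity components, width, height)
print(x_train.shape, y_train.shape)
```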
SPID incorporates various factors that impact PIV analysis to provide a comprehensive and realistic simulation. The dataset includes image pairs with an image width of 665 pixels and an image height of 630 pixels, ensuring a high level of detail and accuracy with an 8-bit depth. It incorporates different particle radii (1, 2, 3, and 4 pixels) and particle densities (15, 17, 20, 23, 25, and 32 particles) to capture diverse particle configurations.
To simulate real-world scenarios, SPID introduces displacement variations through the delta x factor, ranging from 0.05% to 0.25%. Noise levels (1, 5, 10, and 15) are also incorporated to mimic practical PIV measurements with varying degrees of noise. Furthermore, out-of-plane motion effects are considered with standard deviations of 0.01, 0.025, and 0.05 to assess their impact on optical flow accuracy.
The dataset covers a wide range of flow patterns encountered in fluid dynamics. It includes Rankine uniform, Rankine vortex, parabolic, stagnation, shear, and decaying vortex flows, allowing for comprehensive testing and evaluation of PIV algorithms across different scenarios.
By leveraging the SPID dataset, researchers can develop and validate PIV algorithms and techniques under various challenging conditions. Its realistic and diverse simulation of particle image velocimetry scenarios makes it an invaluable tool for advancing the field and improving the accuracy and reliability of optical flow computations.
The LIAR dataset has been widely used by fake news detection researchers since its release, and along with a great deal of research, the community has provided a variety of feedback to improve it. We adopted this feedback and released the LIAR2 dataset, a new benchmark dataset of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We used an 8:1:1 split ratio to divide the data into the training, test, and validation sets; details are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed at Huggingface and Github, and statistical information for LIAR and LIAR2 is provided in the table below:
Statistics | LIAR | LIAR2 |
---|---|---|
Training set size | 10,269 | 18,369 |
Validation set size | 1,284 | 2,297 |
Testing set size | 1,283 | 2,296 |
Avg. statement length (tokens) | 17.9 | 17.7 |
Avg. speaker description length (tokens) | \ | 39.4 |
Avg. justification length (tokens) | \ | 94.4 |
Labels | ||
Pants on fire | 1,050 | 3,031 |
False | 2,511 | 6,605 |
Barely-true | 2,108 | 3,603 |
Half-true | 2,638 | 3,709 |
Mostly-true | 2,466 | 3,429 |
True | 2,063 | 2,585 |
Ablation Experiment
The LIAR2 dataset is an upgrade of the LIAR dataset: it inherits the ideas of the LIAR dataset, refines the details and architecture, and expands the size of the dataset to make it more responsive to the needs of fake news detection tasks. We believe that, with the help of the LIAR2 dataset, it will be possible to perform fake news detection tasks better. The analysis and baseline information for the LIAR2 dataset is provided below.
Feature | Val. Accuracy | Val. F1-Macro | Val. F1-Micro | Test Accuracy | Test F1-Macro | Test F1-Micro | Mean |
---|---|---|---|---|---|---|---|
Statement | 0.3174 | 0.1957 | 0.3117 | 0.3197 | 0.2380 | 0.3197 | 0.2837 |
Date | 0.2912 | 0.1879 | 0.2912 | 0.3079 | 0.1775 | 0.3079 | 0.2606 |
Subject | 0.3243 | 0.2311 | 0.3183 | 0.3267 | 0.2271 | 0.3267 | 0.2924 |
Speaker | 0.3283 | 0.2250 | 0.3174 | 0.3310 | 0.2462 | 0.3310 | 0.2965 |
Speaker Description | 0.3322 | 0.2444 | 0.3250 | 0.3280 | 0.2444 | 0.3280 | 0.3003 |
State Info | 0.2930 | 0.1577 | 0.2950 | 0.2979 | 0.1521 | 0.2979 | 0.2489 |
Credibility History | 0.5007 | 0.4696 | 0.4985 | 0.5057 | 0.4656 | 0.5057 | 0.4910 |
Context | 0.2982 | 0.1817 | 0.2982 | 0.3132 | 0.1791 | 0.3132 | 0.2639 |
Justification | 0.5964 | 0.5657 | 0.5827 | 0.6115 | 0.5968 | 0.6115 | 0.5941 |
All without | |||||||
Statement | 0.7079 | 0.6734 | 0.6822 | 0.7182 | 0.7108 | 0.7182 | 0.7018 |
Date | 0.6931 | 0.6572 | 0.6680 | 0.7078 | 0.6993 | 0.7078 | 0.6889 |
Subject | 0.7000 | 0.6579 | 0.6681 | 0.7078 | 0.7013 | 0.7078 | 0.6905 |
Speaker | 0.6944 | 0.6648 | 0.6757 | 0.7043 | 0.6942 | 0.7043 | 0.6896 |
Speaker Description | 0.6892 | 0.6640 | 0.6739 | 0.7169 | 0.7073 | 0.7169 | 0.6947 |
State Info | 0.7074 | 0.6625 | 0.6729 | 0.7099 | 0.7016 | 0.7099 | 0.6940 |
Credibility History | 0.6025 | 0.5717 | 0.5900 | 0.6185 | 0.6046 | 0.6185 | 0.6010 |
Context | 0.7005 | 0.6622 | 0.6720 | 0.7043 | 0.6967 | 0.7043 | 0.6900 |
Justification | 0.5285 | 0.4898 | 0.5153 | 0.5340 | 0.5148 | 0.5340 | 0.5194 |
Statement + | |||||||
Date | 0.3431 | 0.2540 | 0.3343 | 0.3380 | 0.2514 | 0.3380 | 0.3098 |
Subject | 0.3548 | 0.2759 | 0.3513 | 0.3375 | 0.2580 | 0.3375 | 0.3192 |
Speaker | 0.3618 | 0.2862 | 0.3539 | 0.3476 | 0.2640 | 0.3476 | 0.3269 |
Speaker Description | 0.3583 | 0.2814 | 0.3531 | 0.3667 | 0.2886 | 0.3667 | 0.3358 |
State Info | 0.3317 | 0.2367 | 0.3294 | 0.3328 | 0.2362 | 0.3328 | 0.2999 |
Credibility History | 0.5067 | 0.4737 | 0.5084 | 0.5244 | 0.5000 | 0.5244 | 0.5063 |
Context | 0.3361 | 0.2682 | 0.3391 | 0.3458 | 0.2560 | 0.3458 | 0.3152 |
Justification | 0.6017 | 0.5578 | 0.5796 | 0.6176 | 0.6026 | 0.6176 | 0.5962 |
All | 0.6974 | 0.6570 | 0.6676 | 0.7021 | 0.6961 | 0.7021 | 0.6871 |
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
NeurIT Dataset is open-sourced for public research usage. It is collected using a customized robotic platform across three buildings. We collect the training, validation, and test-seen sets in Building A, and build the test-seen and test-unseen sets in Buildings B and C. During data collection, the robot moves at varying speeds up to a maximum of 1.5 m/s. The dataset contains 110 sequences, totaling around 15 hours of tracking data and corresponding to a travel distance of about 33.7 km. Each sequence lasts 6 to 10 minutes and contains both IMU data (accelerometer, gyroscope, magnetometer) and the ground-truth trajectory. The ratio of the training set, validation set, test-seen set, and test-unseen set is 15:3:3:4.
The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It includes 152,545 QA pairs from 21,793 TV show clips. The QA pairs are split into the ratio of 8:1:1 for training, validation, and test sets. The TVQA dataset provides the sequence of video frames extracted at 3 FPS, the corresponding subtitles with the video clips, and the query consisting of a question and four answer candidates. Among the four answer candidates, there is only one correct answer.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package classifies point clouds into tree and background classes. The model is optimized to work with New Zealand aerial LiDAR data. Classifying point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means; deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.
Licensing requirements
ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro
Using the model
The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning framework libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.
Input
The model is trained with classified LiDAR that follows the LINZ base specification, and the input data should be similar to this specification. Note: The model depends on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. The model is trained to work on classified and unclassified point clouds that are in a projected coordinate system in which the units of X, Y and Z are metric. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points; therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the class of interest versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives. The model was trained on airborne LiDAR datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block, and extra attributes should match those of the data originally used for training this model (see the Training data section below).
Output
The model classifies the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS): 0 Background; 5 Trees / High vegetation.
Applicable geographies
The model is expected to work well in New Zealand and has been seen to produce favorable results in many regions.
However, results can vary for datasets that are statistically dissimilar to the training data.

Dataset | City |
---|---|
Training | Wellington |
Testing | Tawa |
Validation/Evaluation | Christchurch |

Model architecture
This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.
Accuracy metrics
The table below summarizes the accuracy of the predictions on the validation dataset.

Class | Precision | Recall | F1-score |
---|---|---|---|
Never Classified | 0.991200 | 0.975404 | 0.983239 |
High Vegetation | 0.933569 | 0.975559 | 0.954102 |

Training data
This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. The train-test split percentage is {Train: 80%, Test: 20%}; this ratio was chosen based on analysis of previous epoch statistics, which showed a decent improvement. The training data used has the following characteristics:

Characteristic | Value |
---|---|
X, Y, and Z linear unit | Meter |
Z range | -121.69 m to 26.84 m |
Number of Returns | 1 to 5 |
Intensity | 16 to 65520 |
Point spacing | 0.2 ± 0.1 |
Scan angle | -15 to +15 |
Maximum points per block | 8192 |
Block size | 20 meters |
Class structure | [0, 5] |

Sample results
The model was used to classify the Christchurch city dataset, which has a density of 5 pts/m. The model's performance is directly proportional to the dataset's point density and to excluding noise from the point clouds. To learn how to use this model, see this story.
This is a machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS train set. This dataset is split into training, validation, and test folders which contain 2500, 270, and 500 fundus images in each class respectively. Each training set has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG).
Three versions of the same dataset are available with different standardization strategies:
RAW - Resizing the source image to 256x256 pixels.
PAD - Padding the source image to a square image and then resizing it to 256x256 pixels. This method preserves the aspect ratio but the resultant image contains less usable information.
CROP - Cropping the black background in the fundus image, padding the resultant image to create a square image, and then resizing to 256x256 pixels. This method preserves the aspect ratio and the resultant image contains the most usable information.
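The PAD and CROP strategies could be reproduced roughly as follows (a hedged sketch, not the dataset authors' code; the background threshold is an assumption):

```python
# Hypothetical sketch of the PAD and CROP standardization strategies with Pillow/NumPy.
import numpy as np
from PIL import Image

def pad_to_square(img: Image.Image) -> Image.Image:
    side = max(img.size)
    canvas = Image.new("RGB", (side, side))  # black canvas
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

def standardize_pad(img: Image.Image, size: int = 256) -> Image.Image:
    return pad_to_square(img).resize((size, size))

def standardize_crop(img: Image.Image, size: int = 256, threshold: int = 10) -> Image.Image:
    gray = np.asarray(img.convert("L"))
    rows = np.where(gray.max(axis=1) > threshold)[0]  # rows containing fundus pixels
    cols = np.where(gray.max(axis=0) > threshold)[0]  # columns containing fundus pixels
    cropped = img.crop((cols[0], rows[0], cols[-1] + 1, rows[-1] + 1))
    return pad_to_square(cropped).resize((size, size))
```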
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for First Impressions V2
The First Impressions dataset comprises 10,000 clips (average duration 15 s) extracted from more than 3,000 different YouTube high-definition (HD) videos of people facing and speaking in English to a camera. The videos are split into training, validation and test sets with a 3:1:1 ratio. People in the videos differ in gender, age, nationality, and ethnicity. Videos are labeled with personality trait variables. Amazon Mechanical Turk (AMT) was… See the full description on the dataset page: https://huggingface.co/datasets/yeray142/first-impressions-v2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set comprises time series data of gaze, head pose, hand pose, and object positions for object approaches with a given intention. The data was captured in the context of the following publication:
A Microsoft Hololens 2 was used for recording the data at 60 fps under the modalities explained in detail in the above-mentioned paper.
The file names are structured as follows:
Each data set contains the following columns. In each approach, 5 objects numbered from i=0 to i=4 are present.
Acknowledgment:
This work was supported by the ROBDEKON project of the German Federal Ministry of Education and Research.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of network flow features generated by the tool CICFlowMeter on network captures collected at the University of Belgrade, School of Electrical Engineering. The samples include scanning sessions of 4 DAST tools (Nikto, Vega, OWASP ZAP and Arachni) targeted at the OWASP WebGoat application. The DAST tools were installed on one virtual machine, while the target was placed on another, with all traffic being routed through a third machine which captured it using the tcpdump utility. For each of the scanners one session was captured, except Arachni, whose scanning phase was divided into 3 sessions. After processing the .pcap files with CICFlowMeter, the output for each of the sessions was split randomly into training, validation and test sets in a 60:20:20 ratio, respectively. In addition to the scanning, OWASP ZAP and Vega offer built-in proxy servers for HTTP traffic examination. Interactions of these utilities with the WebGoat application were also captured and are present in the dast proxies folder. Finally, in the shortened flows folder, for each of the scanning sessions a subset of flows was pruned to 10, 15, 20, 25 and 50 packets. Features were extracted using CICFlowMeter once again, to allow for analysis of flow statistics at different points in time.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for ASCEND
Dataset Summary
ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong. ASCEND consists of 10.62 hours of spontaneous speech with a total of ~12.3K utterances. The corpus is split into 3 sets: training, validation, and test with a ratio of 8:1:1 while maintaining a balanced gender proportion on each set.… See the full description on the dataset page: https://huggingface.co/datasets/CAiRE/ASCEND.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, the wikikg dataset was created from OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate a relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% compared to the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time, the more complex the task is. The Wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files:
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is supposed to be evaluated in the inference-only regime, after pre-training solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
QWQ-LongCOT-AIMO is a derived dataset created by processing the amphora/QwQ-LongCoT-130K dataset. It filters the original dataset to focus specifically on question-answering pairs where the final answer is a numerical value between 0 and 999, explicitly marked using the \boxed{...} format within the original chain-of-thought answer.
Dataset Structure
Data Splits
The dataset is split into training, validation, and test sets with an 80/10/10 ratio based on the filtered… See the full description on the dataset page: https://huggingface.co/datasets/Floppanacci/QWQ-LongCOT-AIMO.
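A hedged sketch of how such a filter and split could be reproduced with the datasets library (the 'response' column name is an assumption about the source dataset's schema):

```python
# Hypothetical sketch: keep only examples whose chain-of-thought answer contains
# \boxed{N} with 0 <= N <= 999, then make an 80/10/10 split.
import re
from datasets import load_dataset

BOXED = re.compile(r"\\boxed\{(\d+)\}")

def keep(example):
    match = BOXED.search(example["response"])  # "response" column name is an assumption
    return match is not None and 0 <= int(match.group(1)) <= 999

source = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
filtered = source.filter(keep)
splits = filtered.train_test_split(test_size=0.2, seed=42)            # 80% train
held_out = splits["test"].train_test_split(test_size=0.5, seed=42)    # 10% validation / 10% test
```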