License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Context: The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content: The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
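A minimal sketch of how the manifest files could be consumed with pandas and PyTorch follows; the column names (image_path, label) come from the description above, while the root path and the absence of transforms are assumptions.

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Caltech256Manifest(Dataset):
    # Wraps train.csv or test.csv; each row maps an image path to its label.
    def __init__(self, csv_file, root="Caltech-256_Train_Test", transform=None):
        self.df = pd.read_csv(csv_file)   # columns: image_path, label
        self.root = root
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(f"{self.root}/{row['image_path']}").convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, row["label"]

train_ds = Caltech256Manifest("train.csv")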
Data Split: The dataset has been partitioned into a standard 80% training / 20% testing split. The split is intended to be stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
Acknowledgements & Original Source: This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A.D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
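A minimal sketch of the workflow described above, assuming the usual Kaggle column capitalization (PassengerId, Pclass, Sex, Survived); the feature choice and model are purely illustrative.

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Encode sex numerically and use two example features
for df in (train, test):
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex"]

model = LogisticRegression().fit(train[features], train["Survived"])

# Write predictions in the same format as gender_submission.csv
pd.DataFrame({"PassengerId": test["PassengerId"],
              "Survived": model.predict(test[features])}).to_csv("submission.csv", index=False)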
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset from "A Deep Learning Framework for Verilog Autocompletion Towards Design and Verification Automation", which was first presented as a WiP paper at DAC 2023 and now accepted to the IEEE SOCC 2025 special session on "AI-Enhanced Semiconductor Manufacturing: Intelligent Solutions for Next-Generation Fabrication". To address the scarcity of publicly available Verilog code for training machine learning models, this study introduces a novel dataset specifically curated for Verilog autocompletion tasks. The dataset comprises over 100k Verilog files and 140k code snippets sourced from open-source repositories with permissive licenses (a list of which is available in permissive_all_deduplicated_repos.csv). It includes three subsets: file-level data, snippet-level data, and labeled definition-body pairs, each split into training, validation, and test sets. The dataset was meticulously filtered to remove autogenerated content, non-compliant licenses, and near-duplicate files, ensuring high-quality and diverse training material. Snippets were extracted using regular expressions, and additional quality control was applied by selecting files from repositories with at least one GitHub star for evaluation splits. This dataset serves as the foundation for fine-tuning pretrained language models toward Verilog code generation, enabling more effective automation in electronic design and verification workflows. More details about the dataset process can be found in the related research paper. A zipped copy of the github repository (https://github.com/99EnriqueD/verilog_autocompletion) containing code to replicate the dataset creation process has also been included in this dataset.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source code and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2_S models with Kinetics-400 weights. To simplify the implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples of 16 sequenced 3-channel images, resized to 224 × 224 pixels and normalized from 0 to 1.

Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. We adopt a hybrid split instead: the first 50,000 samples are used for 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. We can then evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we apply oversampling only to the training data, maintaining the original validation set. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling to both the training and validation sets. We also trained a model that oversamples the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may positively bias the results.

GitHub version: The .zip hosted here contains all files from the project, including the checkpoints and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

Folders Structure: In the root directory of the project, we have two folders:
magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. Note that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
Seq_Magnetogram: contains the references to the source images with the corresponding labels in the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.
M24/M48: both present the following sub-folders structure:
Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in root:
inst_packages.sh: installs the packages and dependencies needed to run the models.
download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.
The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to inspect header info and check the number of samples per label in the text files. All text files with the "Seq16" prefix inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files, in which each file contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of the trained models.
Naming pattern for the files:
magnetogram_jpg: follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16: follows the format "hmi.sharp_720s...to.", where:
is the date-time when the sequence ends, and follow the same format of . Reference text files in M24 and M48 or inside SF_MViT... folders follows the format "flare_Mclass_.txt", where:
is Seq16 if refers to a sequence, or void if refers direct to images.
"24h" or "48h".
is "TrainVal" or "Test". The refers to the split of Train/Val.
void or "_over" after the extension (...txt_over): means temporary input reference that was over-sampled by a training model. All SF_MViT...folders:
void or "oT" (over Train) or "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test);
"24h" or "48h";
"oneSplit" for a specific split or "allSplits" if run all splits.
void is default to run 1 GPU or "2gpu" to run into 2 gpus systems; Job submission files: "jobMViT_", where:
point the queue in Lovelace environment hosted on CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace) Temporary inputs: "Seq16_flare_Mclass_.txt:
train or val;
void or "_over" after the extension (...txt_over): means temporary input reference that was over-sampled by a training model. Outputs: "saida_MViT_Adam_10-7", where:
k0 to k4, means the correlated split of the output, or void if the output is from all splits. Error files: "err_MViT_Adam_10-7", where:
k0 to k4, means the correlated split of the error log file, or void if the error file is from all splits. Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where:
epoch number of the checkpoint;
corresponding valid loss;
0 to 4.
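For reference, loading the pre-trained MViTv2_S with Kinetics-400 weights (as download_MViTS.py does) can be sketched with torchvision as below; this is an illustration of the setup described above, not the exact code from the SF_MViT...py scripts, and the torchvision API shown assumes a reasonably recent version.

import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# Pre-trained MViTv2_S with Kinetics-400 weights
model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)

# Inputs as described above: batches of 10 samples, 16 sequenced 3-channel frames,
# resized to 224 x 224 and normalized to [0, 1]; video models expect (B, C, T, H, W)
dummy_batch = torch.rand(10, 3, 16, 224, 224)
out = model(dummy_batch)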
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation. The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data. The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
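A minimal example of loading the two files as recommended, with a validation set sampled randomly from the training data; the validation fraction is an arbitrary choice for illustration.

import pandas as pd

# Both files are semicolon-separated
train = pd.read_csv("training.csv", sep=";")
test = pd.read_csv("test.csv", sep=";")

# No dedicated validation file: draw a random subset from the training data
val = train.sample(frac=0.1, random_state=0)
train = train.drop(val.index)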
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals such as scatter plots and histograms and was checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development: The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
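A hedged sketch of the kind of transfer-learning setup mentioned above (a MobileNet backbone with a dropout-regularized head); the layer sizes, dropout rate, and two-class head are illustrative assumptions, not the project's actual configuration.

import torch.nn as nn
from torchvision import models

# Pre-trained MobileNetV2 backbone; freeze the feature extractor for transfer learning
backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
for p in backbone.features.parameters():
    p.requires_grad = False

# Replace the classifier head: dropout regularization + 2 outputs (healthy / diseased)
backbone.classifier = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(backbone.last_channel, 2),
)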
Model Training: During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges that were faced throughout the project along with the methods used to solve them.
Conclusion: A recap of the project's key learnings, highlighting the project's importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.
References: Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AeroSonic YPAD-0523: Labelled audio dataset for acoustic detection and classification of aircraft
Version 0.2 (June 2023)
Publication
If using this data in an academic work, please reference the DOI and version.
Description
AeroSonic:YPAD-0523 is a specialised dataset of ADS-B labelled audio clips for research in the fields of aircraft noise attribution and machine listening, particularly acoustic detection and classification of low-flying aircraft. Audio files in this dataset were recorded at locations in close proximity to a flight path approaching or departing Adelaide International Airport’s (ICAO code: YPAD) primary runway, 05/23. Recordings are initially labelled from radio (ADS-B) messages received from the aircraft overhead. Each recording is then human verified, and trimmed to the best (subjective) 20 seconds of audio in which the target aircraft is audible.
A total of 1,890 audio clips are balanced across two top-level classes, “Aircraft” (3.57 hours: 642 20-second recordings) and “Silence” (3.37 hours: 1,248 5 and 10-second recordings). The aircraft class is then further broken down into four unbalanced subclasses which broadly describe an aircraft's structure and propulsion mechanism. A variety of additional "airframe" features are provided to give researchers finer control of the dataset, and the opportunity to develop ontologies specific to their own use case.
For convenience, the dataset has been split into training (6.28 hours) and testing (0.66 hours) subsets, with the training set further split into 10 folds for cross-validation. Care has been taken to ensure the class distribution for each subset and fold does not significantly deviate from the overall distribution.
Researchers may find applications for this dataset in a number of fields; particularly aircraft noise isolation and monitoring in an urban environment, development of passive acoustic systems to assist radar technology, and understanding the sources of aircraft noise to help manufacturers design less-noisy aircraft.
Audio data
ADS-B (Automatic Dependent Surveillance–Broadcast) messages transmitted directly from aircraft are used to automatically capture and label audio recordings. A 60-second recording is triggered when an aircraft transmits a message indicating it is within a specified distance of the recording device. The file is labelled with a unique ICAO identifier code for the aircraft, as well as its last recorded altitude, date and time. The recording is then human verified and trimmed to 20 seconds - with the aircraft audible for the duration of the clip.
A balanced collection of urban background noise without aircraft (silence) is included with the dataset as a means of distinguishing location specific environmental noises from aircraft noises. 10-second background noise, or “silence” recordings are triggered only when there are no aircraft broadcasting that they are within a specified distance of the recording device. These "silence" recordings are also human verified to ensure no aircraft noise is present. The dataset contains 1,180 10-second clips, and 68 5-second clips of silence/ambient background noise.
Location information
Recordings have been collected from three (3) locations. GPS coordinates for each location are provided in the "locations.json" file. In order to protect privacy, coordinates have been provided for a road or public space nearby the recording device instead of its exact location.
Location: 0
Situated in a suburban environment approximately 15.5km north-east of the start/end of the runway. For Adelaide, typical south-westerly winds bring most arriving aircraft past this location on approach. Winds from the north or east will cause aircraft to take-off to the north-east, however not all departing aircraft will maintain a course to trigger a recording at this location. The "trigger distance" for this location is set for 3km to ensure small/slower aircraft and large/faster aircraft are captured within a sixty-second recording.
"Silence" or ambient background noises at this location include; cars, motorbikes, light-trucks, garbage trucks, power-tools, lawn mowers, construction sounds, sirens, people talking, dogs barking and a wide range of Australian native birds (New Holland Honeyeaters, Wattlebirds, Australian Magpies, Australian Ravens, Spotted Doves, Rainbow Lorikeets and others).
Location: 1
Situated approximately 500m south-east of the south-eastern end of the runway, this location is nearby recreational areas (golf course, skate park and parklands) with a busy road/highway inbetween the location and runway. This location features heavy winds and road traffic, as well as people talking, walking and riding, and also birds such as the Australian Magpie and Noisy Miner. The trigger distance for this location is set to 1km. Due to their low altitude aircraft are louder, but audible for a shorter time compared to "Location 0".
Location: 2
As an alternative to "Location 1", this location is situated approximately 950m south-east of the end of the runway. This location has a wastewater facility to the north, a residential area to the south and a popular beach to the west. This location offers greater wind protection and further distance from airport and highway noises. Ambient background sounds feature close proximity cars and motorbikes, cyclists, people walking, nail guns and other construction sounds, as well as the local birds mentioned above.
Aircraft metadata
Supplementary "airframe" metadata for all aircraft has been gathered to help broaden the research possibilities from this dataset. Airframe information was collected and cross-checked from a number of open-source databases. The author has no reason to beleive any significant errors exist in the "aircraft_meta" files, however future versions of this dataset plan to obtain aircraft information directly from ICAO (International Civil Aviation Organization) to ensure a single, verifiable source of information.
Class/subclass ontology (minutes of recordings)
0. no aircraft (202)
   0: no aircraft (202)
1. aircraft (214)
   1: piston-propeller aeroplane (12)
   2: turbine-propeller aeroplane (37)
   3: turbine-fan aeroplane (163)
   4: rotorcraft (1.6)
The subclasses are a combination of the "airframe" and "engtype" features. Piston and Turboshaft rotorcraft/helicopters have been combined into a single subclass due to the small number of samples.
Data splits
Audio recordings have been split into training (90.5%) and test (9.5%) sets. The training set has further been split into 10 folds, giving researchers a common split to perform 10-fold cross-validation, ensuring reproducibility and comparative results. Data leakage into the test set has been avoided by ensuring recordings are disjoint from the training set by time and location, meaning samples in the test set for a particular location were recorded after any samples included in the training set for that location.
Labelled data
The entire dataset (training and test) is referenced and labelled in the "sample_meta.csv" file. Each row contains a reference to a unique recording and all the labels and features associated with that recording and aircraft.
Alternatively, these labels can be derived directly from the filename of the sample (see below), plus a JSON file which accompanies each aircraft sample. The "aircraft_meta.csv" and "aircraft_meta.json" files can be used to reference aircraft specific features - such as; manufacturer, engine type, ICAO type designator etc. (see below for all 14 airframe features).
File naming convention
Audio samples are in WAV format, and metadata for aircraft recordings are stored in JSON files. Both files share the same name, only differing by their file extension.
Basic Convention
“Aircraft ID + Date + Time + Location ID + Microphone ID”
“XXXXXX_YYYY-MM-DD_hh-mm-ss_X_X”
Sample with aircraft
{hex_id} _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}
7C7CD0_2023-05-09_12-42-55_2_1.wav
7C7CD0_2023-05-09_12-42-55_2_1.json
Sample without aircraft
“Silence” files are denoted with six (6) leading zeros rather than an aircraft hex code. All relevant metadata for “silence” samples are contained in the audio filename, and again in the accompanying “sample_meta.csv”
000000 _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}
000000_2023-05-09_12-30-55_2_1.wav
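A small parsing sketch based on the naming convention above; the field order is taken from the convention, and the example path is one of the files shown.

from pathlib import Path

def parse_sample_name(path):
    # '<hex_id>_<date>_<time>_<location_id>_<microphone_id>.wav'; hex_id 000000 marks a silence clip
    hex_id, date, time, location_id, microphone_id = Path(path).stem.split("_")
    return {"hex_id": hex_id, "date": date, "time": time,
            "location_id": int(location_id), "microphone_id": int(microphone_id),
            "is_silence": hex_id == "000000"}

parse_sample_name("7C7CD0_2023-05-09_12-42-55_2_1.wav")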
Columns/Labels
(found in sample_meta.csv, aircraft_meta.csv/json and aircraft recording JSON files)
train-test: Train-test split (train, test)
fold: Digit from 0 to 9 splitting the training subset 10 ways (else test)
filename: The filename of the audio recording
date: Date of the recording
time: Time of the recording
duration: Length of the recording (in seconds)
location_id: ID for the location of the recording
microphone_id: ID of the microphone used
hex_id: Unique ICAO 24-bit address for the aircraft
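A sketch of using sample_meta.csv for the 10-fold cross-validation described above; the column names are those listed, and everything else (file location, training loop) is left to the user.

import pandas as pd

meta = pd.read_csv("sample_meta.csv")
train_meta = meta[meta["train-test"] == "train"]
test_meta = meta[meta["train-test"] == "test"]

for fold in range(10):
    val_fold = train_meta[train_meta["fold"] == fold]
    trn_folds = train_meta[train_meta["fold"] != fold]
    # ... train on trn_folds, validate on val_fold, finally evaluate on test_meta ...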
README.txt
Title: Identifying Machine-Paraphrased Plagiarism
Authors: Jan Philip Wahle, Terry Ruas, Tomas Foltynek, Norman Meuschke, and Bela Gipp
contact email: wahle@gipplab.org; ruas@gipplab.org;
Venue: iConference
Year: 2022
================================================================
Dataset Description: Training:
200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API). Testing:
SpinBot:
arXiv - Original - 20,966; Spun - 20,867
Theses - Original - 5,226; Spun - 3,463
Wikipedia - Original - 39,241; Spun - 40,729
SpinnerChief-4W:
arXiv - Original - 20,966; Spun - 21,671
Theses - Original - 2,379; Spun - 2,941
Wikipedia - Original - 39,241; Spun - 39,618
SpinnerChief-2W:
arXiv - Original - 20,966; Spun - 21,719
Theses - Original - 2,379; Spun - 2,941
Wikipedia - Original - 39,241; Spun - 39,697
================================================================
Dataset Structure: [human_evaluation] folder: human evaluation to identify human-generated text and machine-paraphrased text. It contains the files (original and spun) as for the answer-key for the survey performed with human subjects (all data is anonymous for privacy reasons). NNNNN.txt - whole document from which an extract was taken for human evaluation
key.txt.zip - information about each case (ORIG/SPUN)
results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
results-corrected.xlsx - at the very beginning, there was a mistake in one question (wrong extract). These results were excluded.
[automated_evaluation]: contains all files used for the automated evaluation considering SpinBot and SpinnerChief. Each paraphrase tool folder contains [corpus] and [vectors] sub-folders. For [spinnerchief], two variations are included, with a 4-word-changing ratio (default) and a 2-word-changing ratio. The [vectors] sub-folder contains the average of all word vectors for each paragraph. Each line has the number of dimensions of the word embedding technique used (see paper for more details) followed by its respective class (i.e., label mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The extension is .arff, but the files can be read as normal .txt files. The word embedding technique used is described in the file name with the following structure:
License: https://lindat.mff.cuni.cz/repository/xmlui/page/szn-dataset-licence
The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022. The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: based on a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.
Each document consists of the following: - a .mp4 video - a single image (cover picture) - the article's text - the article's summary - the article's title - the article's publication date
All of the videos are re-sampled to 25 fps and resized to the same resolution of 1280x720p. The maximum length of the video is 5 minutes, and the shortest one is 7 seconds. The average video duration is 86 seconds. The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.
| | mean | Q1 | Median | Q3 |
| --- | --- | --- | --- | --- |
| Title | 11.16 ± 2.78 | 9 | 11 | 13 |
| Abstract | 33.40 ± 13.86 | 22 | 32 | 43 |
| Article | 276.96 ± 191.74 | 154 | 231 | 343 |
The proposed training/dev/test split follows the chronological ordering based on publication date. We use the articles published in the first half (Jan-Jun) of 2021 for validation (2,482 instances) and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 for testing (2,652 instances). The remaining data is used for training (36,109 instances).
The textual data is shared as a single .tsv file. The visual data (video+image) is shared as a single archive for validation and test splits, and the one from the training split is partitioned based on the publication date.
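A minimal sketch of reproducing the chronological split from the .tsv file; the file name and the publication-date column name are hypothetical placeholders.

import pandas as pd

df = pd.read_csv("mlask.tsv", sep="\t")           # the textual data is shared as a single .tsv file
dates = pd.to_datetime(df["date"])                # hypothetical name for the publication-date column

val = df[(dates >= "2021-01-01") & (dates <= "2021-06-30")]   # Jan-Jun 2021: validation
test = df[dates >= "2021-07-01"]                               # Jul 2021 - Feb 2022: test
train = df[dates < "2021-01-01"]                               # everything earlier: training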
The dataset contains table-question pairs and the respective answers. The questions require multi-step reasoning and various data operations such as comparison, aggregation, and arithmetic computation. The tables were randomly selected among Wikipedia tables with at least 8 rows and 5 columns.
(As per the documentation usage notes)
Dev: Mean accuracy over three (not five) splits of the training data. In other words, train on 'split-{1,2,3}-train' and test on 'split-{1,2,3}-dev', respectively, then average the accuracy.
Test: Train on 'train' and test on 'test'.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki_table_questions', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships of the heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard ACs to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate superior standard AC performance to other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules in spin-state splitting in comparison to 15–20× higher errors for feature sets that encode whole-molecule structural information. Systematic feature selection methods including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4–5× smaller than the full RAC set produce sub- to 1 kcal/mol spin-splitting MUEs, with good transferability to metal–ligand bond length prediction (0.004–5 Å MUE) and redox potential on a smaller data set (0.2–0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin-splitting and distal, steric effects in redox potential and bond lengths.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance metrics for hyperparameter tuning of the diagnostic module. Results are reported as mean ± standard deviation over three runs. “*” indicates experiments repeated with the same data split, while “**” indicates runs with three randomized train/test splits.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." These articles have been collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. On average, articles vary in length, providing a rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification. It often contains thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Additionally, some versions of the dataset may also include image URLs for multimodal analysis, expanding the detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.
Here is a description for each column header of the fake news dataset:
id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.
headline: The title or headline of the news article, summarizing the key news story in brief.
written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.
news: The full text content of the news article, which is the main body used for analysis and classification.
label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.
This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.
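For illustration, the columns above could feed a simple baseline text classifier such as the sketch below; the file name and model choice are assumptions, not the project's actual pipeline.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("fake_news.csv")   # hypothetical file name; columns as described above
X_train, X_test, y_train, y_test = train_test_split(
    df["news"].fillna(""), df["label"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(max_features=50000)
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_train), y_train)
print("held-out accuracy:", clf.score(vec.transform(X_test), y_test))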
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
I. IDENTIFYING INFORMATION
Title* SweWiC v2.0
Subtitle A Swedish Word-in-Context dataset
Created by* Gerlof Bouma (gerlof.bouma@gu.se)
Publisher(s)* Språkbanken Text
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/swewic
License(s)* CC BY 4.0 for development and test data
CC BY-SA 4.0 for the full dataset consisting of training, development and test data.
Abstract* The Swedish Word-in-Context dataset provides a benchmark for evaluating distributional models of word meaning, in particular context-sensitive/dynamic models. Constructed following the principles of the (English) Word-in-Context dataset, the SweWiC test data consists of 1000 sentence pairs, where each sentence in a pair contains an occurrence of a potentially ambiguous focus word specific to that pair. The question posed to the tested system is whether these two occurrences represent instances of the same word sense. Starting from version 2.0, SweWiC also contains training and development data, suitable for, for instance, fine-tuning a pre-trained language model to the WiC task.
Funded by* Vinnova (grants no. 2020-02523, 2021-04165)
Cite as
Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim)
II. USAGE
Key applications Evaluation of (preferably dynamic) representations of word meaning, and finetuning such models.
Intended task(s)/usage(s) For each test pair, predict if the uses of the focus word in two different contexts constitute the same sense.
Recommended evaluation measures Accuracy
Dataset function(s) Training, development and testing.
Recommended split(s) Data comes organized in train, dev and test portions.
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* The test data consists of 1000 items, 500 positive, 500 negative. Constructed from 558 focus word types: 263 types occurring in one test item, 148 in two, and 147 in three. The focus words are of the following parts of speech according to SALDO: 462 nouns, 353 verbs, 143 adjectives, 32 adverbs, 9 prepositions, and 1 interjection.
The development data consist of 500 items, 250 positive, 250 negative. Constructed from 274 focus word types: 138 types occurring in one item, 46 in two, and 90 in three. The focus words are of the following parts of speech according to SALDO: 207 nouns, 176 verbs, 88 adjectives, 23 adverbs, 4 prepositions, and 2 subordinating conjunctions.
The training data consist of 4486 items, 2243 positive, 2243 negative. Constructed from 911 focus word types: 363 types occurring in one item, 161 in two, 119 in three, 41 in four, 43 in five, and 184 types occurring in more than five items. The focus words are of the following parts of speech according to the Swedish Wiktionary: 1815 verbs, 1705 nouns, 690 adjectives, 185 adverbs, 78 prepositions, 7 conjunctions, and 6 interjections.
Nature of the content* Pairs of sentences with one highlighted word form in each sentence, such that these highlighted forms are linked to the same base form (but not necessarily in the same paradigm!). These pairs are accompanied with an indication of whether these forms in these contexts have the same sense (meaning) or not. For the development and test data, the lexical resource SALDO is used to supply the senses and sense distinctions, for the training data, Wiktionary.
Format* JSON Lines, with 1 test item per line. Each item is given as a pair first word in context-second word in context (attributes 'first' and 'second'). The 'label' attribute states whether the same sense is used or not. A word in context is given as a string for the whole context ('context'), and an object ('word' attribute) that specifies the focus word using a string ('text') and string indices ('start', 'stop'). Index ranges are 0-based and half-open, and refer to the NFKC-normalized unicode string. Metadata included for each item is intended for analysis, and not for use by the sense disambiguation system.
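For illustration, the format described above can be read as in the following sketch; the file name is a placeholder, and only attributes named above are accessed.

import json

with open("swewic_test.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        for side in ("first", "second"):
            word = item[side]["word"]
            context = item[side]["context"]
            # 'start'/'stop' are 0-based, half-open indices into the NFKC-normalized context
            assert context[word["start"]:word["stop"]] == word["text"]
        same_sense = item["label"]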
Data source(s)* SALDO v2.3 (CC BY 4.0, https://spraakbanken.gu.se/en/resources/saldo, see also [1]) is used to provide the sense inventory for the test data.
SALDO’s morphology (CC BY 4.0, https://spraakbanken.gu.se/en/resources/saldom) is used to ensure that the word forms in the sentences are possible word forms for the sense(s) involved in the test item.
SALDO examples (CC BY 4.0, https://spraakbanken.gu.se/en/resources/saldoe) and Eukalyptus v0.2.0 (mixed CC licenses, inclusion of sentences in SweWiC under CC BY 4.0 with permission, https://spraakbanken.gu.se/en/resources/eukalyptus; see [2] for the sense annotation in this corpus) are used as sources of sense annotated words in context in the development and test data.
Swedish Wiktionary (CC BY-SA 3.0, https://sv.wiktionary.gu.se/, Enterprise HTML dump of 20220901) is used for the examples in the training data.
Data collection method(s)* (See data selection and filtering.)
Data selection and filtering* In the spirit of the design principles given in [3], the development and test items adhere to the following restrictions:
all focus words are potentially ambiguous according to SALDO, even in the same-sense test items,
none of the involved meanings is directly at SALDO’s top level (this avoids abstract ”semantic primitives”),
a focus word type occurs at most in three items in the test set, no combination of a focus word and a context is repeated,
the instances in both contexts in a test item are of the same part of speech (SALDO in principle allows for semantic base forms that cross part of speech), and
SALDO’s morphology lists the word forms used in the contexts as possible realizations of the involved senses.
The training items are not subject to restriction 3 listed above (cf. [3]). Restriction 4 in the training data is based upon the parts of speech supplied by Swedish Wiktionary. Finally, constraints 1, 2 and 4 are implemented in the test data by linking the word forms in the test items to SALDO meanings through SALDO’s morphology’s fullform lexicon. Therefore, only Wiktionary items are included whose full forms appear in SALDO.
Data preprocessing* None.
Data labeling* Judgements about word senses are taken from resources with manual annotation of word senses, and therefore constitute gold-standard data where the development data is concerned.
Annotator characteristics (No additional annotation, that is, beyond the annotation done in the projects creating the data sources, was done in the compilation of SweWiC.)
IV. ETHICS AND CAVEATS
Ethical considerations None to report.
Things to watch out for Although care has been taken to only include training data from Wiktionary for items that SALDO also thinks are ambiguous, it is not necessarily the case that Wiktionary and SALDO make the same distinctions. Some deviation between the two is therefore to be expected.
V. ABOUT DOCUMENTATION
Data last updated* 20221007, v2.0
Which changes have been made, compared to the previous version* A small number of items (about 10) were replaced in the test data. Development data and training data were added.
Access to previous versions Earlier versions available from website.
This document created* 20210615 Gerlof Bouma (gerlof.bouma@gu.se)
This document last updated* 20230208 Gerlof Bouma (gerlof.bouma@gu.se)
Documentation template version* v1.1
VI. OTHER
Related projects The task and the design principles of the dataset were taken from / heavily inspired by the original (English) Word-in-Context benchmark described in [3]. See also the companion website https://pilehvar.github.io/wic/.
A description of a collection of WiCs for 12 languages (but not Swedish) is given in [4]. See also https://pilehvar.github.io/xlwic/.
References [1] Borin, Forsberg and Lönngren (2013): SALDO: a touch of yin to WordNet's yang. Language resources and evaluation 47(4), pp1191-1211. https://doi.org/10.1007/s10579-013-9233-4
[2] Johansson, Adesam, Bouma and Hedberg (2016): A Multi-domain Corpus of Swedish Word Sense Annotation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp3019–3022. https://www.aclweb.org/anthology/L16-1482.pdf
[3] Pilehvar and Camacho-Collados (2019): WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). http://dx.doi.org/10.18653/v1/N19-1128
[4] Raganato, Pasini, Camacho-Collados and Pilehvar (2020): XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). http://dx.doi.org/10.18653/v1/2020.emnlp-main.584
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning dataset from 2000-2019, used to train UNET neural networks. It contains training data processed to a CONUS-like domain (10.5 to 59.5 latitude and -159.5 to -60.5 longitude) on half-degree resolution, built from 11 ensemble members of 6-hourly GEFS data, together with vapor pressure deficit (VPD) labels created on the same domain from ERA5.
Training data are from years: 2000, 2001, 2003-2006, 2009-2012, 2016, 2017, and 2019
Validation data are from years: 2002, 2008, 2014, and 2018
Blind testing data are from years 2007, 2013, and 2015
The input data are created from week 4 forecasting data produced by the GEFS initialized on the first Wednesdays of the year. Input data included in this dataset are:
Finally, the files are normalized by z-score normalization by pressure height (or surface) and variable. They are then saved into npy matrices sized [99, 199, 6], in the above order, for NN training purposes. The VPD labels are the corresponding weekly mean VPD per gridcell derived from ERA5 data, and are stored in npy files sized [99, 199, 1] for NN label purposes, intended to represent the "observed" VPD for the corresponding week-three forecast. They have an identical name to the input files but are stored in the label directory. The zip files contain subdirectories storing the npy files, with identical data-label names following this pattern:
naming example - nn_dataset_YYYY_week_WW_ens_E_f_3.npy
where YYYY = year (2019)
where WW = week (1 through up to 48)
where E = GEFS ensemble number (0-10)
where f_3 means forecast week three (0-4 included in initial GEFS dataset)
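A sketch of loading one input-label pair following the naming pattern and array shapes described above; the directory names and the exact formatting of the week number are assumptions.

import numpy as np

name = "nn_dataset_2019_week_1_ens_0_f_3.npy"   # follows the naming example above

x = np.load(f"data/{name}")     # z-score-normalized GEFS inputs, shape (99, 199, 6)
y = np.load(f"label/{name}")    # weekly-mean ERA5 VPD label, shape (99, 199, 1)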
Directories are named to divide npy files into:
Lastly, an additional file called "norm_inference_vars" is included and contains the standard deviations and means of the validation and testing input variable datasets.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Determining accurate drug dissolution processes in the gastrointestinal tract is critical in drug discovery as dissolution profiles provide essential information for estimating the bioavailability of orally administered drugs. While various methods have been developed to predict drug solubility based on chemical structures, no reliable tools currently exist for predicting the dissolution rate constant. This study presents a novel two-stage machine learning approach, termed Machine Learning based Quantitative Structure–Dissolution Profile Relationship, which integrates physics-informed neural networks (PINNs) and deep neural networks (DNNs) to predict drug dissolution profiles in water with varying concentrations of the surfactant Sodium Lauryl Sulfate. In the first stage, PINNs extract key dissolution parameters, namely the dissolution rate constant (k) and the dissolved mass fraction at saturation (ϕs), from existing dissolution data. By leveraging a physical law governing the dissolution process, PINNs aim to enhance prediction performance and reduce data requirements. Assuming first-order kinetics of the drug dissolution process as described by the Noyes–Whitney equation, PINNs, with 8 hidden layers and 40 neurons per layer, may outperform traditional nonlinear regression by effectively filtering noise and focusing on physically meaningful data. In the second stage, these extracted parameters (k and ϕs) are used to train a DNN to predict dissolution profiles based on the drug's chemical structure and dissolution medium. Using the FDA-recommended metrics, the difference and similarity factors (f1 and f2), the DNN, with 128 neurons in two hidden layers and a learning rate of 10^(-2.8), achieved an average testing accuracy of 61.7% at an 80:20 train-to-test split. Although this current accuracy is below the generally acceptable range of 70–80%, this approach shows significant potential as a low-cost, time-efficient tool for early phase drug formulation. Future improvements are expected as data quality and diversity increase.
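For reference, the first-order kinetics assumed above can be written in terms of the dissolved mass fraction ϕ(t) roughly as follows; this is the standard Noyes–Whitney-type form with the two parameters named in the abstract (k and ϕs), not an equation copied from the paper.

\frac{d\phi}{dt} = k\,\bigl(\phi_s - \phi(t)\bigr),
\qquad
\phi(t) = \phi_s\,\bigl(1 - e^{-kt}\bigr) \quad \text{for } \phi(0) = 0.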
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensemble at the moment. However, the DNA sequence by itself is not enough, the specific TSS position of each transcript is needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.
First, we need an interface in order to download the raw data, which is composed by every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks will form our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, 700 bases before this codon and 300 bases after it are concatenated, getting the final sequence of 1003 nucleotides that is going to be used in our models. These specific window values have been used in (Bhandari et al., 2021) and we have kept them as we find it interesting for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. In order to get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected the specific position, we get 700 bases ahead and 300 bases after it as we did with the positive instances.
Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
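A simplified sketch of the window extraction described above (700 bases upstream, the 3-base codon starting at the TSS, and 300 bases downstream, i.e. 1003 nt); coordinate handling is an assumption, and strand orientation and edge cases are ignored.

def extract_window(sequence, tss_index, upstream=700, downstream=300):
    # Returns the 1003-nt window around a 0-based TSS position, or None near sequence edges
    start = tss_index - upstream
    stop = tss_index + 3 + downstream   # the TSS base plus the two following bases, then 300 nt
    if start < 0 or stop > len(sequence):
        return None
    return sequence[start:stop]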
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When you use this dataset, please cite the paper below. More information about this dataset can also be found in that paper.
Xu, X., Wang, B., Xiao, B., Niu, Y., Wang, Y., Wu, X., & Chen, J. (2024). Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals. arXiv preprint arXiv:2405.17024.
The present work aims to demonstrate that temporal autocorrelations (TA) significantly impact various BCI tasks, even in conditions without neural activity. We used watermelons as phantom heads and found that we could reproduce the pitfall of overestimated decoding performance if continuous EEG data with the same class label were split into training and test sets. More details can be found in Motivation.
As watermelons cannot perform any experimental tasks, we can reorganize the data into the format of various actual EEG datasets without the need to collect new EEG data, as previous work did (examples in Domain Studied).
Manufacturers: NeuroScan SynAmps2 system (Compumedics Limited, Victoria, Australia)
Configuration: 64-channel Ag/AgCl electrode cap with a 10/20 layout
Watermelons. Ten watermelons served as phantom heads.
Overestimated Decoding Performance in EEG decoding.
The following BCI datasets across various BCI tasks have been reorganized using the Phantom EEG Dataset. The pitfall has been found in four of the five tasks.
- CVPR dataset [1] for image decoding task.
- DEAP dataset [2] for emotion recognition task.
- KUL dataset [3] for auditory spatial attention decoding task.
- BCIIV2a dataset [4] for motor imagery task (the pitfalls were absent due to the use of rapid-design paradigm during EEG recording).
- SIENA dataset [5] for epilepsy detection task.
Resting state, but you could reorganize it to any task in BCI.
The Phantom EEG Dataset
Creative Commons Attribution 4.0 International
You can find the code to read the data files (.cnt or .set) in the “code” folder.
To run the code, you should install the mne and numpy packages. You can install them via pip:
pip install mne==1.3.1
pip install numpy
Then, you can use “BID2WMCVPR.py” to convert the BIDS dataset to the WM-CVPR dataset, or “CNTK2WMCVPR.py” to convert the CNT dataset to the WM-CVPR dataset.
The codes to reorganize other datasets other than CVPR [1] will be released on github after reviewing.
- CNT: the raw data.
Each Subject (S*.cnt) contains the following information:
EEG.data: EEG data (samples X channels)
EEG.srate: Sampling frequency of the saved data
EEG.chanlocs : channel numbers (1 to 68, ‘EKG’ ‘EMG’ 'VEO' 'HEO' were not recorded)
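A minimal example of reading one of the raw .cnt recordings with MNE; the subject file name is hypothetical.

import mne

raw = mne.io.read_raw_cnt("S1.cnt", preload=True)   # hypothetical subject file
data = raw.get_data()            # array of shape (n_channels, n_samples)
sfreq = raw.info["sfreq"]        # 1000 Hz according to the recording description above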
- BIDS: an extension to the brain imaging data structure for electroencephalography. BIDS primarily addresses the heterogeneity of data organization by following the FAIR principles [6].
Each Subject (sub-S*/eeg/) contains the following information:
sub-S*_task-RestingState_channels.tsv: channel numbers (1 to 68, ‘EKG’ ‘EMG’ 'VEO' 'HEO' were not recorded)
sub-S*_task-RestingState_eeg.json: Some information about the dataset.
sub-S*_task-RestingState_eeg.set: EEG data (samples X channels)
sub-S*_task-RestingState_events.tsv: the event during recording. We organized events using block-design and rapid-event-design. However, it is important to note that this does not need to be considered in any subsequent data reorganization, as watermelons cannot follow any experimental instructions.
- code: the scripts described in the Code section above.
- readme.md: general information about the dataset.
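A short sketch of inspecting the BIDS sidecar files is given below; the subject identifier and folder layout follow the patterns listed above, and the exact subject name is a hypothetical example.

```python
import pandas as pd

# Hypothetical subject folder; replace with an actual sub-S* directory.
subj_dir = "BIDS/sub-S1/eeg"

channels = pd.read_csv(f"{subj_dir}/sub-S1_task-RestingState_channels.tsv", sep="\t")
events = pd.read_csv(f"{subj_dir}/sub-S1_task-RestingState_events.tsv", sep="\t")

print(channels.head())  # channel listing (EKG, EMG, VEO, HEO were not recorded)
print(events.head())    # block-design / rapid-event-design event markers
```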
An additional electrode was placed on the lower part of the watermelon as the physiological reference, and the forehead served as the ground site. The inter-electrode impedances were maintained under 20 kOhm. Data were recorded at a sampling rate of 1000 Hz. EEG recordings for each watermelon lasted for more than 1 hour to ensure sufficient data for the decoding task.
Citation will be updated after the review period is completed.
We will provide more information about this dataset (e.g. the units of the captured data) once our work is accepted, because the work is currently under review and we are not permitted to disclose further details at this stage.
All metadata will be provided as a backup on GitHub and will be available after the review period is completed.
Researchers have reported high decoding accuracy (>95%) using non-invasive Electroencephalogram (EEG) signals for brain-computer interface (BCI) decoding tasks like image decoding, emotion recognition, auditory spatial attention detection, epilepsy detection, etc. Since these EEG data were usually collected with well-designed paradigms in labs, the reliability and robustness of the corresponding decoding methods were doubted by some researchers, and they proposed that such decoding accuracy was overestimated due to the inherent temporal autocorrelations (TA) of EEG signals [7]–[9].
However, the coupling between the stimulus-driven neural responses and the EEG temporal autocorrelations makes it difficult to confirm whether this overestimation exists in truth. Some researchers also argue that the effect of TA in EEG data on decoding is negligible and that it becomes a significant problem only under specific experimental designs in which subjects do not have enough resting time [10], [11].
Due to a lack of problem formulation, previous studies [7]–[9] only proposed that block-design should not be used, in order to avoid the pitfall. However, the impact of TA can be avoided in this way only when each EEG trial is not further segmented into several samples; otherwise, the overfitting or pitfall still occurs. In contrast, when a correct data splitting strategy is used (e.g. separating training and test data in time), the pitfall can be avoided even when block-design is used.
In our framework, we proposed the concept of a "domain" to represent the EEG patterns resulting from TA and then used phantom EEG to remove stimulus-driven neural responses for verification. The results confirmed that TA, which is always present in EEG data, adds unique domain features to a continuous segment of EEG. Specifically, when a segment of EEG data with the same class label is split into multiple samples, the classifier associates the sample's class label with the domain features, interfering with the learning of class-related features. This leads to an overestimation of decoding performance for test samples from domains seen during training, and results in poor accuracy for test samples from unseen domains (as in real-world applications).
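To illustrate the splitting issue described above, the sketch below contrasts a naive sample-level random split with a split that keeps all samples from the same continuous segment (domain) on one side; the data here are synthetic placeholders, not part of this dataset.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 continuous segments (domains), each cut into 50 samples.
n_segments, samples_per_segment, n_features = 20, 50, 64
X = rng.normal(size=(n_segments * samples_per_segment, n_features))
y = np.repeat(rng.integers(0, 2, size=n_segments), samples_per_segment)  # one class label per segment
groups = np.repeat(np.arange(n_segments), samples_per_segment)           # domain id per sample

# Naive split: samples from the same segment can land in both train and test,
# so domain features leak and decoding accuracy is overestimated.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Domain-aware split: every segment is assigned entirely to train or to test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
```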
Importantly, our work suggests that the key to reducing the impact of EEG TA on BCI decoding is to decouple class-related features from domain features in the actual EEG dataset. Our proposed unified framework serves as a reminder to BCI researchers of the impact of TA on their specific BCI tasks and is intended to guide them in selecting the appropriate experimental design, splitting strategy and model construction.
We must point out that the "phantom EEG" indeed does not contain any "EEG" but records only noise: a watermelon is not a brain and does not generate any electrical signals. Therefore, the recorded electrical noise, even when amplified using equipment typically used for EEG, does not constitute EEG data under the definition of EEG. This is why previous researchers called it "phantom EEG". Some researchers may therefore question the use of a watermelon to obtain phantom EEG.
However, the usage of the phantom head allows researchers to evaluate the performance of neural-recording equipment and proposed algorithms without the effects of neural activity variability, artifacts, and potential ethical issues. Phantom heads used in previous studies include digital models [12]–[14], real human skulls [15]–[17], artificial physical phantoms [18]–[24] and watermelons [25]–[40]. Due to their similar conductivity to human tissue, similar size and shape to the human head, and ease of acquisition, watermelons are widely used as "phantom heads".
Most previous works used watermelons as phantom heads and found that the results obtained from the neural signals of human subjects could not be reproduced with the phantom head, thus proving that the observed effects were indeed caused by neural signals. For example, Mutanen et al. [35] proposed that “the fact that the phantom head stimulation did not evoke similar biphasic artifacts excludes the possibility that residual induced artifacts, with the current TMS-compatible EEG system, could explain these components”.
Our work differs significantly from most previous works. It is the first to show that phantom EEG exhibits the effect of TA on BCI decoding even when only noise is recorded, indicating that TA is inherent in EEG data. The conclusion we hope to draw is that some current works may not truly use stimulus-driven neural responses, but may instead exploit the domain features introduced by TA.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Reliable prediction of two fundamental human pharmacokinetic (PK) parameters, systemic clearance (CL) and apparent volume of distribution (Vd), determines the size and frequency of drug dosing and is at the heart of drug discovery and development. Traditionally, estimated CL and Vd are derived from preclinical in vitro and in vivo absorption, distribution, metabolism, and excretion (ADME) measurements. In this paper, we report quantitative structure–activity relationship (QSAR) models for prediction of systemic CL and steady-state Vd (Vdss) from intravenous (iv) dosing in humans. These QSAR models avoid uncertainty associated with preclinical-to-clinical extrapolation and require two-dimensional structure drawing as the sole input. The clean, uniform training sets for these models were derived from the compilation published by Obach et al. (Drug Metab. Disp. 2008, 36, 1385–1405). Models for CL and Vdss were developed using both a support vector regression (SVR) method and a multiple linear regression (MLR) method. The SVR models employ a minimum of 2048-bit fingerprints developed in-house as structure quantifiers. The MLR models, on the other hand, are based on information-rich electro-topological states of two-atom fragments as descriptors and afford reverse QSAR (RQSAR) analysis to help model-guided, in silico modulation of structures for desired CL and Vdss. The capability of the models to predict iv CL and Vdss with acceptable accuracy was established by randomly splitting data into training and test sets. On average, for both CL and Vdss, 75% of test compounds were predicted within 2.5-fold of the value observed and 90% of test compounds were within 5.0-fold of the value observed. The performance of the final models developed from 525 compounds for CL and 569 compounds for Vdss was evaluated on an external set of 56 compounds. The predictions were either better than or comparable to those of other in silico models reported in the literature. To demonstrate the practical application of the RQSAR approach, the structure of vildagliptin, a high-CL and a high-Vdss compound, is modified based on the atomic contributions to its predicted CL and Vdss to propose compounds with lower CL and lower Vdss.
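As a rough illustration of this kind of workflow (not the authors' in-house fingerprints or final model), the sketch below builds 2048-bit Morgan fingerprints with RDKit as a stand-in structure quantifier, fits a support vector regression on log-transformed clearance, and reports the fraction of test compounds predicted within 2.5-fold and 5.0-fold; the input file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Hypothetical input: SMILES strings with observed human iv clearance.
df = pd.read_csv("human_iv_pk.csv")  # columns: smiles, cl_obs

def fingerprint(smiles, n_bits=2048):
    # 2048-bit Morgan fingerprint as a stand-in for the in-house fingerprints.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(list(fp), dtype=np.uint8)

X = np.vstack([fingerprint(s) for s in df["smiles"]])
y = np.log10(df["cl_obs"].to_numpy())  # model CL on a log scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)

pred = 10 ** model.predict(X_te)
obs = 10 ** y_te
fold_error = np.maximum(pred / obs, obs / pred)
print("within 2.5-fold:", np.mean(fold_error <= 2.5))
print("within 5.0-fold:", np.mean(fold_error <= 5.0))
```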
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source data of the North Sea well inventory: United Kingdom (UK) - Oil and Gas Authority (Dec. 2018) - https://data-ogauthority.opendata.arcgis.com/datasets/oga-wells-ed50. Contains information provided by the OGA. Wells are extracted for the area of the PGS data set PGS Mega Survey Plus.

For all wells of the test group (n = 43) and all wells within the seismic data set (n = 1,792; presented here), we measured the distance to their closest bright spot with polarity reversal. Furthermore, we calculated the mean RMS amplitudes and the RMS amplitude standard deviation within a buffer radius of 300 m around the well paths for all wells inside the seismic data set and for the visited wells, as 300 m is the distance below which all visited wells of the test group showed gas release in the form of flares from the seafloor.

We test whether the propensity of a well to leak can be identified using a logistic regression that includes regressors such as well activity data and/or derived parameters such as mean RMS amplitude, mean RMS amplitude standard deviation, the distance to the most proximal bright spot with polarity reversal, and age (spud date). To identify the most suitable regressor combination, best subset selection is employed. The main selection criterion is the prediction accuracy obtained by randomly and repeatedly splitting the visited wells into a training and a test set and then using the fitted logistic regression to predict the test data. The most suitable subset turns out to employ only the distance to the polarity reversal, producing a prediction accuracy of 89% and the logistic regression results below.

In order to obtain confidence intervals using the normal distribution, the distance to the bright spot with polarity reversal has to be normally distributed, which it is not. It can, however, be transformed to normality by adding 100 meters to the original distance and then taking the natural logarithm.

Logistic regression fit for leakage of all visited wells using distance to bright spot with polarity reversal in meters as a regressor (see the supplementary material for further information on the applied statistical analyses):

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Significance |
| --- | --- | --- | --- | --- | --- |
| Intercept | 4,853.946 | 1,735.128 | 2.797 | 0.00515 | 0.01 |
| Distance | −0.007361 | 0.002700 | −2.726 | 0.00640 | 0.01 |

The transformed logistic regression model is then used to predict the probabilities of leakage for the wells within our seismic data set in the Central North Sea (the data presented here). To obtain confidence bands, this logistic regression is performed subtracting and adding two standard deviations from the calculated probability. The point estimate predicts leakage from 926 of the 1,792 wells, with the 95% confidence interval ranging from 719 to 1,058.
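A minimal sketch of this kind of analysis is given below; the file name and column names (distance_m, leaking) are hypothetical, and only the ln(distance + 100 m) transform and the repeated random train/test splitting follow the description above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Hypothetical table of visited wells: distance to the closest bright spot with
# polarity reversal (metres) and a binary leakage flag (0/1).
wells = pd.read_csv("visited_wells.csv")  # columns: distance_m, leaking

# Transform the skewed distance towards normality: ln(distance + 100 m).
wells["log_dist"] = np.log(wells["distance_m"] + 100.0)

# Fit the logistic regression on all visited wells.
X = sm.add_constant(wells[["log_dist"]])
fit = sm.Logit(wells["leaking"], X).fit(disp=0)
print(fit.summary())  # estimates, standard errors, z values, p values

# Estimate prediction accuracy by repeated random train/test splits.
accs = []
for seed in range(100):
    train, test = train_test_split(wells, test_size=0.3, random_state=seed)
    f = sm.Logit(train["leaking"], sm.add_constant(train[["log_dist"]])).fit(disp=0)
    p = f.predict(sm.add_constant(test[["log_dist"]]))
    accs.append(np.mean((p > 0.5).astype(int) == test["leaking"].to_numpy()))
print("mean prediction accuracy:", np.mean(accs))
```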