MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A 768 MB dataset for garbage classification, divided into train, test, and val folders.
Note: we use the split files in the splits folder to get the following splits, as there are too many chunks after preprocessing: "train_grid1.0cm_chunk6x6_stride3x3_filtered", "val_grid1.0cm_chunk6x6_stride3x3_filtered", "test_grid1.0cm_chunk6x6_stride3x3_filtered".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for the publication "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared on accuracy and computational speed for three distinct case studies, namely Carlisle (United Kingdom), the Chowilla floodplain (Australia), and the Burnett River (Australia).
The dataset is structured in 5 files: "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". As a minimum, the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV" are needed to run the models. We suggest using the "Carlisle" case study for initial testing given its small size and small data requirement.
"Carlisle", "Chowilla", and "BurnettRV" files
These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between each folder, depending on the hydrodynamic model being emulated and the input boundary conditions (input features). Each case study file has the following folders:
Geometry_data: DEM files, .npz files containing the high-fidelity model's grid (XYZ coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model), and .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).
XXX_modeldata: Folder to store trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains the files used to store the trained GP-EOF model.
HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.
HF_EOF_analysis: Stores data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.
Results_data: Stores the results of running the evaluation of the surrogate models.
Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.
And Python files:
YYY_event_summary, YYY_Extrap_event_summary: Files containing an overview of all events, and which events are connected between the low- and high-fidelity models for each YYY case study.
EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data. This is used for the LSG, GP-EOF, and LSTM-EOF surrogate models.
Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate models for that case study and saving the results for each cross-validation fold.
train_test_split: Script for splitting the flood datasets for each cross-validation fold, so all surrogate models train on the same data.
XXX_training: Script for training each XXX surrogate model.
XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training. This is done using these scripts.
"Comparison_results" file
Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". The figures are also included.
"Python_data" fileFolder containing Python script with utility functions for setting up, training, and running the surrogate models, as well as for evaluating the surrogate models. This folder also contains a python_environment.yml file with all Python package versions and dependencies.This folder also contains two sub-folders:LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also utilized when working with the other surrogate models. SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have for speed and use in this study.
mizuno-group/patent-time-split-supplementary-folder dataset hosted on Hugging Face and contributed by the HF Datasets community
TradingView Ideas Dataset
This dataset contains trading ideas and analysis sourced from TradingView, split into training and testing datasets for machine learning purposes. It includes both image data (chart screenshots) and associated textual descriptions.
Dataset Structure
Root Folder Contents
train.zip: Compressed folder containing training data (images and JSON split).
test.zip: Compressed folder containing testing data (images and JSON split).
… See the full description on the dataset page: https://huggingface.co/datasets/DiljitSingh14/tradingIdeas.
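A minimal sketch of unpacking the training archive and reading its JSON metadata is shown below; the exact JSON file name and its structure are assumptions, so adjust them to whatever train.zip actually contains.

```python
import json
import zipfile
from pathlib import Path

# Unpack the training archive.
with zipfile.ZipFile("train.zip") as zf:
    zf.extractall("train")

# Assumption: the extracted folder holds chart screenshots plus a JSON file
# describing them; inspect the folder and adapt the keys accordingly.
json_path = next(Path("train").rglob("*.json"))
metadata = json.loads(json_path.read_text())
print(json_path.name, type(metadata))
```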
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2_s models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples with 16 sequenced images in 3 channels, resized to 224 × 224 pixels and normalized from 0 to 1.
Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples to apply 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).
We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.
GitHub version
The .zip hosted here contains all files from the project, including the checkpoint and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.
Folders Structure
In the root directory of the project, we have two folders:
magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
Seq_Magnetogram: contains the references for the source images with the corresponding labels in the next 24 h and 48 h, in the respective M24 and M48 sub-folders.
M24/M48: both present the following sub-folder structure:
Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in the root:
inst_packages.sh: installs the packages and dependencies needed to run the models.
download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.
The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...), and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of the trained models.
Naming pattern for the files:
magnetogram_jpg: follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16: follows the format "hmi.sharp_720s...to.", where:
is the date-time when the sequence ends, and follow the same format of . Reference text files in M24 and M48 or inside SF_MViT... folders follows the format "flare_Mclass_.txt", where:
is Seq16 if refers to a sequence, or void if refers direct to images.
"24h" or "48h".
is "TrainVal" or "Test". The refers to the split of Train/Val.
void or "_over" after the extension (...txt_over): means temporary input reference that was over-sampled by a training model. All SF_MViT...folders:
void or "oT" (over Train) or "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test);
"24h" or "48h";
"oneSplit" for a specific split or "allSplits" if run all splits.
void is default to run on 1 GPU or "2gpu" to run on 2-GPU systems; Job submission files: "jobMViT_", where:
points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace). Temporary inputs: "Seq16_flare_Mclass_.txt", where:
train or val;
void or "_over" after the extension (...txt_over): means temporary input reference that was over-sampled by a training model. Outputs: "saida_MViT_Adam_10-7", where:
k0 to k4, means the correlated split of the output, or void if the output is from all splits. Error files: "err_MViT_Adam_10-7", where:
k0 to k4, means the correlated split of the error log file, or void if the error file is from all splits. Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where:
epoch number of the checkpoint;
corresponding valid loss;
0 to 4.
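For orientation, the following sketch loads the MViTv2_S backbone with Kinetics-400 weights from torchvision and runs a dummy batch with the shape described above (10 samples, 3 channels, 16 frames, 224 × 224). This is not the project's training code (see the SF_MViT...py scripts for that), just a minimal sketch under stated assumptions.

```python
import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# Pre-trained MViTv2_S with Kinetics-400 weights, used here as the starting point.
model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
model.eval()

# Dummy batch: 10 samples, 3 channels, 16 sequenced frames, 224x224, values in [0, 1].
x = torch.rand(10, 3, 16, 224, 224)
with torch.no_grad():
    logits = model(x)   # (10, 400) Kinetics-400 logits from the pretrained head
print(logits.shape)

# For flare forecasting, the classification head would presumably be replaced with a
# task-specific one (e.g. M-class flare / no-flare), as done in the SF_MViT scripts.
```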
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"Towards Robotic Mapping of a Honeybee Comb" Dataset This dataset supports the analyses and experiments of the paper: J. Janota et al., "Towards Robotic Mapping of a Honeybee Comb," 2024 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), Delft, Netherlands, 2024, doi: 10.1109/MARSS61851.2024.10612712. Link to Paper | Link to Code Repository Cell Detection The celldet_2023 dataset contains a total of 260 images of the honeycomb (at resolution 67 µm per pixel), with masks from the ViT-H Segment Anything Model (SAM) and annotations for these masks. The structure of the dataset is following:celldet_2023├── {image_name}.png├── ...├── masksH (folder with masks for each image)├────{image_name}.json├────...├── annotations├────annotated_masksH (folder with annotations for training images)├──────{image_name in training part}.csv├──────...├────annotated_masksH_val (folder with annotations for validation images)├──────{image_name in validation part}.csv}├──────...├────annotated_masksH_test (folder with annotations for test images)├──────{image_name in test part}.csv}├──────... Masks For each image there is a .json file that contains all the masks produced by the SAM for the particular image, the masks are in COCO Run-Length Encoding (RLE) format. Annotations The annotation files are split into folders based on whether they were used for training, validation or testing. For each image (and thus also for each .json file with masks), there is a .csv file with two columns: Column id Description 0 order id of the mask in the corresponding .json file 1 mask label: 1 if fully visible cell, 2 if partially occluded cell, 0 otherwise Loading the Dataset For an example of loading the data, see the data loader in the paper repository: python cell_datasetV2.py --img_dir --mask_dir
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gold Standard annotations for the SMM4H-Spanish shared task and unannotated test and background files. SMM4H 2021 was accepted at NAACL (scheduled in Mexico City in June): https://2021.naacl.org/.
Introduction: The entire corpus contains 10,000 annotated tweets. It has been split into training, validation and test (60-20-20). The current version contains the training and development sets of the shared task with Gold Standard annotations. In addition, it contains the unannotated test and background sets. Participants must submit predictions for the files under the directory "test-background-txt-files".
For subtask-1 (classification), annotations are distributed in a tab-separated file (TSV). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id class
For subtask-2 (Named Entity Recognition, profession detection), annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id begin end type extraction
In addition, we provide a tokenized version of the dataset, for participants' convenience. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization.
Zip structure:
subtask-1: files of the tweet classification subtask. Content: one TSV file per corpus split (train and valid).
train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid).
train-valid-txt-files-english: folder with training and validation text files machine translated to English.
test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab.
subtask-2: files of the Named Entity Recognition subtask. Content:
brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid).
TSV: folder with annotations in TSV. One file per corpus split (train and valid).
BIO: folder with the corpus in BIO tagging. One file per corpus split (train and valid).
train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid).
train-valid-txt-files-english: folder with training and validation text files machine translated to English.
test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab.
Annotation quality: We have performed a consistency analysis of the corpus. 10% of the documents have been annotated by an internal annotator as well as by the linguist experts following the same annotation guidelines. The preliminary Inter-Annotator Agreement (pairwise agreement) is 0.919.
Important shared task information: SYSTEM PREDICTIONS MUST FOLLOW THE TSV FORMAT. Systems will only be evaluated on the PROFESION and SITUACION_LABORAL predictions (even though the Gold Standard contains 2 extra entity classes). For more information about the evaluation scenario, see the CodaLab link or the evaluation webpage. For further information, please visit https://temu.bsc.es/smm4h-spanish/ or email us at encargo-pln-life@bsc.es
Resources:
Web
Annotation guidelines (in Spanish)
Annotation guidelines (in English)
FastText COVID-19 Twitter embeddings
Occupations gazetteer
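As a small sketch of consuming the TSV annotations described above with pandas, under the assumption that the files are header-less and tab-separated (the file names used here are illustrative):

```python
import pandas as pd

# Subtask 1 (classification): tweet_id <TAB> class
cls = pd.read_csv("subtask-1/train.tsv", sep="\t",
                  names=["tweet_id", "class"], dtype={"tweet_id": str})

# Subtask 2 (NER): tweet_id <TAB> begin <TAB> end <TAB> type <TAB> extraction
ner = pd.read_csv("subtask-2/TSV/train.tsv", sep="\t",
                  names=["tweet_id", "begin", "end", "type", "extraction"],
                  dtype={"tweet_id": str})

# Keep only the entity classes that are scored in the shared task.
scored = ner[ner["type"].isin(["PROFESION", "SITUACION_LABORAL"])]
print(len(cls), len(scored))
```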
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This dataset contains input-output data of a coupled mass-spring-damper system with a nonlinear force profile. The data was generated with statesim [1], a Python package for simulating linear and nonlinear ODEs, for the system coupled-msd. The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. Then, they are split into subsets for… See the full description on the dataset page: https://huggingface.co/datasets/dany-l-23/coupled-msd.
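One hedged way to fetch the files locally is the huggingface_hub snapshot API, which mirrors the whole dataset repository regardless of its internal layout:

```python
from huggingface_hub import snapshot_download

# Downloads the raw, in-distribution and out-of-distribution folders as stored on the Hub.
local_dir = snapshot_download(repo_id="dany-l-23/coupled-msd", repo_type="dataset")
print(local_dir)
```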
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prerequisites for running the experiments:
Steps to run the experiments:
Download and extract the experiment archive you want to run (single_parameter_experiment.zip or multi_parameter_experiment.zip).
Change into the extracted directory (cd single_parameter_experiment).
Run poetry install --only main to install all dependencies.
Run the experiment with poetry run python run_experiment.py.
The results are stored in data/.
The modules used for the experiment are defined in the file experiment_modules.py and to see the experiment configuration, look in run_experiment.py.
Warning: the experiments take several weeks to run on a single machine, therefore it is advisable to split the experiments based on modules and run them in parallel.
Prerequisites for running the analysis:
Steps to run the analysis:
Download and extract the analysis archive (analysis-adaptive-parameter-control.zip).
Change into the extracted directory (cd analysis-adaptive-parameter-control).
Create the conda environment with conda env create -f environment.yml.
Download Trace data single.zip and Trace data multi.zip.
Place the .nc files in the corresponding folder: analysis-adaptive-parameter-control/single_parameter/ or analysis-adaptive-parameter-control/multi_parameter/.
The coverage_rate_model_single_parameter.nc file goes in the single_parameter folder, while coverage_rate_model_multi_parameter.nc goes in the multi_parameter folder.
Open the notebooks folder (Notebooks/).
Run the notebook of interest (coverage_rate_multi_parameter.ipynb, coverage_rate_single_parameter.ipynb, final_coverage_multi_parameter.ipynb, final_coverage_single_parameter.ipynb, overhead_model_multi_parameter.ipynb, or overhead_model_single_parameter.ipynb).
Warning: this will take a long time; if you don't have the time, use the following alternative instead
The data from when we ran the experiments is available in the Single data.zip and Multi data.zip files.
The structure of these is the following:
statistics.csv: file containing some information about each run and their branch coverage timelines.
Prerequisites for running the parameter assignment analysis:
Steps to run the parameter assignment analysis:
Download and extract the archive (parameter-assignment.zip).
Change into the extracted directory (cd parameter-assignment).
Create the conda environment with conda env create -f environment.yml.
Open the notebooks folder (Notebooks/).
Run parameter_assignment_analysis.ipynb.
Dataset is from https://www.robots.ox.ac.uk/~vgg/data/flowers/17/. It contains images of flowers from 17 different species, with 80 images per class/species, for a total of 1360 images. A test split of 25% has been applied, resulting in 20 images per class for the test set and 60 images per class for the train & validation set. The data was then restructured into a folder-per-class format, with a separate folder for the test versus the training & validation split.
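Because the data is arranged as one folder per class with separate test and train/validation folders, it can be read with a generic image-folder loader. The sketch below uses torchvision; the top-level folder names are assumptions and should be adjusted to the actual ones.

```python
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumed folder names for the two top-level splits.
train_val = datasets.ImageFolder("flowers17/train_val", transform=tfm)
test = datasets.ImageFolder("flowers17/test", transform=tfm)

train_loader = torch.utils.data.DataLoader(train_val, batch_size=32, shuffle=True)
print(len(train_val), len(test), train_val.classes[:5])
```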
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Load classification dataset
This algorithm loads a classification dataset from a given folder. It can also split the dataset into train and validation folders....
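The algorithm itself is not reproduced here; as a rough sketch of what a folder-based train/validation split can look like (all paths and the split ratio are assumptions):

```python
import random
import shutil
from pathlib import Path

def split_dataset(src="dataset", dst="dataset_split", val_ratio=0.2, seed=0):
    """Copy a folder-per-class dataset into train/ and val/ sub-folders."""
    random.seed(seed)
    for class_dir in Path(src).iterdir():
        if not class_dir.is_dir():
            continue
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        random.shuffle(images)
        n_val = int(len(images) * val_ratio)
        for i, img in enumerate(images):
            split = "val" if i < n_val else "train"
            target = Path(dst) / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(img, target / img.name)

split_dataset()
```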
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
ESC50
This is an audio classification dataset for Environmental Sound Classification. Classes = 50, Split = Train-Test
Structure
audios folder contains audio files. csv_files folder contains CSV files for five-fold cross-validation. To perform cross-validation on fold 1, train_1.csv will be used for the training split and test_1.csv for the testing split, with the same pattern followed for the other folds. To perform training and testing without cross-validation… See the full description on the dataset page: https://huggingface.co/datasets/MahiA/ESC50.
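A minimal sketch of using the fold-1 CSVs with pandas and torchaudio is shown below; the column names (and the presence of a filename column pointing into audios/) are assumptions to be checked against the actual CSV headers.

```python
import pandas as pd
import torchaudio

fold = 1
train_df = pd.read_csv(f"csv_files/train_{fold}.csv")
test_df = pd.read_csv(f"csv_files/test_{fold}.csv")

# Assumption: each row has a "filename" column (a file inside audios/) and a "label" column.
first = train_df.iloc[0]
waveform, sample_rate = torchaudio.load(f"audios/{first['filename']}")
print(waveform.shape, sample_rate, first.get("label"))
```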
File Restoration and Extraction Guide
File Structure
Root directory: Contains Part 1 split files
part2/ directory: Contains Part 2 split files
Instructions
Step 1: File Restoration
Due to size limitations, the original file has been split. To restore the complete file: cat images_1024.part_* > images_1024.tar
Step 2: Extraction
To extract the contents: tar -xvf images_1024.tar
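If a shell is not available (for example on Windows), a Python equivalent of the two steps above could look like this sketch:

```python
import glob
import shutil
import tarfile

# Step 1: concatenate the split parts, in lexical order, into the original tar file.
with open("images_1024.tar", "wb") as out:
    for part in sorted(glob.glob("images_1024.part_*")):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)

# Step 2: extract the tar archive into the current directory.
with tarfile.open("images_1024.tar") as tar:
    tar.extractall()
```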
Important Notes
For Part 1 images: Execute… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/AnyInstruct-resolution-1024.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated for the purpose of training convolutional neural network (CNN) models for permeability prediction of 2D structures. The whole dataset is part of a study on predicting permeability using CNNs, while addressing discussions that are largely absent from the current literature, such as the effect of data diversity on accuracy, input pre-processing, error estimation, architecture comparisons, and sources of error. A link to the publication, which includes much more detail about the dataset and CNN models, will be added once it is published.
The data included in this dataset is split into three different folders. The data under the "Training Data" folder includes 4,500 images, divided in 15 sub-folders. Each sub-folder contains 901 files, which are 300 images, a pressure-velocity map for each of the 300 structures, convergence data for each individual structure, and one comma-separated file (csv) summarizing all simulation results in the folder. The pressure and velocity maps together with the convergence information are direct results of the CFD algorithm used, but the important information for training the CNN models are the images and the permeability data in the csv files.
The "Trained CNNs" folder contains all of the trained CNN models as described in the linked publication for predicting permeability. That includes the ensemble of VGG19 networks. The "External Test Set" includes the same type of data as the "Training Data" folder, but this section of data was only used to test the CNN models. In other words, the trained CNN models never saw any of this data in training, only in testing. The "External Test Set" folder also includes data for phase-size distributions and surface area for all data in this repository. For more details on those, refer to the publication.
The CFD code and the image generation code can be found in the following GitHub, along with more extensive documentation: https://github.com/adama-wzr/PixelBasedPermeability/
Each Videos ZIP Segment file is a segment of a split ZIP. To access the videos in the split ZIP, download all Videos ZIP Segment files that are part of the split ZIP to a single folder (including the Videos ZIP Target file that has the full “.zip” extension), rename all of the downloaded files to have the same root filename (e.g., rename “pone.0308790.s001.z01” to “Videos.z01”, rename “pone.0308790.s002.z02” to “Videos.z02”, rename “pone.0308790.s003.z03” to “Videos.z03”, etc.), then open the file with the “.zip” extension (e.g., “Videos.zip”), navigate the folders within the ZIP and select a video to open. Alternatively, after downloading all ZIP and ZXX files into a folder, unzip all files by opening the .ZIP file with a compressing software such as WinRAR, WinZip, or 7-Zip. This will give access to the 177 GorillaFACS video examples organized in their respective folders for each AU and AD. https://doi.org/10.1371/journal.pone.0308790.s001 (Z01)
This dataset was created by Ramsi Kalia
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Context
The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content
The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
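A small sketch of a PyTorch dataset that consumes the manifest files described above is given below; whether the label column stores category names or integer ids is an assumption, so adapt the mapping accordingly.

```python
import pandas as pd
import torch
from PIL import Image
from torchvision import transforms

class Caltech256Manifest(torch.utils.data.Dataset):
    """Loads images listed in train.csv / test.csv (columns: image_path, label)."""

    def __init__(self, csv_path, root=".", transform=None):
        self.df = pd.read_csv(csv_path)
        self.root = root
        self.transform = transform or transforms.ToTensor()
        # Assumption: labels are category names; map them to integer indices.
        self.classes = sorted(self.df["label"].unique())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.root}/{row['image_path']}").convert("RGB")
        return self.transform(image), self.class_to_idx[row["label"]]

train_ds = Caltech256Manifest("train.csv")
```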
Data Split
The dataset has been partitioned into a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
Acknowledgements & Original Source This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A.D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.