Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
- data/sample.csv: Sample dataset file.
- data/train.csv: Training dataset.
- data/test.csv: Testing dataset.
- scripts/preprocess.py: Script for preprocessing the dataset.
- scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:

import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Processed PAMAP2 dataset
This dataset is based on the [PAMAP2 Dataset for Physical Activity Monitoring](https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring).
Compared to v0.2.0, this preprocessed dataset contains fewer activities. It only includes: lying, sitting, standing, walking, cycling, vacuum_cleaning and ironing.
The data is processed with the code from [this script](https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_pamap2.py), with the following function call:
```python
columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
# Activity IDs to exclude, keeping only lying, sitting, standing,
# walking, cycling, vacuum_cleaning and ironing
exclude_activities = [5, 7, 9, 10, 11, 12, 13, 18, 19, 20, 24, 0]
outputpath = tutorial_pamap2.fetch_and_preprocess(directory_to_extract_to, columns_to_use,
                                                  exclude_activities=exclude_activities,
                                                  val_test_size=(100, 1000))
```
## References
A. Reiss and D. Stricker. Introducing a New Benchmarked Dataset for Activity Monitoring. The 16th IEEE International Symposium on Wearable Computers (ISWC), 2012.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e., advanced search) according to the following criteria: (1) keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) keywords (all of): masque (French), mask (English), máscara (Spanish); (3) periods: March 2020, May 2020, July 2020; (4) countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) are built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for the terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e., C-Value, F-TFIDF-C_M) and methods (i.e., extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of unwanted features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces a package for the Python programming language for preprocessing LC-MS data for quality control procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are showcased with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies, in addition to NIST SRM 1950 – Metabolites in Frozen Human Plasma. The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.
This dataset consists of the following files:
This resource includes materials for the workshop about configuring and running a NextGen simulation and analyzing model outputs, presented during the 2025 NWCSI Bootcamp.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supercoiling-mediated feedback simulation dataset
Background
These files represent simulation datasets generated for the publication "Supercoiling-mediated feedback rapidly couples and tunes transcription" by Christopher Johnstone and Kate E. Galloway.
All figures in the paper can be replicated by using the code available at https://github.com/GallowayLabMIT/tangles_model (permalink) and these datasets.
File summary
unprocessed_datasets.zip
contains the merged Julia simulation files.
preprocessed_datasets.zip
contains the smaller, preprocessed datasets used for the actual plotting of data figures.
File format
The preprocessed datasets are serialized Pandas dataframes (gzipped Parquet files).
The unprocessed datasets are self-describing HDF/H5 files.
Usage
The main figure-plotting notebook, notebooks/modeling_paper_figures.ipynb, contained in the code repository mentioned above, can use either the unprocessed or the preprocessed datasets. If the preprocessed datasets are present, it will load them directly. If the preprocessed datasets are not present, the notebook will preprocess the data.
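For quick inspection outside the notebook, the sketch below loads one of the serialized dataframes with pandas; the filename is hypothetical and should be replaced with an actual file extracted from preprocessed_datasets.zip.

```python
import pandas as pd

# Load one preprocessed dataframe (gzip compression is handled internally by Parquet).
# The path below is an assumption, not a file name documented by the authors.
df = pd.read_parquet("preprocessed_datasets/example.parquet")
print(df.head())
```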
License
This data is available under a CC-BY 4.0 International License. Please attribute:
Christopher Johnstone (cjohnsto@mit.edu)
Kate E. Galloway (katiegal@mit.edu)
The increasingly high number of big data applications in seismology has made quality control tools to filter, discard, or rank data extremely important. In this framework, machine learning algorithms, already established in several seismic applications, are good candidates to perform the task flexibly and efficiently. sdaas (seismic data/metadata amplitude anomaly score) is a Python library and command line tool for detecting a wide range of amplitude anomalies on any seismic waveform segment, such as recording artifacts (e.g., anomalous noise, peaks, gaps, spikes), sensor problems (e.g., digitizer noise), and metadata field errors (e.g., wrong stage gain in StationXML). The underlying machine learning model, based on the isolation forest algorithm, has been trained and tested on a broad variety of seismic waveforms of different lengths, from local to teleseismic earthquakes to noise recordings, from both broadband sensors and accelerometers. For this reason, the software assures a high degree of flexibility and ease of use: for any given input (a waveform in miniSEED format and its metadata as StationXML, either given as file paths or FDSN URLs), the computed anomaly score is a probability-like numeric value in [0, 1] indicating the degree of belief that the analyzed waveform represents an anomaly (or outlier), where scores ≤ 0.5 indicate no distinct anomaly. sdaas can be employed to filter malformed data in a preprocessing routine, to assign robustness weights, or as a metadata checker by computing scores on randomly selected segments from a given station/channel: in this case, a persistent sequence of high scores clearly indicates problems in the metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of a mobile device, such as supported data rates.
This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.
It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.
Related dataset
The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, which uses the same data layout and recording equipment.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each received PR is collected:
- MAC address
- Supported data rates
- Extended supported rates
- HT capabilities
- Extended capabilities
- Data under the Extended tag and Vendor Specific tag
- Interworking
- VHT capabilities
- RSSI
- SSID
- Timestamp when the PR was received
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.
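As an illustration of the capture step, the minimal Pyshark sketch below listens for probe requests on a monitor-mode interface. This is not the authors' script; the interface name and display filter are assumptions, and tshark must be installed.

```python
import pyshark

# Capture 802.11 probe requests on a WiFi interface in monitor mode.
# "wlan1mon" is an assumed interface name for the monitoring dongle.
capture = pyshark.LiveCapture(
    interface="wlan1mon",
    display_filter="wlan.fc.type_subtype == 0x0004",  # probe request frames
)

for packet in capture.sniff_continuously(packet_count=10):
    # The transmitter address carries the (possibly randomized) device MAC.
    print(packet.wlan.ta)
```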
Data preprocessing
The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:
    PR_IE_data = {
        'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
        'HT_CAP': DATA_htcap,
        'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
        'VHT_CAP': DATA_vhtcap,
        'INTERWORKING': DATA_inter,
        'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
        'VENDOR_SPEC': {
            VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
            VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
            ...
        }
    }
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_DATA.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.
This data structure allows storing only the time of arrival ('TIME') and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of a newly detected PR is compared with the already stored data for the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the data for the keys 'TIME' and 'RSSI' is appended. If identical PR IE data from the same MAC address has not yet been received, the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
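A minimal sketch of the merging logic described above is shown below; the function and field names are illustrative assumptions and do not come from the actual collection script.

```python
def add_probe_request(interval_store, pr):
    """Merge one decoded probe request into the per-scan-interval structure."""
    entry = interval_store.setdefault(
        pr['MAC'], {'MAC': pr['MAC'], 'SSIDs': [], 'PROBE_REQs': []}
    )
    if pr['SSID'] and pr['SSID'] not in entry['SSIDs']:
        entry['SSIDs'].append(pr['SSID'])
    for pr_data in entry['PROBE_REQs']:
        if pr_data['DATA'] == pr['IE']:          # identical IE data already stored
            pr_data['TIME'].append(pr['TIME'])   # keep only time and RSSI
            pr_data['RSSI'].append(pr['RSSI'])
            return
    entry['PROBE_REQs'].append(
        {'TIME': [pr['TIME']], 'RSSI': [pr['RSSI']], 'DATA': pr['IE']}
    )
```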
Folder structure
For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, each containing the samples collected by that device.
The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 00:00 local time until 24 September 2022 00:00 local time.
The files map to locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
Environments description
The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and the correctness of the installation and the data status of the entire data collection system were personally checked. The devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.
Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania
Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage area would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.
Known dataset shortcomings
Due to technical and physical limitations, the dataset contains some identified deficiencies.
PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.
Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data are not recorded in each 20-minute period.
The devices had a scheduled reboot at 4:00 each day, which appears as up to a few minutes of missing data.
Location 1 - Piazza del Duomo - Chierici
The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location was constant and undisturbed, and the dataset appears to have complete coverage.
Location 2 - Via Etnea - Piazza del Duomo
The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.
Location 3 - Via Etnea - Piazza Università
Similar to location 2, the device was placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and then moved inside, behind a thick wall, when no people are present. This device appears to have been collecting data throughout the whole dataset period.
Location 4 - Piazza Università
This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment. The internet connection was also interrupted sporadically.
Recognitions
The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zipped file includes the dataset in .csv file and python scripts used to preprocess video data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script uses the following Python libraries: matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model. The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The dataset is organized into Train, Val, and Test directories, each containing subdirectories for the four classes. The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability. The model includes a Patches layer that extracts patches from the images; this is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer (see the sketch after this section). Results are visualized with seaborn to provide a clear understanding of the model's predictions.
Dataset Preparation
Organize the dataset into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
Install Required Libraries
pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
Run the Script
Analyze Results
Fine-Tuning
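As referenced above, the following is a minimal sketch of a patch-extraction layer of the kind used in Vision Transformers, following the standard Keras ViT example; the actual layer in the script may be implemented differently.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Patches(layers.Layer):
    """Splits input images into flattened, non-overlapping square patches."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid of patches into a sequence of patch vectors.
        return tf.reshape(patches, [batch_size, -1, patch_dims])
```

For example, Patches(16) applied to a batch of 224x224x3 images yields sequences of 196 flattened 16x16 patches per image.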
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages
Overview
BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of the 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from WordProject, a subset of BhasaAnuvaad.
How to use
The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/WordProject.
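A minimal sketch of loading the data with the datasets library is shown below; the split name and use of streaming are assumptions, so check the dataset card for the available configurations and splits.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front; the "train" split
# name is an assumption and may differ for this repository.
ds = load_dataset("ai4bharat/WordProject", split="train", streaming=True)
print(next(iter(ds)))
```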
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects’ meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 using the Huawei Watch 2™ and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.
Description
FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and to continue wearing it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participants, who documented the start and end moments of their meals to the best of their abilities, as well as the hand on which they wore the smartwatch. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.
$ python stats_dataset.py
$ python viz_dataset.py
FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.
Publications
If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:
@article{kyritsis2020data,
title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
journal={IEEE Journal of Biomedical and Health Informatics},
year={2020},
publisher={IEEE}}
@inproceedings{kyritsis2017automated,
title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
year={2019},
organization={IEEE}}
Technical details
We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:
import pickle as pkl
import numpy as np

with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
    dataset = pkl.load(fh)
The dataset variable in the snippet above is a dictionary with 5 keys. Namely:
'subject_id'
'session_id'
'signals_raw'
'signals_proc'
'meal_gt'
The contents under a specific key can be obtained by:
sub = dataset['subject_id']     # subject id
ses = dataset['session_id']     # session id
raw = dataset['signals_raw']    # raw IMU signals
proc = dataset['signals_proc']  # processed IMU signals
gt = dataset['meal_gt']         # meal ground truth
The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.
sub: list Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject, which can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC’s subject with id equal to 2 is the same person as FreeFIC’s subject with id equal to 2.
ses: list Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session, which can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.
raw: list Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays differ (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.
proc: list Each element of this list is an M × 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike the elements of the raw list, the processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The interested researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).
meal_gt: list Each element of this list is a K × 2 matrix. Each row represents a meal interval of the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., there is a recording where a participant consumed two meals).
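As an illustration of how the aligned lists can be combined, the sketch below extracts the processed IMU samples recorded during the first meal of the first session. It reuses the variables from the loading snippet above; the slicing logic is an assumption based on the layout described here, not part of the official scripts.

```python
session_idx = 0                    # first in-the-wild session
proc_session = proc[session_idx]   # M x 7 array: time, acc x/y/z, gyro x/y/z
meals = gt[session_idx]            # K x 2 array of meal (start, end) times in seconds

start, end = meals[0]              # first meal interval of this session
timestamps = proc_session[:, 0]
mask = (timestamps >= start) & (timestamps <= end)
meal_samples = proc_session[mask]  # IMU samples recorded during that meal
print(meal_samples.shape)
```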
Ethics and funding
Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.
Contact
Any inquiries regarding the FreeFIC dataset should be addressed to:
Dr. Konstantinos KYRITSIS
Multimedia Understanding Group (MUG)
Department of Electrical & Computer Engineering
Aristotle University of Thessaloniki
University Campus, Building C, 3rd floor
Thessaloniki, Greece, GR54124
Tel: +30 2310 996359, 996365
Fax: +30 2310 996398
E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geological Survey Ireland has a core scanning suite consisting of a Short-Wave Infra-red (SWIR) camera and a Medium-Wave Infra-red (MWIR) camera. We have over 400 km of drill core in our core store and are in the process of scanning all of it. We currently have ~7 TB of data. This data is freely available, but due to the size of the files please email gsi.corestore[AT]gsi.ie so we can facilitate delivery. This is a sample dataset consisting of 1 box of core: a single core box scanned in the Short-Wave Infra-red range for use with the explanatory notebooks available on our GitHub repository. This data consists of box 25 of drillhole GSI-17-007, 105.98 m to 110.35 m. This box contains the contact between the Ballymore Formation and the Oakport Formation. We are open to collaboration using either the scanner or the data with any of our stakeholders. For questions, issues, suggestions for improvement or to discuss collaboration, please contact Russell Rogers, c/o duty.geologist[AT]gsi.ie. We also have a GitHub repository that hosts notebooks using the sample dataset, explaining some of the methods we have used in Python to pre-process and process our image data:
1. Opening and Starting with Geological Survey Ireland Hyperspectral Data
2. Denoising Geological Survey Ireland Hyperspectral Data
3. Removing the core box from the image
4. Removing the continuum
5. Clustering
The clustering notebook uses the Minisom module, because it is a very lightweight implementation with minimal dependencies, but there are many other SOM implementations available in Python.
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The Hunter groundwater model. This was created using the preprocess.py script in the "HUN GW Model code v01" acting on the index.xml file contained in the top-level directory of this dataset. The index.xml file contains provenance information for the raw-data (HUN GW Model Mines raw data v01) and tells the preprocessing scripts where to find this raw data. As the groundwater model is successively built using the python scripts, provenance is successively added to the files generated (as headers, or similar data structures). The exception to this is the finite-element mesh (mesh3D/mesh3D.*) which is the main output of the process, and which eventually contains such an enormous amount of provenance that the code suffers from buffer overflows: therefore its provenance is dumped to mesh3D/provenance_dump periodically.
As the scripts gradually build the groundwater model, they modify index.xml: it acts as a journal file for model creation, allowing provenance backtracking when using any standard xml viewer.
The final MOOSE input files are found in the "simulate" directory, and an example of the final MOOSE output is the "HUN GW Model simulate ua999 pawsey v01" dataset.
Created using preprocess.py found in the "HUN GW Model code v01" dataset, acting on the index.xml file, and hence using the raw data found in the "HUN GW Model Mines raw data v01" dataset.
Bioregional Assessment Programme (XXXX) HUN GW Model v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/90554dbf-4992-49ec-98b1-53c6067e97a2.
Derived From HUN GW Model code v01
Derived From HUN GW Model Mines raw data v01
This dataset is presented in the context of real-world data science work and how data analysts and data scientists work.
The dataset consists of four columns: Year, Level_1 (Ethnic group/gender), Level_2 (Age group), and population.
I would sincerely like to thank GeoIQ for sharing this dataset with me along with the tasks. Just having a basic knowledge of Pandas, NumPy and other Python data science libraries is not enough; how you execute tasks and how you preprocess the data before making any prediction is very important. Most of the datasets on Kaggle are clean and well arranged, but this dataset taught me how real-world data science and analysis works. Every data science beginner should work on this dataset and try to execute the tasks. It will give them good exposure to the real data science world.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
New version: The Python scripts to run the lab experiment were added.
Open data: Neural electrophysiological correlates of detection and identification awareness. Supplementary material for the associated publication.
OVERVIEW
Humans have conscious experiences of the events in their environment. Previous research using electroencephalography (EEG) has shown visual awareness negativity (VAN) at about 200 ms to be a neural correlate of consciousness (NCC). In the present study, the stimulus was a ring with a Gabor patch tilting either left or right. On each trial, subjects rated their awareness on a three-level perceptual awareness scale that captured both detection (something vs. nothing) and identification (identification vs. something). Separate staircases were used to adjust stimulus opacity to the detection threshold and the identification threshold. Event-related potentials were extracted for VAN and late positivity.
DATE & LOCATION OF DATA COLLECTION
Subjects (N = 43, student volunteers) were tested between 23 May and 30 June 2022 at the Department of Psychology, Campus Albano, Stockholm, Sweden.
DATA & FILE OVERVIEW
The files contain the raw data, scripts, and results of the main and supplementary analyses of the electroencephalography (EEG) study reported in the main publication. For convenience, the report files of the main analyses in the manuscript are saved separately.
- Visual awareness negativity (VAN) results: analysis_VANo_clean_data_blocklength_16_pawarelimit0.8_maxopadetect_maxopaidentify_badEEGyes_ntrials25.html
- Late positivity (LP) results: analysis_LPo_clean_data_blocklength_16_pawarelimit0.8_maxopadetect_maxopaidenify_badEEGyes_ntrials25.html
- bdf_up_to_20.zip: EEG data files for the first 20 subjects in .bdf format (generated by the Biosemi amplifier)
- bdf_after_20.zip: EEG data files for the remaining subjects in .bdf format (generated by the Biosemi amplifier)
- Log.zip: log files of the EEG session (generated by Python)
- readme_notes_on_id.txt: information about issues during data collection
- psychopy.zip: scripts in Python and PsychoPy to run the experiment. Scripts were written by Rasmus Eklund.
- MNE-python.zip: scripts in MNE-Python to preprocess the EEG data. Scripts were written by Rasmus Eklund.
- R_graded.zip: the main reports are in R_graded > results > reports. They are .html files generated with Quarto.
- photodiode_supplement.pdf: supplementary analysis of the relationship between Python opacity settings and actual changes on the computer screen
METHODOLOGICAL INFORMATION
The visual stimuli were Gabor-grated rings. Subjects rated their awareness of the rings. Event-related potentials were computed from the EEG data. The experiment was programmed in Python: https://www.python.org/. The EEG data were recorded as .bdf files with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com).
Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html; R and relevant packages: https://www.r-project.org/
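As an illustration, a minimal MNE-Python sketch for opening one of the .bdf recordings is shown below; the filename and filter settings are assumptions, and the actual preprocessing is defined in the scripts inside MNE-python.zip.

```python
import mne

# Hypothetical filename; the real .bdf files are inside bdf_up_to_20.zip / bdf_after_20.zip.
raw = mne.io.read_raw_bdf("subject_01.bdf", preload=True)
raw.filter(l_freq=0.1, h_freq=40.0)  # assumed band-pass for ERP analysis
print(raw.info)
```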
The original data generated by our idealized experiments using the WRF model is very large, so we used Fortran (you can also use Python, MATLAB and other tools) to preprocess the data and obtain the main variables needed for our research analysis. The preprocessed data is in binary format. The WRF model is a numerical weather prediction and atmospheric research model developed by organizations including the National Center for Atmospheric Research (NCAR) and the National Centers for Environmental Prediction (NCEP) in the USA. WRF is open-source software and can be downloaded from https://github.com/wrf-model/WRF/releases. The specific parameters and settings used to configure the WRF model runs are described in detail in the paper. Interested researchers can follow the settings in the paper to regenerate the original raw data. However, the raw data files are very large (tens of GB per file), making direct analysis difficult. Therefore, we used tools like Fortran to preprocess the raw da...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages
Overview
BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of the 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from SeamlessAlign, a subset of BhasaAnuvaad.
How to use
The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/SeamlessAlign.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
This study investigated the dependence of the early tropical cyclone (TC) weakening rate in response to an imposed moderate environmental vertical wind shear (VWS) on the warm-core strength and height of the TC vortex using idealized numerical simulations. Results show that the weakening of the warm core by upper-level ventilation is the primary factor leading to the early TC weakening in response to an imposed environmental VWS. The upper-level ventilation is dominated by eddy radial advection of the warm-core air. The TC weakening rate is roughly proportional to the warm-core strength and height of the initial TC vortex. The boundary-layer ventilation shows no relationship with the early weakening rate of the TC in response to an imposed moderate VWS. The findings suggest that some previous diverse results regarding the TC weakening in environmental VWS could be partly due to the different warm-core strengths and heights of the initial TC vortex.
Methods
The original data generated by our idealized experiments using the WRF model is very large, so we used Fortran (you can also use Python, MATLAB and other tools) to preprocess the data and obtain the main variables needed for our research analysis. The preprocessed data is in binary format. The WRF model is a numerical weather prediction and atmospheric research model developed by organizations including the National Center for Atmospheric Research (NCAR) and the National Centers for Environmental Prediction (NCEP) in the USA. WRF is open-source software and can be downloaded from https://github.com/wrf-model/WRF/releases. The specific parameters and settings used to configure the WRF model runs are described in detail in the paper. Interested researchers can follow the settings in the paper to regenerate the original raw data. However, the raw data files are very large (tens of GB per file), making direct analysis difficult. Therefore, we used tools like Fortran to preprocess the raw data into smaller binary files containing the key variables needed for analysis, such as potential temperature, etc. The binary files are around a few hundred MB in size. We strongly recommend that subsequent researchers directly use these preprocessed binary data files, which will greatly simplify the data processing workflow.
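For illustration, a minimal sketch of reading one of the preprocessed binary files with Python is shown below; the filename, dtype, and grid dimensions are assumptions and must match how the preprocessing wrote the file (including any Fortran record markers, if present).

```python
import numpy as np

# Hypothetical example: a single 3D field (e.g., potential temperature) stored
# as raw 32-bit floats; adjust dtype and shape to the actual preprocessing output.
nz, ny, nx = 50, 300, 300
theta = np.fromfile("theta.bin", dtype=np.float32).reshape(nz, ny, nx)
print(theta.shape, float(theta.mean()))
```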