44 datasets found
  1. warvan-ml-dataset

    • huggingface.co
    Cite
    warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
    Authors
    warvan
    Description

    Dataset Name

    This dataset contains structured data for machine learning and analysis purposes.

      Contents
    

    data/sample.csv: Sample dataset file.
    data/train.csv: Training dataset.
    data/test.csv: Testing dataset.
    scripts/preprocess.py: Script for preprocessing the dataset.
    scripts/analyze.py: Script for data analysis.

      Usage
    

    Load the dataset using Pandas:

    import pandas as pd
    df = pd.read_csv('data/sample.csv')

    Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

  2. PAMAP2 dataset preprocessed v0.3.0

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jan 24, 2020
    Cite
    Dafne van Kuppevelt; Vincent van Hees; Christiaan Meijer; Dafne van Kuppevelt; Vincent van Hees; Christiaan Meijer (2020). PAMAP2 dataset preprocessed v0.3.0 [Dataset]. http://doi.org/10.5281/zenodo.834467
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dafne van Kuppevelt; Vincent van Hees; Christiaan Meijer; Dafne van Kuppevelt; Vincent van Hees; Christiaan Meijer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # Processed PAMAP2 dataset
    This dataset is based on the [PAMAP2 Dataset for Physical Activity Monitoring](https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring).

    Compared to v0.2.0, this preprocessed dataset contains fewer activities. It includes only: lying, sitting, standing, walking, cycling, vaccuum_cleaning, and ironing.

    The data is processed with the code from [this script](https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_pamap2.py), with the following function call:

    ```python
    # tutorial_pamap2 is the script linked above (mcfly-tutorial/utils/tutorial_pamap2.py);
    # directory_to_extract_to is the download/extraction directory chosen by the user.
    from utils import tutorial_pamap2

    columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                      'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
                      'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
    exclude_activities = [5, 7, 9, 10, 11, 12, 13, 18, 19, 20, 24, 0]
    outputpath = tutorial_pamap2.fetch_and_preprocess(directory_to_extract_to,
                                                      columns_to_use,
                                                      exclude_activities=exclude_activities,
                                                      val_test_size=(100, 1000))
    ```

    ## References
    A. Reiss and D. Stricker. Introducing a New Benchmarked Dataset for Activity Monitoring. The 16th IEEE International Symposium on Wearable Computers (ISWC), 2012.

  3. Data from: COVID-19 and media dataset: Mining textual data according periods...

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Dec 21, 2020
    Cite
    Mathieu Roche; Mathieu Roche (2020). COVID-19 and media dataset: Mining textual data according periods and countries (UK, Spain, France) [Dataset]. http://doi.org/10.18167/DVN1/ZUA8MF
    Available download formats: application/x-gzip (511157), application/x-gzip (97349), text/x-perl-script (4982), application/x-gzip (93110), application/x-gzip (23765310), application/x-gzip (107669)
    Dataset updated
    Dec 21, 2020
    Authors
    Mathieu Roche; Mathieu Roche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom, Spain, France
    Dataset funded by
    ANR (#DigitAg)
    Horizon 2020 - European Commission - (MOOD project)
    Description

    These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e., advanced search) according to the following criteria: (1) keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) keywords (all of): masque (French), mask (English), máscara (Spanish); (3) periods: March 2020, May 2020, July 2020; (4) countries: UK (English), Spain (Spanish), France (French). A corpus per country was manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) were built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for terminology extraction (with BioTex) and classification (with Weka) is available. A new version of this dataset (December 2020) includes additional data: (1) Python preprocessing and BioTex code [Execution_BioTex.tgz]; (2) terms extracted with different ranking measures (i.e., C-Value, F-TFIDF-C_M) and methods (i.e., extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
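
    A minimal sketch for loading one of the Weka .arff corpora in Python; the file name is hypothetical, and heavily textual ARFF files may need the liac-arff package instead of SciPy:

    ```python
    # Hypothetical sketch: reading one of the .arff corpora into pandas.
    # 'corpus_uk_march2020.arff' is a placeholder file name.
    from scipy.io import arff
    import pandas as pd

    data, meta = arff.loadarff('corpus_uk_march2020.arff')
    df = pd.DataFrame(data)
    print(meta.names())  # attribute names declared in the ARFF header
    print(df.shape)
    ```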

  4. Data from: A Python-based pipeline for preprocessing LC-MS data for...

    • data.niaid.nih.gov
    xml
    Updated Nov 21, 2020
    Cite
    NICOLAS ZABALEGUI (2020). A Python-based pipeline for preprocessing LC-MS data for untargeted metabolomics workflows [Dataset]. https://data.niaid.nih.gov/resources?id=mtbls1919
    Available download formats: xml
    Dataset updated
    Nov 21, 2020
    Dataset provided by
    CIBION-CONICET
    Authors
    NICOLAS ZABALEGUI
    Variables measured
    Metabolomics
    Description

    Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of unwanted features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces a package for the Python programming language for preprocessing LC-MS data for quality control procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are showcased with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies, in addition to NIST SRM 1950 – Metabolites in Frozen Human Plasma. The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.

  5. Temperature and Humidity Time Series of Cold Storage Room Monitoring

    • zenodo.org
    bin, csv, png, zip
    Updated Jun 30, 2025
    Cite
    Elia Henrichs; Elia Henrichs; Florian Stoll; Christian Krupitzer; Christian Krupitzer; Florian Stoll (2025). Temperature and Humidity Time Series of Cold Storage Room Monitoring [Dataset]. http://doi.org/10.5281/zenodo.15130001
    Available download formats: png, bin, zip, csv
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elia Henrichs; Elia Henrichs; Florian Stoll; Christian Krupitzer; Christian Krupitzer; Florian Stoll
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.

    This dataset consists of the following files:

    • Raw.zip - The raw data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. These files can contain multiple headers.
    • Preprocessed.zip - The preprocessed data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. Multiple headers were removed, and the length of the datasets was aligned to equal length by filling missing values with NaN.
    • DataPreprocessing.ipynb - Jupyter Notebook containing the code to preprocess the data and create the overview file, which summarizes key characteristics of the dataset.
    • DataPreliminaryAnalysis.ipynb - Jupyter Notebook containing the code to perform the preliminary data analysis (general statistics, peaks, and matrix profiles).
    • experiment_actions.csv - CSV file logging performed actions (door openings and sensor movements).
    • overview.csv - CSV file summarizing key characteristics of the dataset and preliminary data analysis.
    • temphum_logger.ino - Source code to run the Arduino-based data logger with a sampling interval of 5 seconds.
    • Arduino_setup_sketch_v1.png - Circuit diagram of the Arduino-based data logger.
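
    A minimal loading sketch for one of the preprocessed logger files (the extracted file name is hypothetical; the column layout follows the description above):

    ```python
    import pandas as pd

    # One CSV per logger inside Preprocessed.zip; 'logger1.csv' is a placeholder.
    df = pd.read_csv('logger1.csv', sep=';')
    df.columns = ['date', 'time', 'temperature', 'humidity']
    # Combine the dd.mm.yyyy date and HH:MM:SS time into a timestamp index.
    df.index = pd.to_datetime(df['date'] + ' ' + df['time'],
                              format='%d.%m.%Y %H:%M:%S')
    df = df.drop(columns=['date', 'time'])
    print(df[['temperature', 'humidity']].describe())
    ```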

  6. CUAHSI Workshop 3: Configuring and Running a NextGen Simulation and...

    • dataone.org
    • hydroshare.org
    Updated Jun 28, 2025
    Cite
    Irene Garousi-Nejad; Anthony M. Castronova (2025). CUAHSI Workshop 3: Configuring and Running a NextGen Simulation and Analyzing Model Outputs [Dataset]. https://dataone.org/datasets/sha256%3A6e7cae1512b4f15aec44c6ee4252b6f9c92ac49370f354ae75a3dfbd2b49e8f4
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    Hydroshare
    Authors
    Irene Garousi-Nejad; Anthony M. Castronova
    Description

    This resource includes materials for the workshop about configuring and running a NextGen simulation and analyzing model outputs, presented during the 2025 NWCSI Bootcamp.

  7. Supercoiling-mediated feedback simulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 7, 2022
    Cite
    Galloway, Kate E (2022). Supercoiling-mediated feedback simulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7041640
    Dataset updated
    Sep 7, 2022
    Dataset provided by
    Johnstone, Christopher P
    Galloway, Kate E
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supercoiling-mediated feedback simulation dataset

    Background

    These files represent simulation datasets generated for the publication "Supercoiling-mediated feedback rapidly couples and tunes transcription" by Christopher Johnstone and Kate E. Galloway.

    All figures in the paper can be replicated by using the code available at https://github.com/GallowayLabMIT/tangles_model (permalink) and these datasets.

    File summary

    unprocessed_datasets.zip contains the merged Julia simulation files.

    preprocessed_datasets.zip contains the smaller, preprocessed datasets used for the actual plotting of data figures.

    File format

    The preprocessed datasets are serialized Pandas dataframes (gzipped Parquet files).

    The unprocessed datasets are self-describing HDF/H5 files.

    Usage

    The main figure-plotting notebook, notebooks/modeling_paper_figures.ipynb, contained in the code repository mentioned above can use either the unprocessed or preprocessed datasets. If the preprocessed datasets are present, it will load them directly; if they are not, the notebook will preprocess the data.
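
    A minimal loading sketch (the file names are hypothetical stand-ins for files extracted from the two archives):

    ```python
    import pandas as pd
    import h5py

    # Preprocessed datasets are serialized pandas dataframes (gzipped Parquet);
    # Parquet compression is handled transparently by read_parquet.
    df = pd.read_parquet('example_sweep.parquet')
    print(df.head())

    # Unprocessed datasets are self-describing HDF5 files.
    with h5py.File('example_simulation.h5', 'r') as fh:
        fh.visit(print)  # list every group/dataset path in the file
    ```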

    License

    This data is available under a CC-BY 4.0 International License. Please attribute:

    Christopher Johnstone (cjohnsto@mit.edu)

    Kate E. Galloway (katiegal@mit.edu)

  8. sdaas - a Python tool computing an amplitude anomaly score of seismic data...

    • b2find.eudat.eu
    Updated Jun 29, 2007
    Cite
    (2007). sdaas - a Python tool computing an amplitude anomaly score of seismic data and metadata using simple machine learning algorithm - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b0ff5f26-69b6-597b-a879-299e3c5118f1
    Dataset updated
    Jun 29, 2007
    Description

    The increasingly high number of big data applications in seismology has made quality control tools to filter, discard, or rank data extremely important. In this framework, machine learning algorithms, already established in several seismic applications, are good candidates to perform the task flexibly and efficiently. sdaas (seismic data/metadata amplitude anomaly score) is a Python library and command line tool for detecting a wide range of amplitude anomalies on any seismic waveform segment, such as recording artifacts (e.g., anomalous noise, peaks, gaps, spikes), sensor problems (e.g., digitizer noise), and metadata field errors (e.g., wrong stage gain in StationXML). The underlying machine learning model, based on the isolation forest algorithm, has been trained and tested on a broad variety of seismic waveforms of different lengths, from local to teleseismic earthquakes to noise recordings, from both broadband sensors and accelerometers. For this reason, the software assures a high degree of flexibility and ease of use: from any given input (a waveform in miniSEED format and its metadata as StationXML, either given as file paths or FDSN URLs), the computed anomaly score is a probability-like numeric value in [0, 1] indicating the degree of belief that the analyzed waveform represents an anomaly (or outlier), where scores ≤ 0.5 indicate no distinct anomaly. sdaas can be employed for filtering malformed data in a preprocessing routine, assigning robustness weights, or as a metadata checker by computing scores on randomly selected segments from a given station/channel: in this case, a persistent sequence of high scores clearly indicates problems in the metadata.
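
    As a generic illustration of the underlying technique (not sdaas's actual code), an isolation forest can be fit on waveform feature vectors and its output mapped to a [0, 1]-style anomaly score:

    ```python
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    features = rng.normal(size=(1000, 3))      # stand-in amplitude features
    model = IsolationForest(random_state=0).fit(features)

    # score_samples is higher for inliers; rescale so larger = more anomalous.
    raw = model.score_samples(features)
    anomaly_score = (raw.max() - raw) / (raw.max() - raw.min())
    print(anomaly_score[:5])
    ```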

  9. Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...

    • data.niaid.nih.gov
    Updated Jan 6, 2023
    Cite
    Aleš Simončič (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7509279
    Dataset updated
    Jan 6, 2023
    Dataset provided by
    Mihael Mohorčič
    Aleš Simončič
    Andrej Hrovat
    Miha Mohorčič
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame part of PRs consists of variable-length fields, called Information Elements (IE), which represent the capabilities of a mobile device, such as supported data rates.

    This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

    It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

    Related dataset

    The same authors also produced the Labeled dataset of IEEE 802.11 probe requests with the same data layout and recording equipment.

    Measurement setup

    The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

    The following information about each received PR is collected:
    - MAC address
    - supported data rates
    - extended supported rates
    - HT capabilities
    - extended capabilities
    - data under extended tag and vendor specific tag
    - interworking
    - VHT capabilities
    - RSSI
    - SSID
    - timestamp when the PR was received

    The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.

    Data preprocessing

    The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:

    PR_IE_data = {
        'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
        'HT_CAP': DATA_htcap,
        'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
        'VHT_CAP': DATA_vhtcap,
        'INTERWORKING': DATA_inter,
        'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
        'VENDOR_SPEC': {
            VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
            VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
            ...
        }
    }

    Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
    Missing IE fields in the captured PR are not included in PR_IE_DATA.

    When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

    {'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

    where PR_data is structured as follows:

    { 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.

    This data structure allows storing only 'TIME' (time of arrival) and 'RSSI' for all PRs originating from the same MAC address and containing the same PR_IE_data. All SSIDs from the same MAC address are also stored. The data of a newly detected PR is compared with the already stored data for the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the values for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.

    At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
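
    A minimal sketch for inspecting the example capture, assuming the per-MAC structure described above (the actual file may wrap it in scan-interval metadata):

    ```python
    import json

    with open('Single_PR_capture_example.json') as fh:
        record = json.load(fh)

    print(record['MAC'], record['SSIDs'])
    for pr in record['PROBE_REQs']:
        # one entry per distinct PR_IE_data; TIME/RSSI hold all repetitions
        print(len(pr['TIME']), 'PRs with identical IE data')
    ```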

    Folder structure

    For ease of processing of the data, the dataset is divided into 7 folders, each containing a 24-hour period. Each folder contains four files, each containing samples from that device.

    The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from the 23rd of September 2022 at 00:00 local time until the 24th of September 2022 at 00:00 local time.

    Files map to their locations as follows:
    - 1.json -> location 1
    - 2.json -> location 2
    - 3.json -> location 3
    - 4.json -> location 4

    Environments description

    The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correctness of installation and data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

    Four Raspberry Pis were used:
    - location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
    - location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
    - location 3 -> northernmost window in the building of Via Etnea near Piazza Università
    - location 4 -> first window to the right of the entrance of the University of Catania

    Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

    Known dataset shortcomings

    Due to technical and physical limitations, the dataset contains some identified deficiencies.

    PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

    Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can no longer respond. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.

    The devices had a scheduled reboot at 4:00 each day, which shows up as missing data of up to a few minutes.

     Location 1 - Piazza del Duomo - Chierici
    

    The gateway device (RPi) is located on the second floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location was constant and undisturbed; the dataset appears to have complete coverage.

     Location 2 - Via Etnea - Piazza del Duomo
    

    The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.

     Location 3 - Via Etnea - Piazza Università
    

    Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside a thick wall when no people are present. This device appears to have been collecting data throughout the whole dataset period.

     Location 4 - Piazza Università
    

    This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device had lost power several times during the deployment. The internet connection was also interrupted sporadically.

    Recognitions

    The data was collected within the scope of Resiloc project with the help of City of Catania and project partners.

  10. Utility Functions for Victorian On-bike Cycling Legacy Dataset

    • researchdata.edu.au
    Updated Apr 3, 2023
    Cite
    Lingheng Meng (2023). Utility Functions for Victorian On-bike Cycling Legacy Dataset [Dataset]. http://doi.org/10.26180/22358221.V3
    Dataset updated
    Apr 3, 2023
    Dataset provided by
    Monash University
    Authors
    Lingheng Meng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This zipped file includes the dataset as a .csv file and the Python scripts used to preprocess the video data.

  11. Apple Leaf Disease Detection Using Vision Transformer

    • zenodo.org
    text/x-python
    Updated Jun 20, 2025
    Cite
    Amreen Batool; Amreen Batool (2025). Apple Leaf Disease Detection Using Vision Transformer [Dataset]. http://doi.org/10.5281/zenodo.15702007
    Available download formats: text/x-python
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amreen Batool; Amreen Batool
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.


    Introduction

    The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Code Explanation

    1. Importing Libraries

    • The script starts by importing necessary libraries such as matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.

    2. Visualizing the Dataset

    • The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class.
    • The dataset is divided into Train, Val, and Test directories, each containing subdirectories for the four classes.

    3. Data Augmentation

    • The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
    • Separate generators are created for training, validation, and test datasets.
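
    A minimal augmentation setup along these lines might look as follows (the image size, batch size, and rescaling factor are assumptions, not the script's exact values):

    ```python
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(rescale=1. / 255,
                                       rotation_range=20,
                                       horizontal_flip=True)
    train_gen = train_datagen.flow_from_directory('Train',
                                                  target_size=(224, 224),
                                                  batch_size=32,
                                                  class_mode='categorical')
    ```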

    4. Patch Visualization

    • The script defines a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.
    • The script visualizes these patches for different patch sizes (32x32, 16x16, 8x8) to understand how the image is divided.
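
    Such a layer typically follows the standard Keras ViT pattern; a sketch (not necessarily the script's exact code):

    ```python
    import tensorflow as tf

    class Patches(tf.keras.layers.Layer):
        """Split a batch of images into flattened square patches."""
        def __init__(self, patch_size):
            super().__init__()
            self.patch_size = patch_size

        def call(self, images):
            batch_size = tf.shape(images)[0]
            patches = tf.image.extract_patches(
                images=images,
                sizes=[1, self.patch_size, self.patch_size, 1],
                strides=[1, self.patch_size, self.patch_size, 1],
                rates=[1, 1, 1, 1],
                padding='VALID')
            patch_dims = patches.shape[-1]
            # shape: (batch, num_patches, patch_size * patch_size * channels)
            return tf.reshape(patches, [batch_size, -1, patch_dims])
    ```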

    5. Model Training

    • The script defines a Vision Transformer (ViT) model using TensorFlow and Keras. The model is compiled with the Adam optimizer and categorical cross-entropy loss.
    • The model is trained for a specified number of epochs, and the training history is stored for later analysis.

    6. Model Evaluation

    • After training, the model is evaluated on the test dataset. The script generates a confusion matrix and a classification report to assess the model's performance.
    • The confusion matrix is visualized using seaborn to provide a clear understanding of the model's predictions.

    7. Visualizing Misclassified Images

    • The script includes functionality to visualize misclassified images, which helps in understanding where the model is making errors.

    8. Fine-Tuning and Learning Rate Adjustment

    • The script demonstrates how to fine-tune the model by adjusting the learning rate and re-training the model.

    Steps for Implementation

    1. Dataset Preparation

      • Ensure that the dataset is organized into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
    2. Install Required Libraries

      • Install the necessary Python libraries using pip:
        pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
    3. Run the Script

      • Execute the script in a Python environment. The script will automatically:
        • Load and preprocess the dataset.
        • Apply data augmentation.
        • Train the Vision Transformer model.
        • Evaluate the model and generate performance metrics.
    4. Analyze Results

      • Review the confusion matrix and classification report to understand the model's performance.
      • Visualize misclassified images to identify potential areas for improvement.
    5. Fine-Tuning

      • Experiment with different patch sizes, learning rates, and data augmentation techniques to improve the model's accuracy.

  12. WordProject

    • huggingface.co
    Updated Jan 16, 2025
    + more versions
    Cite
    AI4Bharat (2025). WordProject [Dataset]. https://huggingface.co/datasets/ai4bharat/WordProject
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    AI4Bharat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

      Overview
    

    BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of the 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from WordProject, a subset of BhasaAnuvaad.

      How to use
    

    The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/WordProject.
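
    A minimal loading sketch with the datasets library (the split name is an assumption; check the dataset page for the actual configs and splits):

    ```python
    from datasets import load_dataset

    # Streaming avoids downloading the full corpus up front.
    ds = load_dataset("ai4bharat/WordProject", split="train", streaming=True)
    print(next(iter(ds)))  # first parallel speech-translation record
    ```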

  13. Wrist-mounted IMU data towards the investigation of free-living human eating...

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    + more versions
    Cite
    Kyritsis, Konstantinos (2022). Wrist-mounted IMU data towards the investigation of free-living human eating behavior - the Free-living Food Intake Cycle (FreeFIC) dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4420038
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Delopoulos, Anastasios
    Diou, Christos
    Kyritsis, Konstantinos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects’ meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 with the Huawei Watch 2™, the rest with the MobVoi TicWatch™) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.

    Description

    FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and continue to wear it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participant by documenting the start and end moments of their meals to the best of their abilities, as well as the hand they wear the smartwatch on. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.

    The script(s) and the pickle file must be located in the same directory.

    Tested with Python 3.6.4

    Requirements: Numpy, Pickle and Matplotlib

    Calculate and echo dataset statistics

    $ python stats_dataset.py

    Visualize signals and ground truth

    $ python viz_dataset.py

    FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.

    Publications

    If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:

    @article{kyritsis2020data,
      title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
      author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
      journal={IEEE Journal of Biomedical and Health Informatics},
      year={2020},
      publisher={IEEE}}

    @inproceedings{kyritsis2017automated,
      title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
      author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
      booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
      year={2019},
      organization={IEEE}}

    Technical details

    We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:

    import pickle as pkl
    import numpy as np

    with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
        dataset = pkl.load(fh)

    The dataset variable in the snippet above is a dictionary with 5 keys, namely:

    'subject_id'

    'session_id'

    'signals_raw'

    'signals_proc'

    'meal_gt'

    The contents under a specific key can be obtained by:

    sub = dataset['subject_id']    # subject id
    ses = dataset['session_id']    # session id
    raw = dataset['signals_raw']   # raw IMU signals
    proc = dataset['signals_proc'] # processed IMU signals
    gt = dataset['meal_gt']        # meal ground truth

    The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.

    sub: list. Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject, which can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC’s subject with id equal to 2 is the same person as FreeFIC’s subject with id equal to 2.

    ses: list. Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session, which can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.

    raw: list. Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays are different (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.

    proc: list. Each element of this list is an M × 7 numpy.ndarray that contains the timestamps, 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds; the second, third and fourth columns contain the x, y and z accelerometer values in g; and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz, and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The potential researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth and remove the gravitational component).

    meal_gt: list. Each element of this list is a K × 2 matrix. Each row represents a meal interval for the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second one contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., recordings exist where a participant consumed two meals).
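
    Putting the aligned lists together, a small sketch (assuming the meal_gt matrices are numpy arrays) that reports the meals in each session:

    ```python
    # ses and gt come from the loading snippet above.
    for s, meals in zip(ses, gt):
        total = float((meals[:, 1] - meals[:, 0]).sum())
        print(f'session {s}: {meals.shape[0]} meal(s), {total:.0f} s of eating')
    ```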

    Ethics and funding

    Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.

    Contact

    Any inquiries regarding the FreeFIC dataset should be addressed to:

    Dr. Konstantinos KYRITSIS

    Multimedia Understanding Group (MUG) Department of Electrical & Computer Engineering Aristotle University of Thessaloniki University Campus, Building C, 3rd floor Thessaloniki, Greece, GR54124

    Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr

  14. IE GSI Hyperspectral Sample Data RAW

    • hub.arcgis.com
    • opendata-geodata-gov-ie.hub.arcgis.com
    Updated Oct 23, 2024
    Cite
    Geological Survey Ireland (2024). IE GSI Hyperspectral Sample Data RAW [Dataset]. https://hub.arcgis.com/documents/370923c450d64760b4c110fa1a38e9f0
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Geological Survey of Ireland
    Authors
    Geological Survey Ireland
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geological Survey Ireland has a core scanning suite consisting of a Short-Wave Infra-red (SWIR) camera and a Medium-Wave Infra-red (MWIR) camera. We have over 400 km of drill core in our core store and are in the process of scanning all of it. We currently have ~7 TB of data. This data is freely available, but due to the size of the files please email gsi.corestore[AT]gsi.ie so we can facilitate delivery.

    This is a sample dataset consisting of 1 box of core: a single core-box scanned in the Short Wave Infra-red range for use with the explanatory notebooks available on our GitHub repository. The data consists of box 25 of drillhole GSI-17-007, 105.98 m to 110.35 m. This box contains the contact between the Ballymore Formation and the Oakport Formation.

    We are open to collaboration using either the scanner or the data with any of our stakeholders. For questions, issues, suggestions for improvement or to discuss collaboration, please contact Russell Rogers, c/o duty.geologist[AT]gsi.ie.

    We also have a GitHub repository that hosts notebooks using the sample dataset, explaining some of the methods we have used in Python to pre-process and process our image data:

    1. Opening and Starting with Geological Survey Ireland Hyperspectral Data
    2. Denoising Geological Survey Ireland Hyperspectral Data
    3. Removing the core box from the image
    4. Removing the continuum
    5. Clustering

    The clustering notebook uses the MiniSom module, because it is a very lightweight implementation with minimal dependencies, but there are many other SOM implementations available in Python.
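
    As a generic illustration of the clustering step (shapes and parameters are assumptions, not the notebook's exact values), MiniSom can be run on per-pixel spectra:

    ```python
    import numpy as np
    from minisom import MiniSom

    spectra = np.random.rand(1000, 256)   # stand-in: 1000 pixels x 256 bands
    som = MiniSom(5, 5, spectra.shape[1], sigma=1.0, learning_rate=0.5,
                  random_seed=0)
    som.train_random(spectra, 1000)       # 1000 training iterations
    labels = [som.winner(s) for s in spectra]  # winning node = cluster per pixel
    print(labels[:5])
    ```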

  15. HUN GW Model v01

    • data.gov.au
    • researchdata.edu.au
    • +2more
    Updated Aug 9, 2023
    + more versions
    Cite
    Bioregional Assessment Program (2023). HUN GW Model v01 [Dataset]. https://data.gov.au/data/dataset/90554dbf-4992-49ec-98b1-53c6067e97a2
    Dataset updated
    Aug 9, 2023
    Dataset authored and provided by
    Bioregional Assessment Program
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    The Hunter groundwater model. This was created using the preprocess.py script in the "HUN GW Model code v01" acting on the index.xml file contained in the top-level directory of this dataset. The index.xml file contains provenance information for the raw-data (HUN GW Model Mines raw data v01) and tells the preprocessing scripts where to find this raw data. As the groundwater model is successively built using the python scripts, provenance is successively added to the files generated (as headers, or similar data structures). The exception to this is the finite-element mesh (mesh3D/mesh3D.*) which is the main output of the process, and which eventually contains such an enormous amount of provenance that the code suffers from buffer overflows: therefore its provenance is dumped to mesh3D/provenance_dump periodically.

    As the scripts gradually build the groundwater model, they modify index.xml: it acts as a journal file for model creation, allowing provenance backtracking when using any standard xml viewer.

    The final MOOSE input files are found in the "simulate" directory, and an example of the final MOOSE output is the "HUN GW Model simulate ua999 pawsey v01" dataset.

    Dataset History

    Created using preprocess.py found in the "HUN GW Model code v01" dataset, acting on the index.xml file, and hence using the raw data found in the "HUN GW Model Mines raw data v01" dataset.

    Dataset Citation

    Bioregional Assessment Programme (XXXX) HUN GW Model v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/90554dbf-4992-49ec-98b1-53c6067e97a2.

    Dataset Ancestors

  16. Singapore Residents dataset

    • kaggle.com
    Updated Aug 28, 2019
    Cite
    Anuj_sahay (2019). Singapore Residents dataset [Dataset]. https://www.kaggle.com/anujsahay112/singapore-residents-dataset/kernels
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anuj_sahay
    Area covered
    Singapore
    Description

    Context

    This dataset is presented in the context of real-world data science work and how data analysts and data scientists operate.

    Content

    The dataset consists of four columns: Year, Level_1 (ethnic group/gender), Level_2 (age group), and population.

    Acknowledgements

    I would sincerely like to thank GeoIQ for sharing this dataset with me along with tasks. Just having a basic knowledge of Pandas, NumPy and other Python data science libraries is not enough; how you execute tasks and how you preprocess the data before making any prediction is very important. Most of the datasets on Kaggle are clean and well arranged, but this dataset taught me how real-world data science and analysis work. Every data science beginner should work on this dataset and try to execute the tasks. It will give them good exposure to the real data science world.

    Inspiration

    1. Identify the largest Ethnic group in Singapore. Their average population growth over the years and what proportion of the total population do they constitute.
    2. Identify the largest age group in Singapore. Their average population growth over the years and what proportion of the total population do they constitute.
    3. Identify the group (by age, ethnicity and gender) that: a. Has shown the highest growth rate b. Has shown the lowest growth rate c. Has remained the same
    4. Plot a graph for population trends
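
    A hypothetical sketch for task 1 in pandas (the file name is a placeholder; column names follow the Content section above):

    ```python
    import pandas as pd

    df = pd.read_csv('singapore_residents.csv')
    by_group = (df.groupby('Level_1')['population'].sum()
                  .sort_values(ascending=False))
    largest = by_group.index[0]
    share = by_group.iloc[0] / by_group.sum()
    print(f'Largest group: {largest} ({share:.1%} of the total population)')
    ```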

  17. Open data: Neural electrophysiological correlates of detection and...

    • su.figshare.com
    • researchdata.se
    html
    Updated Aug 30, 2024
    Cite
    Stefan Wiens (2024). Open data: Neural electrophysiological correlates of detection and identification awareness [Dataset]. http://doi.org/10.17045/sthlmuni.21354195.v2
    Available download formats: html
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    New version: the Python scripts to run the lab experiment were added.

    Open data: Neural electrophysiological correlates of detection and identification awareness. Supplementary material for the associated publication.

    OVERVIEW
    Humans have conscious experiences of the events in their environment. Previous research from electroencephalography (EEG) has shown visual awareness negativity (VAN) at about 200 ms to be a neural correlate of consciousness (NCC). In the present study, the stimulus was a ring with a Gabor patch tilting either left or right. On each trial, subjects rated their awareness on a three-level perceptual awareness scale that captured both detection (something vs. nothing) and identification (identification vs. something). Separate staircases were used to adjust stimulus opacity to the detection threshold and the identification threshold. Event-related potentials were extracted for VAN and late positivity.

    DATE & LOCATION OF DATA COLLECTION
    Subjects (N = 43, student volunteers) were tested between 2022-May-23 and 2022-June-30 at the Department of Psychology, Campus Albano, Stockholm, Sweden.

    DATA & FILE OVERVIEW
    The files contain the raw data, scripts, and results of main and supplementary analyses of the electroencephalography (EEG) study reported in the main publication. For convenience, the report files of the main analyses in the manuscript are saved separately.
    - Visual awareness negativity (VAN) results: analysis_VANo_clean_data_blocklength_16_pawarelimit0.8_maxopadetect_maxopaidentify_badEEGyes_ntrials25.html
    - Late positivity (LP) results: analysis_LPo_clean_data_blocklength_16_pawarelimit0.8_maxopadetect_maxopaidenify_badEEGyes_ntrials25.html
    - bdf_up_to_20.zip: EEG data files for the first 20 subjects in .bdf format (generated by the Biosemi amplifier)
    - bdf_after_20.zip: EEG data files for the remaining subjects in .bdf format (generated by the Biosemi amplifier)
    - Log.zip: log files of the EEG session (generated by Python)
    - readme_notes_on_id.txt: information about issues during data collection
    - psychopy.zip: scripts in Python and PsychoPy to run the experiment. Scripts were written by Rasmus Eklund.
    - MNE-python.zip: scripts in MNE-Python to preprocess the EEG data. Scripts were written by Rasmus Eklund.
    - R_graded.zip: the main reports are in R_graded > results > reports. They are .html files generated with Quarto.
    - photodiode_supplement.pdf: supplementary analysis of the relationship between Python opacity settings and actual changes on the computer screen

    METHODOLOGICAL INFORMATION
    The visual stimuli were Gabor-grated rings. Subjects rated their awareness of the rings. Event-related potentials were computed from the EEG data. The experiment was programmed in Python: https://www.python.org/. The EEG data were recorded as .bdf files with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com).

    Instrument- or software-specific information needed to interpret the data:
    - MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html
    - R and relevant packages: https://www.r-project.org/
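
    A minimal sketch for opening one of the .bdf recordings with MNE-Python (the file name is a placeholder; the filter settings are illustrative assumptions, not those of the study):

    ```python
    import mne

    raw = mne.io.read_raw_bdf('sub-01.bdf', preload=True)
    print(raw.info)           # channels, sampling rate, recording metadata
    raw.filter(0.1, 40.0)     # a typical ERP band-pass (assumed settings)
    ```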

  18. Data from: The effect of initial vortex asymmetric structure on tropical...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 15, 2023
    Cite
    Qi Gao; Yuqing Wang (2023). The effect of initial vortex asymmetric structure on tropical cyclone intensity change in response to an imposed environmental vertical wind shear [Dataset]. http://doi.org/10.5061/dryad.vt4b8gtz1
    Available download formats: zip
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Dryad
    Authors
    Qi Gao; Yuqing Wang
    Time period covered
    Jul 11, 2023
    Description

    The original data generated by our idealized experiments using the WRF model is very large, so we used Fortran (you can also use Python, MATLAB and other tools) to preprocess the data and obtain the main variables needed for our research analysis. The preprocessed data is in binary format. The WRF model is a numerical weather prediction and atmospheric research model developed by organizations including the National Center for Atmospheric Research (NCAR) and the National Centers for Environmental Prediction (NCEP) in the USA. WRF is open-source software and can be downloaded from https://github.com/wrf-model/WRF/releases. The specific parameters and settings used to configure the WRF model runs are described in detail in the paper. Interested researchers can follow the settings in the paper to regenerate the original raw data. However, the raw data files are very large (tens of GB per file), making direct analysis difficult. Therefore, we used tools like Fortran to preprocess the raw da...

  19. SeamlessAlign

    • huggingface.co
    Updated Jan 16, 2025
    Cite
    AI4Bharat (2025). SeamlessAlign [Dataset]. https://huggingface.co/datasets/ai4bharat/SeamlessAlign
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    AI4Bharat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

      Overview
    

    BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of the 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from SeamlessAlign, a subset of BhasaAnuvaad.

      How to use
    

    The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/SeamlessAlign.

  20. Data from: Dependence of tropical cyclone weakening rate in response to an...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 11, 2024
    Cite
    Qi Gao; Yuqing Wang (2024). Dependence of tropical cyclone weakening rate in response to an imposed moderate environmental vertical wind shear on the warm-core strength and height of the initial vortex [Dataset]. http://doi.org/10.5061/dryad.xgxd254nq
    Available download formats: zip
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    University of Hawaiʻi at Mānoa
    Fudan University
    Authors
    Qi Gao; Yuqing Wang
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    This study investigated the dependence of the early tropical cyclone (TC) weakening rate in response to an imposed moderate environmental vertical wind shear (VWS) on the warm-core strength and height of the TC vortex using idealized numerical simulations. Results show that the weakening of the warm core by upper-level ventilation is the primary factor leading to the early TC weakening in response to an imposed environmental VWS. The upper-level ventilation is dominated by eddy radial advection of the warm-core air. The TC weakening rate is roughly proportional to the warm-core strength and height of the initial TC vortex. The boundary-layer ventilation shows no relationship with the early weakening rate of the TC in response to an imposed moderate VWS. The findings suggest that some previous diverse results regarding TC weakening in environmental VWS could be partly due to the different warm-core strengths and heights of the initial TC vortex.

    Methods

    The original data generated by our idealized experiments using the WRF model is very large, so we used Fortran (you can also use Python, MATLAB and other tools) to preprocess the data and obtain the main variables needed for our research analysis. The preprocessed data is in binary format. The WRF model is a numerical weather prediction and atmospheric research model developed by organizations including the National Center for Atmospheric Research (NCAR) and the National Centers for Environmental Prediction (NCEP) in the USA. WRF is open-source software and can be downloaded from https://github.com/wrf-model/WRF/releases. The specific parameters and settings used to configure the WRF model runs are described in detail in the paper. Interested researchers can follow the settings in the paper to regenerate the original raw data. However, the raw data files are very large (tens of GB per file), making direct analysis difficult. Therefore, we used tools like Fortran to preprocess the raw data into smaller binary files containing the key variables needed for analysis, such as potential temperature, etc. The binary files are around a few hundred MB in size. We strongly recommend that subsequent researchers directly use these preprocessed binary data files, which will greatly simplify the data processing workflow.
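
    A hypothetical sketch for reading one of the preprocessed binary files with NumPy; the file name, dtype, byte order, and grid shape are assumptions that must match the Fortran writer (Fortran record markers, if present, would also need to be skipped):

    ```python
    import numpy as np

    theta = np.fromfile('potential_temperature.bin', dtype='<f4')
    theta = theta.reshape(50, 300, 300)   # assumed (levels, ny, nx) grid
    print(theta.mean())
    ```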
