58 datasets found
  1. SPP_30K_reasoning_tasks

    • huggingface.co
    Updated Aug 20, 2023
    + more versions
    Cite
    Farouk (2023). SPP_30K_reasoning_tasks [Dataset]. https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2023
    Authors
    Farouk
    Description

    Dataset Card for "SPP_30K_verified_tasks"

      Dataset Summary
    

    This is an augmented version of the Synthetic Python Problems (SPP) Dataset. It was generated from the subset of the data that was de-duplicated and verified using a Python interpreter (SPP_30k_verified.jsonl). The original dataset contains small Python functions that include a docstring with a short description of what the function does and some calling examples for the function. The current… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks.
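
    A minimal sketch for loading this dataset with the Hugging Face datasets library; the "train" split name is an assumption, so check the dataset page for the available splits.

    from datasets import load_dataset

    # split name assumed; see the dataset page for the actual splits
    spp = load_dataset("pharaouk/SPP_30K_reasoning_tasks", split="train")
    print(spp[0])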

  2. CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Dataset updated
    Dec 30, 2024
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes: * Six million methods overall * Two million of which have associated documentation (docstrings, JavaDoc, and more) * Metadata that indicates the original location (repository or line number, for example) where the data was found
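
    A minimal sketch for loading the Python subset through the Hugging Face datasets hub; the "code_search_net" dataset identifier and the "python" configuration name are assumptions not stated in this listing, since the corpus is also distributed as raw files.

    from datasets import load_dataset

    # dataset id and config assumed; adjust to however you obtain the corpus
    csn = load_dataset("code_search_net", "python", split="train")
    print(csn.column_names)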

  3. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see Step 2 detail levels)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  4. Python-codes

    • huggingface.co
    Updated Sep 13, 2023
    Cite
    Arjun G Ravi (2023). Python-codes [Dataset]. https://huggingface.co/datasets/Arjun-G-Ravi/Python-codes
    Explore at: Croissant
    Dataset updated
    Sep 13, 2023
    Authors
    Arjun G Ravi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    Please note that this dataset may not be perfect and may contain a very small quantity of non-Python code, but the quantity appears to be very small.

      Dataset Summary
    

    The dataset contains a collection of Python questions and their code. It is meant to be used for training models to be efficient in Python-specific coding. The dataset has two features - 'question' and 'code'. An example is: {'question': 'Create a function that takes in a string… See the full description on the dataset page: https://huggingface.co/datasets/Arjun-G-Ravi/Python-codes.
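
    A minimal sketch for loading this dataset and inspecting the 'question' and 'code' fields described above; the "train" split name is an assumption.

    from datasets import load_dataset

    # split name assumed; 'question' and 'code' are the documented features
    ds = load_dataset("Arjun-G-Ravi/Python-codes", split="train")
    print(ds[0]["question"])
    print(ds[0]["code"])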

  5. MNIST IDX Dataset- Fasion

    • kaggle.com
    Updated May 21, 2025
    Cite
    ShreyaSuresh (2025). MNIST IDX Dataset- Fasion [Dataset]. https://www.kaggle.com/datasets/shreyasuresh0407/mnist-idx-dataset-fasion
    Explore at: Croissant
    Dataset updated
    May 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShreyaSuresh
    Description

    📦 About the Dataset

    This project uses a classic machine learning dataset of handwritten digits — the MNIST dataset — stored in IDX format.

    🧠 Each image is a 28x28 pixel grayscale picture of a handwritten number from 0 to 9. Your task is to teach a simple neural network (your "brain") to recognize these digits.

    🔍 What’s Inside?

    • train-images-idx3-ubyte: 🖼️ 60,000 training images (28x28 pixels each)
    • train-labels-idx1-ubyte: 🔢 Labels (0–9) for each training image
    • t10k-images-idx3-ubyte: 🖼️ 10,000 test images
    • t10k-labels-idx1-ubyte: 🔢 Labels (0–9) for test images

    All files are in the IDX binary format, which is compact and fast for loading, but needs to be parsed using a small Python function (see below 👇).
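
    A minimal sketch of such a parser, assuming the standard IDX layout (a 4-byte magic number giving the data type and number of dimensions, one big-endian uint32 per dimension, then raw unsigned bytes); the resulting train_images/train_labels arrays feed the plotting cell further below.

    import struct
    import numpy as np

    def load_idx(path):
      # magic number: two zero bytes, dtype code (0x08 = unsigned byte), number of dims
      with open(path, "rb") as f:
        _, _, _, ndims = struct.unpack(">BBBB", f.read(4))
        shape = struct.unpack(">" + "I" * ndims, f.read(4 * ndims))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

    train_images = load_idx("train-images-idx3-ubyte")  # shape (60000, 28, 28)
    train_labels = load_idx("train-labels-idx1-ubyte")  # shape (60000,)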

    ✨ Why This Dataset Is Awesome

    • 🎯 It's the “Hello World” of machine learning — perfect for beginners
    • 📊 Ideal for testing image classification algorithms
    • 🧠 Helps you learn how neural networks "see" numbers
    • 💥 Small enough to train quickly, powerful enough to learn real skills

    🧩 Sample Image

    (Add this cell below in your notebook to visualize a few images)

    import matplotlib.pyplot as plt
    
    # Show the first 10 images (train_images/train_labels as loaded from the IDX files above)
    fig, axes = plt.subplots(1, 10, figsize=(15, 2))
    for i in range(10):
      axes[i].imshow(train_images[i], cmap="gray")
      axes[i].set_title(f"Label: {int(train_labels[i])}")
      axes[i].axis("off")
    plt.show()
    
  6. Python functions -- cross-validation methods from a data-driven perspective

    • zenodo.org
    • phys-techsciences.datastations.nl
    bin, txt, zip
    Updated Aug 14, 2024
    + more versions
    Cite
    Yanwen Wang; Yanwen Wang (2024). Python functions -- cross-validation methods from a data-driven perspective [Dataset]. http://doi.org/10.17026/pt/txau9w
    Available download formats: txt, bin, zip
    Dataset updated
    Aug 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yanwen Wang; Yanwen Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 28, 2024
    Description

    These are the organised Python functions of the methods proposed in Yanwen Wang's PhD research. Researchers can directly use these functions to conduct spatial+ cross-validation (SP-CV), dissimilarity quantification by adversarial validation (AVD), and dissimilarity-adaptive cross-validation (DA-CV). A description of how to run the code is in Readme.txt. The descriptions of the functions are in functions.docx.
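
    The dataset's own function signatures are not reproduced in this listing. As a rough illustration of the general idea of spatially grouped cross-validation (not the SP-CV/AVD/DA-CV functions shipped with this dataset), scikit-learn's GroupKFold can hold out whole spatial blocks:

    # Illustrative only: generic grouped CV, NOT the functions from this dataset.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))           # hypothetical predictors
    y = X[:, 0] + rng.normal(size=200)      # hypothetical response
    blocks = rng.integers(0, 10, size=200)  # hypothetical spatial block IDs

    scores = cross_val_score(RandomForestRegressor(n_estimators=100), X, y,
                 groups=blocks, cv=GroupKFold(n_splits=5))
    print(scores.mean())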

  7. Software Defects Dataset 1k

    • kaggle.com
    Updated Jun 16, 2025
    Cite
    Ravikumar R N (2025). Software Defects Dataset 1k [Dataset]. https://www.kaggle.com/datasets/ravikumarrn/software-defects-dataset-1k/versions/1
    Explore at: Croissant
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravikumar R N
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📦 Software Defects Multilingual Dataset with AST & Token Features

    This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.

    🙋 Citation

    If you use this dataset in your research or project, please cite it as:

    "Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."

    🧠 Dataset Highlights

    • Languages Included: Python, Java, JavaScript, C, C++, Go, Rust
    • Records: 1,000 code snippets
    • Labels: defect (1 = buggy, 0 = clean)
    • Features:

      • token_count: Total tokens (AST-based for Python)
      • num_ifs, num_returns, num_func_calls: Code structure features
      • ast_nodes: Number of nodes in the abstract syntax tree (Python only)
      • lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

      📊 Columns Description

    function_name: Unique identifier for the function
    code: The actual function source code
    language: Programming language used
    lines_of_code: Approximate number of lines in the function
    cyclomatic_complexity: Simulated measure of decision complexity
    defect: 1 = buggy, 0 = clean
    token_count: Total token count (Python uses AST tokens)
    num_ifs: Count of 'if' statements
    num_returns: Count of 'return' statements
    num_func_calls: Number of function calls
    ast_nodes: AST node count (Python only, fallback = token count)

    🛠️ Usage Examples

    This dataset is suitable for:

    • Training traditional ML models like Random Forests or XGBoost
    • Evaluating prompt-based or fine-tuned LLMs (e.g., CodeT5, GPT-4)
    • Feature importance studies using AST and static code metrics
    • Cross-lingual transfer learning in code understanding
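
    As an illustration of the first use case above, a minimal sketch for training a Random Forest on the numeric columns listed in the table; the CSV file name is a placeholder for whatever the Kaggle download provides.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("software_defects_1k.csv")  # placeholder file name
    features = ["token_count", "num_ifs", "num_returns", "num_func_calls",
          "ast_nodes", "lines_of_code", "cyclomatic_complexity"]
    X_train, X_test, y_train, y_test = train_test_split(
      df[features], df["defect"], test_size=0.2, random_state=42, stratify=df["defect"])

    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))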

    📎 License

    This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.

  8. HUN GW Uncertainty Analysis v01

    • cloud.csiss.gmu.edu
    • researchdata.edu.au
    • +2more
    zip
    Updated Dec 13, 2019
    Cite
    Australia (2019). HUN GW Uncertainty Analysis v01 [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a
    Available download formats: zip
    Dataset updated
    Dec 13, 2019
    Dataset provided by
    Australia
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.

    References:

    Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.

    Dataset History

    This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).

    A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).

    R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file contains, for each parameter, whether it is included in the sensitivity analysis, whether it is tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.

    The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv, HUN_GW_DoE_mean_BL_BF_hist.csv which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observed groundwater levels and SW-GW fluxes respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).

    Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction; the name of the prediction, transformation, min, max and median of design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.

    Spreadsheet HUN_GW_Observations.csv has additional information on each observation; the name of the observation, a boolean to indicate to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value and the number of observations at this location (from dataset HUN bores v01). Further it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux. The observed values are from dataset HUN Groundwater Flowrate Time Series v01

    These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py and HUN_GW_mean_BF_hist_SI.csv.

    Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function which is a weighted sum of the residuals between observations and predictions with weights based on the distance between observation and prediction. In addition to that there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.

    The latter files are used in scripts HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology as described in Herron et al (2016) by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These files are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.

    The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.

    The posterior parameter distributions are sampled with scripts HUN_GW_dmax_tmax_MCsampler.R and the associated .slurm batch file. The script creates and applies an emulator for each prediction. The emulator and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and associated output is included for illustrative purposes.

    Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. This spreadsheet contains for all predictions the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and the threshold of the objective function and the acceptance rate.

    The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are, for one prediction, different parameter distributions, in which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv.

    Dataset Citation

    Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.

    Dataset Ancestors

  9. DNP3 Intrusion Detection Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 15, 2024
    Cite
    Panagiotis (2024). DNP3 Intrusion Detection Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7348493
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Thomas
    Panagiotis
    Vasiliki
    Vasileios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Introduction

    In the digital era of the Industrial Internet of Things (IIoT), the conventional Critical Infrastructures (CIs) are transformed into smart environments with multiple benefits, such as pervasive control, self-monitoring and self-healing. However, this evolution is characterised by several cyberthreats due to the necessary presence of insecure technologies. DNP3 is an industrial communication protocol which is widely adopted in the CIs of the US. In particular, DNP3 allows the remote communication between Industrial Control Systems (ICS) and Supervisory Control and Data Acquisition (SCADA). It can support various topologies, such as Master-Slave, Multi-Drop, Hierarchical and Multiple-Server. Initially, the architectural model of DNP3 consisted of three layers: (a) Application Layer, (b) Transport Layer and (c) Data Link Layer. However, DNP3 can now be incorporated into the Transmission Control Protocol/Internet Protocol (TCP/IP) stack as an application-layer protocol. However, similarly to other industrial protocols (e.g., Modbus and IEC 60870-5-104), DNP3 is characterised by severe security issues since it does not include any authentication or authorisation mechanisms. More information about the DNP3 security issues is provided in [1-3]. This dataset contains labelled Transmission Control Protocol (TCP) / Internet Protocol (IP) network flow statistics (Comma-Separated Values - CSV format) and DNP3 flow statistics (CSV format) related to 9 DNP3 cyberattacks. These cyberattacks are focused on DNP3 unauthorised commands and Denial of Service (DoS). The network traffic data are provided through Packet Capture (PCAP) files. Consequently, this dataset can be used to implement Artificial Intelligence (AI)-powered Intrusion Detection and Prevention Systems (IDPS) that rely on Machine Learning (ML) and Deep Learning (DL) techniques.

    2. Instructions

    This DNP3 Intrusion Detection Dataset was implemented following the methodological frameworks of A. Gharib et al. in [4] and S. Dadkhah et al. in [5], including eleven features: (a) Complete Network Configuration, (b) Complete Traffic, (c) Labelled Dataset, (d) Complete Interaction, (e) Complete Capture, (f) Available Protocols, (g) Attack Diversity, (h) Heterogeneity, (i) Feature Set and (j) Metadata.

    A network topology consisting of (a) eight industrial entities, (b) one Human Machine Interface (HMI) and (c) three cyberattackers was used to implement this DNP3 Intrusion Detection Dataset. In particular, the following cyberattacks were implemented.

    On Thursday, May 14, 2020, the DNP3 Disable Unsolicited Messages Attack was executed for 4 hours.

    On Friday, May 15, 2020, the DNP3 Cold Restart Message Attack was executed for 4 hours.

    On Friday, May 15, 2020, the DNP3 Warm Restart Message Attack was executed for 4 hours.

    On Saturday, May 16, 2020, the DNP3 Enumerate Attack was executed for 4 hours.

    On Saturday, May 16, 2020, the DNP3 Info Attack was executed for 4 hours.

    On Monday, May 18, 2020, the DNP3 Initialisation Attack was executed for 4 hours.

    On Monday, May 18, 2020, the Man In The Middle (MITM)-DoS Attack was executed for 4 hours.

    On Monday, May 18, 2020, the DNP3 Replay Attack was executed for 4 hours.

    On Tuesday, May 19, 2020, the DNP3 Stop Application Attack was executed for 4 hours.

    The aforementioned DNP3 cyberattacks were executed, utilising penetration testing tools, such as Nmap and Scapy. For each attack, a relevant folder is provided, including the network traffic and the network flow statistics for each entity. In particular, for each cyberattack, a folder is given, providing (a) the pcap files for each entity, (b) the Transmission Control Protocol (TCP)/Internet Protocol (IP) network flow statistics for 120 seconds in a CSV format and (c) the DNP3 flow statistics for each entity (using different timeout values in terms of seconds, such as 45, 60, 75, 90, 120 and 240 seconds). The TCP/IP network flow statistics were produced by using the CICFlowMeter, while the DNP3 flow statistics were generated based on a Custom DNP3 Python Parser, taking full advantage of Scapy.

    3. Dataset Structure

    The dataset consists of the following folders:

    20200514_DNP3_Disable_Unsolicited_Messages_Attack: It includes the pcap and CSV files related to the DNP3 Disable Unsolicited Message attack.

    20200515_DNP3_Cold_Restart_Attack: It includes the pcap and CSV files related to the DNP3 Cold Restart attack.

    20200515_DNP3_Warm_Restart_Attack: It includes the pcap and CSV files related to DNP3 Warm Restart attack.

    20200516_DNP3_Enumerate: It includes the pcap and CSV files related to the DNP3 Enumerate attack.

    20200516_DNP3_Ιnfo: It includes the pcap and CSV files related to the DNP3 Info attack.

    20200518_DNP3_Initialize_Data_Attack: It includes the pcap and CSV files related to the DNP3 Data Initialisation attack.

    20200518_DNP3_MITM_DoS: It includes the pcap and CSV files related to the DNP3 MITM-DoS attack.

    20200518_DNP3_Replay_Attack: It includes the pcap and CSV files related to the DNP3 replay attack.

    20200519_DNP3_Stop_Application_Attack: It includes the pcap and CSV files related to the DNP3 Stop Application attack.

    Training_Testing_Balanced_CSV_Files: It includes balanced CSV files from CICFlowMeter and the Custom DNP3 Python Parser that could be utilised for training ML and DL methods. Each folder includes different sub-folders for the corresponding flow timeout values used by the DNP3 Python Custom Parser. For CICFlowMeter, only the timeout value of 120 seconds was used.

    Each folder includes respective subfolders related to the entities/devices (described in the following section) participating in each attack. In particular, for each entity/device, there is a folder including (a) the DNP3 network traffic (pcap file) related to this entity/device during each attack, (b) the TCP/IP network flow statistics (CSV file) generated by CICFlowMeter for the timeout value of 120 seconds and finally (c) the DNP3 flow statistics (CSV file) from the Custom DNP3 Python Parser. Finally, it is noteworthy that the network flows from both CICFlowMeter and Custom DNP3 Python Parser in each CSV file are labelled based on the DNP3 cyberattacks executed for the generation of this dataset. The description of these attacks is provided in the following section, while the various features from CICFlowMeter and Custom DNP3 Python Parser are presented in Section 5.
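
    A minimal sketch (assuming hypothetical file paths and a "Label" column name, since the exact CSV schema is not reproduced in this listing) for loading the labelled flow-statistics CSVs with pandas before training ML/DL models:

    import glob
    import pandas as pd

    # folder name taken from the dataset structure above; label column name assumed
    paths = glob.glob("Training_Testing_Balanced_CSV_Files/**/*.csv", recursive=True)
    flows = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    print(flows.shape)
    print(flows["Label"].value_counts())  # assumed label column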

    4. Testbed & DNP3 Attacks

    The following figure (not reproduced in this listing) shows the testbed utilised for the generation of this dataset. It is composed of eight industrial entities that play the role of the DNP3 outstations/slaves, such as Remote Terminal Units (RTUs) and Intelligent Electronic Devices (IEDs). Moreover, there is another workstation which plays the role of the master station, like a Master Terminal Unit (MTU). For the communication between the DNP3 outstations/slaves and the master station, opendnp3 was used.

    Table 1: DNP3 Attacks Description

    • DNP3 Disable Unsolicited Message Attack: This attack targets a DNP3 outstation/slave, establishing a connection with it while acting as a master station. The false master then transmits a packet with the DNP3 Function Code 21, which requests to disable all the unsolicited messages on the target. (Folder: 20200514_DNP3_Disable_Unsolicited_Messages_Attack)

    • DNP3 Cold Restart Attack: The malicious entity acts as a master station and sends a DNP3 packet that includes the “Cold Restart” function code. When the target receives this message, it initiates a complete restart and sends back a reply with the time window before the restart process. (Folder: 20200515_DNP3_Cold_Restart_Attack)

    • DNP3 Warm Restart Attack: This attack is quite similar to the “Cold Restart Message”, but aims to trigger a partial restart, re-initiating a DNP3 service on the target outstation. (Folder: 20200515_DNP3_Warm_Restart_Attack)

    • DNP3 Enumerate Attack: This reconnaissance attack aims to discover which DNP3 services and function codes are used by the target system. (Folder: 20200516_DNP3_Enumerate)

    • DNP3 Info Attack: This attack constitutes another reconnaissance attempt, aggregating various DNP3 diagnostic information related to the DNP3 usage. (Folder: 20200516_DNP3_Ιnfo)

    • Data Initialisation Attack: This cyberattack is related to Function Code 15 (Initialize Data). It is an unauthorised access attack, which demands that the slave re-initialise possible configurations to their initial values, thus changing potential values defined by legitimate masters. (Folder: 20200518_Initialize_Data_Attack)

    • MITM-DoS Attack: In this cyberattack, the cyberattacker is placed between a DNP3 master and a DNP3 slave device, dropping all the messages coming from the DNP3 master or the DNP3 slave. (Folder: 20200518_MITM_DoS)

    • DNP3 Replay Attack: This cyberattack replays DNP3 packets coming from a legitimate DNP3 master or DNP3 slave. (Folder: 20200518_DNP3_Replay_Attack)

    • DNP3 Stop Application Attack: This attack is related to Function Code 18 (Stop Application) and demands that the slave stop its function so that the slave cannot receive messages from the master. (Folder: 20200519_DNP3_Stop_Application_Attack)

    5. Features

    The TCP/IP network flow statistics generated by CICFlowMeter are summarised below. The TCP/IP network flows and their statistics generated by CICFlowMeter are labelled based on the DNP3 attacks described above, thus allowing the training of ML/DL models. Finally, it is worth mentioning that these statistics are generated when the flow timeout value is equal to 120 seconds.

    Table

  10. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Explore at: Croissant
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, and visualize data distributions, correlations and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original Github repo


    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)  # 'data' is your own raw data source
    
    # klib.describe functions for visualizing datasets
    klib.cat_plot(df)       # returns a visualization of the number and frequency of categorical features
    klib.corr_mat(df)       # returns a color-encoded correlation matrix
    klib.corr_plot(df)      # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)      # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)  # returns a figure containing information about missing values
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  11. In-vitro dataset for classification and regression of stenosis: dependence...

    • zenodo.org
    zip
    Updated Nov 12, 2022
    Cite
    Stefan Bernhard; Michelle Wisotzki; Alexander Mair; Stefan Bernhard; Michelle Wisotzki; Alexander Mair (2022). In-vitro dataset for classification and regression of stenosis: dependence on heart rate, waveform and location [Dataset]. http://doi.org/10.5281/zenodo.6421498
    Available download formats: zip
    Dataset updated
    Nov 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stefan Bernhard; Michelle Wisotzki; Alexander Mair; Stefan Bernhard; Michelle Wisotzki; Alexander Mair
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    This data supplements the paper "Classification and regression of stenosis using an in-vitro pulse wave dataset:
    dependence on heart rate, waveform and location". It was created at Technische Hochschule Mittelhessen (THM) in Germany and uploaded to Zenodo. Please cite the paper and the Zenodo doi when using this dataset.

    General description / Dataset structure

    Each mat-file describes a different measurement (details can be found in the paper). There are 17 pressure signals for different positions, one flow sensor close to the stenosis location and one monitor signal of the proportional valve used to control the input curve. The total duration of each signal is 60 s with a sampling rate of 1000 Hz. Each mat-file contains a header structure with metadata and a struct array with the signals of each sensor. Signals in each mat-file are aligned with respect to a common time axis, but this is not guaranteed between different measurements/files. We did our best to make the beginnings and endings align as closely as possible (by removing buffer artefacts and aligning the input signal of the monitor); however, algorithms should not rely on a global time axis. This is similar to patient measurements without an ECG, which also do not share a global time axis comparable among patients.

    The files can be loaded directly in Matlab, or in Python with scipy's loadmat function.
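
    A minimal sketch for Python, assuming a hypothetical file path taken from the folder structure shown below; the names of the header and signal structs inside the mat-file are not reproduced here, so only the keys are inspected.

    from scipy.io import loadmat

    # path assumed from the folder structure described below
    mat = loadmat("No Stenosis/HR 50/WaveForm1.mat", squeeze_me=True, struct_as_record=False)
    print(mat.keys())  # inspect the header structure and the signal struct array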

    The data is structured first by stenosis "state" (or location), then by heart rate and then by heart waveform. The stenosis "states" can be divided into a subset of 10 folders created for regression and 6 created for classification. Excerpt of the folder structure:

    • No Stenosis
      • HR 50
        • WaveForm1.mat
        • WaveForm2.mat
        • ...
      • HR 55
        • ...
      • ...
    • Regression - Stenosis at Pos01
      • HR 50
        • ...
      • ...
    • ...

    The tools also available at this page help with traversing this folder structure and are available for Python and Matlab.

    Data Fields of each file

    headerStruct
    • id: internal database id
    • name: stenosis location
    • rate: sampling rate in Hz
    • description: definition of automatic parameter sweep range
    • configuration: concrete parameters of the trapezoidal input curve (offset and amplitude in mmHg, ascend and descend times and smoothing window as a fraction of the time period (1.2 s))

    signalStruct
    • nodeId: corresponds to the numbered nodes at which the sensor is placed; the corresponding location can be found in the technical paper describing the MACSim simulator (node numbering, not sensor numbers) or in the software SISCA in the example database.
    • type: 'p' ... pressure or 'q' ... flow
    • data: double array, time series of each sensor, unit mmHg for type 'p' and ml/s for type 'q'
    • anatomicalPosition: name of the corresponding anatomical position

    Tools:

    These tools should make it easier to load the dataset. Their usage is documented in the respective code files.

    Code for the publication is available here:
    https://gitlab.com/agbernhard.lse.thm/publication_macsim_machinelearning

  12. Texas Synthetic Power System Test Case (TX-123BT).zip

    • figshare.com
    zip
    Updated Mar 8, 2024
    Cite
    Jin Lu; Xingpeng Li (2024). Texas Synthetic Power System Test Case (TX-123BT).zip [Dataset]. http://doi.org/10.6084/m9.figshare.22144616.v6
    Available download formats: zip
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    figshare
    Authors
    Jin Lu; Xingpeng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Texas
    Description

    The dataset of the synthetic Texas 123-bus backbone transmission (TX-123BT) system. The procedures and details to create the TX-123BT system are described in the paper below: Jin Lu, Xingpeng Li et al., “A Synthetic Texas Backbone Power System with Climate-Dependent Spatio-Temporal Correlated Profiles”. If you use this dataset in your work, please cite the paper above.

    *** Introduction:
    The TX-123BT system has similar temporal and spatial characteristics as the actual Electric Reliability Council of Texas (ERCOT) system. The TX-123BT system has a backbone network consisting of only high-voltage transmission lines distributed in the Texas territory. It includes time series profiles of renewable generation, electrical load, and transmission thermal limits for 5 years from 2017 to 2021. The North American Land Data Assimilation System (NLDAS) climate data is extracted and used to create the climate-dependent time series profiles mentioned above. Two sets of climate-dependent dynamic line rating (DLR) profiles are created: (i) daily DLR and (ii) hourly DLR.

    *** Power system configuration data:
    'Bus_data.csv': Bus data including bus name and location (longitude & latitude, weather zone).
    'Line_data.csv': Line capacity and terminal bus information.
    'Generator_data.xlsx':
      'Gen_data' sheet: Generator parameters including active/reactive capacity, fuel type, cost and ramping rate.
      'Solar Plant Number' sheet: Correspondence between the solar plant number and generator number.
      'Wind Plant Number' sheet: Correspondence between the wind plant number and generator number.

    *** Time series profiles:
    'Climate_5y' folder: Includes each day's climate data for solar radiation, air temperature, and wind speed near the surface at 10 meter height. Each file in the folder includes the hourly temperature, longwave & shortwave solar radiation, and zonal & meridional wind speed data of a day in 2019.
    'Hourly_line_rating_5y' folder: Includes the hourly dynamic line rating for each day in the year. Each file includes the hourly line rating (MW) of a line for all hours in the year. In each file, columns represent hours 1-24 in a day, rows represent days 1-365 in the year.
    'Daily_line_rating_5y' folder: The daily dynamic line rating (MW) for all lines and all days in the year.
    'solar_5y' folder: Solar production for all the solar farms in the TX-123BT and for all the days in the year. Each file includes the hourly solar production (MW) of all the solar plants for a day in the year. In each file, columns represent hours 1-24 in a day, rows represent solar plants 1-72.
    'wind_5y' folder: Wind production for all the wind farms in the case and for all the days in the year. Each file includes the hourly wind production (MW) of all the wind plants for a day in the year. In each file, columns represent hours 1-24 in a day, rows represent wind plants 1-82.
    'load_5y' folder: Includes each day's hourly load data on all the buses. Each file includes the hourly nodal loads (MW) of all the buses in a day in the year. In each file, columns represent buses 1-123, rows represent hours 1-24 in a day.

    *** Python codes to run security-constrained unit commitment (SCUC) for TX-123BT profiles
    Recommended Python version: Python 3.11. Required packages: Numpy, pyomo, pypower, pickle. Requires a solver that can be called by pyomo to solve the SCUC optimization problem.

    * 'Sample_Codes_SCUC' folder: A standard SCUC model. The load, solar generation, and wind generation profiles are provided by the 'load_annual', 'solar_annual', and 'wind_annual' folders. The daily line rating profiles are provided by 'Line_annual_Dmin.txt'.
    'power_mod.py': defines the python class for the power system.
    'UC_function.py': defines functions to build, solve, and save results for the pyomo SCUC model.
    'formpyomo_UC': defines the function to create the input file for the pyomo model.
    'Run_SCUC_annual': run this file to perform SCUC simulation on the selected days of the TX-123BT profiles.
    Steps to run SCUC simulation:
    1) Set up the python environment.
    2) Set the solver location: 'UC_function.py' => 'solve_UC' function => UC_solver=SolverFactory('solver_name',executable='solver_location')
    3) Set the days you want to run SCUC: 'Run_SCUC_annual.py' => last row: run_annual_UC(case_inst,start_day,end_day). For example, to run SCUC simulations for the 125th-146th days in 2019, the last row of the file is 'run_annual_UC(case_inst,125,146)'. You can also run a single day's SCUC simulation by using 'run_annual_UC(case_inst,single_day,single_day)'.

    * 'Sample_Codes_SCUC_HourlyDLR' folder: The SCUC model considering hourly dynamic line rating (DLR) profiles. The load, solar generation, and wind generation profiles are provided by the 'load_annual', 'solar_annual', and 'wind_annual' folders. The hourly line rating profiles in 2019 are provided by the 'dynamic_rating_result' folder.
    'power_mod.py': defines the python class for the power system.
    'UC_function_DLR.py': defines functions to build, solve, and save results for the pyomo SCUC model (with hourly DLR).
    'formpyomo_UC': defines the function to create the input file for the pyomo model.
    'RunUC_annual_dlr': run this file to perform SCUC simulation (with hourly DLR) on the selected days of the TX-123BT profiles.
    Steps to run SCUC simulation (with hourly DLR):
    1) Set up the python environment.
    2) Set the solver location: 'UC_function_DLR.py' => 'solve_UC' function => UC_solver=SolverFactory('solver_name',executable='solver_location')
    3) Set the daily profiles for SCUC simulation: 'RunUC_annual_dlr.py' => last row: run_annual_UC_dlr(case_inst,start_day,end_day). For example, to run SCUC simulations (with hourly DLR) for the 125th-146th days in 2019, the last row of the file is 'run_annual_UC_dlr(case_inst,125,146)'. You can also run a single day's SCUC simulation (with hourly DLR) by using 'run_annual_UC_dlr(case_inst,single_day,single_day)'.

    The SCUC / SCUC-with-DLR simulation results are saved in the 'UC_results' folders under the corresponding folder. Under the 'UC_results' folder:
    'UCcase_Opcost.txt': total operational cost ($).
    'UCcase_pf.txt': the power flow results (MW). Rows represent lines, columns represent hours.
    'UCcase_pfpct.txt': the percentage of the power flow to the line capacity (%). Rows represent lines, columns represent hours.
    'UCcase_pgt.txt': the generator output power (MW). Rows represent conventional generators, columns represent hours.
    'UCcase_lmp.txt': the locational marginal price ($/MWh). Rows represent buses, columns represent hours.

    *** Geographic information system (GIS) data:
    'Texas_GIS_Data' folder: includes the geographic information system (GIS) data of the TX-123BT system configurations and ERCOT weather zones. The GIS data can be viewed and edited using GIS software: ArcGIS. The subfolders are:
    'Bus' folder: the shapefile of bus data for the TX-123BT system.
    'Line' folder: the shapefile of line data for the TX-123BT system.
    'Weather Zone' folder: the shapefile of the weather zones in the Electric Reliability Council of Texas (ERCOT).

    *** Maps (pictures) of the TX-123BT & ERCOT weather zones
    'Maps_TX123BT_WeatherZone' folder:
    1) 'TX123BT_Noted.jpg': The map (picture) of the TX-123BT transmission network. Buses are in blue and lines are in green.
    2) 'Area_Houston_Noted.jpg', 'Area_Dallas_Noted.jpg', 'Area_Austin_SanAntonio_Noted.jpg': Maps for different areas including Houston, Dallas, and Austin-San Antonio are also provided.
    3) 'Weather_Zone.jpg': The map of ERCOT weather zones. It was plotted by the author and may be slightly different from the actual ERCOT weather zones.

    *** Funding
    This project is supported by the Alfred P. Sloan Foundation.

    *** License:
    This work is licensed under the terms of the Creative Commons Attribution 4.0 (CC BY 4.0) license.

    *** Disclaimer:
    The author doesn't make any warranty for the accuracy, completeness, or usefulness of any information disclosed, and the author assumes no liability or responsibility for any errors or omissions in the information (data/code/results etc.) disclosed.

    *** Contributions:
    Jin Lu created this dataset. Xingpeng Li supervised this work. Hongyi Li and Taher Chegini provided the raw historical climate data (extracted from an open-access dataset - NLDAS).
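
    A minimal sketch for inspecting the configuration CSVs named above with pandas; the column names are not reproduced in this listing, so only the shapes are printed.

    import pandas as pd

    buses = pd.read_csv("Bus_data.csv")   # bus name, longitude/latitude, weather zone
    lines = pd.read_csv("Line_data.csv")  # line capacity and terminal bus information
    print(buses.shape, lines.shape)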

  13. Namoi groundwater uncertainty analysis

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 20, 2019
    Cite
    Bioregional Assessment Program (2019). Namoi groundwater uncertainty analysis [Dataset]. https://data.gov.au/data/dataset/groups/36bd27e9-58d2-4bf2-8e4a-54b22ac98cfb
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    Area covered
    Namoi River
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    The dataset contains the predictions of maximum drawdown and time to maximum drawdown at all groundwater model nodes in the Namoi subregion, constrained by the observations of groundwater level, river flux and mine water production rates. The dataset also contains the scripts required for and the results of the sensitivity analysis. The dataset contains all the scripts to generate these results from the outputs of the groundwater model (Namoi groundwater model dataset) and all the spreadsheets with the results. The methodology and results are described in Janardhanan et al. (2017)

    References

    Janardhanan S, Crosbie R, Pickett T, Cui T, Peeters L, Slatter E, Northey J, Merrin LE, Davies P, Miotlinski K, Schmid W and Herr A (2017) Groundwater numerical modelling for the Namoi subregion. Product 2.6.2 for the Namoi subregion from the Northern Inland Catchments Bioregional Assessment. Department of the Environment and Energy, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia., http://data.bioregionalassessments.gov.au/product/NIC/NAM/2.6.2.

    Dataset History

    The workflow that underpins this dataset is captured in 'NAM_MF_UA_workflow.png'.

    Spreadsheet NAM_MF_dmax_Predictions_all.csv is sourced from dataset 'Namoi groundwater model' and contains the name, coordinates, Bore_ID in the model, layer number, the name of the objective function and the minimum, maximum, median, 5th percentile and 95th percentile of the design of experiment runs of maximum drawdown (dmax) for each groundwater model node. The individual results for each node for each run of the design of experiment is stored in spreadsheet 'NAM_MF_dmax_DoE_Predictions_all.csv' The equivalent files for time to maximum drawdown (tmax) are 'NAM_MF_tmax_Predictions_all.csv' and 'NAM_MF_tmax_DoE_Predictions_all.csv'.

    These files are combined with the file 'NAM_MF_Observations_all.csv', which contains the observed values for groundwater levels, mine dewatering rates and river flux, and the files NAM_MF_dist_hobs.csv, NAM_MF_dist_rivers.csv, NAM_MF_dist_mines.csv, which contain the distances of the predictions to each mine, groundwater level observation and river, in python script 'NAM_MF_datawranling.csv'. This script selects only those predictions where the 95th percentile of dmax is less than 1 cm for further analysis. The subset of predictions is stored in 'NAM_MF_dmax_Predictions.csv','NAM_MF_tmax_Predictions.csv', 'NAM_MF_dmax_DoE_Predictions.csv','NAM_MF_tmax_DoE_Predictions.csv'. The output spreadsheet 'NAM_MF_Observations.csv' has the observations and the distances to the selected predictions.

    As the simulated equivalents to the observations are part of the predictions dataset, these files are combined in python script NAM_MF_OFs.py to generate the objective function values for each run and each prediction. The objective function values are weighted sums of the residuals, stored in NAM_MF_DoE_hres.csv, NAM_MF_DoE_mres.csv, NAM_MF_DoE_rres.csv, according to the distance to the predictions and the results are stored in NAM_MF_DoE_OFh.csv, NAM_MF_DoE_OFm.csv, NAM_MF_DoE_OFr.csv. The threshold values for each objective function and prediction are stored in NAM_MF_OF_thresholds.csv. Python script NAM_MF_OF_wrangling.py further post-processes this information to generate the acceptance rates, saved in spreadsheet NAM_MF_dmax_Predictions_ARs.csv

    Python script NAM_MF_CreatePosterior.py selects the results from the design of experiment run that satisfy the acceptance criteria. The results form the posterior predictive distributions stored in NAM_MF_dmax_Posterior.csv and NAM_MF_tmax_Posterior.csv. These are further summarised in NAM_MF_Predictions_summary.csv.

    The sensitivity analysis is done with script NAM_MF_SI.py, which uses the results of the design of experiment together with the parameter values, stored in NAM_MF_DoE_Parameters.csv and their description (name, range, transform) in NAM_MF_Parameters.csv. The resulting sensitivity indices for dmax, tmax and river, head and minewater flow observations are stored in NAM_MF_SI_dmax.csv, NAM_MF_SI_tmax.csv, NAM_MF_SI_river.csv, NAM_MF_SI_mine.csv and NAM_MF_SI_head.csv. The intermediate files, ending in xxxx, are the results grouped per 100 predictions. The scripts NAM_MF_SI_collate.py and NAM_MF_SI_collate.slurm collate these.

    Dataset Citation

    Bioregional Assessment Programme (2017) Namoi groundwater uncertainty analysis. Bioregional Assessment Derived Dataset. Viewed 11 December 2018, http://data.bioregionalassessments.gov.au/dataset/36bd27e9-58d2-4bf2-8e4a-54b22ac98cfb.

    Dataset Ancestors

  14. Magicoder-Evol-Instruct-110K-python

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    pxy (2024). Magicoder-Evol-Instruct-110K-python [Dataset]. https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python
    Explore at: Croissant
    Dataset updated
    Nov 17, 2024
    Authors
    pxy
    Description

    Dataset Card for "Magicoder-Evol-Instruct-110K-python"

    from datasets import load_dataset

    # Load your dataset
    dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train")  # Replace with your dataset and split

    # Define a filter function
    def contains_python(entry):
      for c in entry["messages"]:
        if "python" in c["content"].lower():
          return True
      return False
      # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search

    … See the full description on the dataset page: https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python.

  15. Niko Chord Progression Dataset

    • paperswithcode.com
    Updated Aug 31, 2022
    Cite
    Li Yi; Haochen Hu; Jingwei Zhao; Gus Xia (2022). Niko Chord Progression Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/niko-chord-progression-dataset
    Dataset updated
    Aug 31, 2022
    Authors
    Li Yi; Haochen Hu; Jingwei Zhao; Gus Xia
    Description

    Introduction: The Niko Chord Progression Dataset is used in AccoMontage2. It contains 5k+ chord progression pieces, labeled with styles. There are four styles in total: Pop Standard, Pop Complex, Dark and R&B. Some progressions have an 'Unknown' style. Some statistics are provided below.

    • Note Pitch: mean 57, variance 167.70
    • Note Velocity: mean 79.05, variance 457.89
    • Note Duration (in seconds): mean 1.38, variance 1.62

    Data Formats: You can access the Niko Chord Progression Dataset in two formats: MIDI format and the quantized note matrix format.

    MIDI (dataset.zip) Each chord progression piece is stored as a single MIDI file.

    Quantized Note Matrix (dataset.pkl) A python dictionary with a format like the following. nmat is a 2-d matrix; each row represents a quantized note: [start, end, pitch, velocity]. Each note is quantized at the eighth-note level, e.g., start=2 means the note begins at the third eighth note. root is also a 2-d matrix. It labels the roots of the chords using an eighth-note sample rate. Each row of root represents a bar. Each element is an integer ranging from 0 (C note) to 11 (B note).

    {'piece name': 
      {'nmat': [[0, 3, 60, 60], ...],  # 2-d matrix: note matrix
       'root': [[0,0,0,0,0,0,0,0], ...], # 2-d matrix: root label
       'style': 'some style',      # pop_standard, pop_complex, dark, r&b, unknown
       'mode': 'some mode',       # M, m
       'tonic': 'some tonic'       # C, Db, ... B
      }, 
     ...
    }
    
    # load the dataset using pickle
    import pickle
    with open('dataset_path_and_name.pkl', 'rb') as file:
      dataset = pickle.load(file)
    
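    A short usage sketch, assuming the dictionary has been loaded as above (the piece name is a placeholder):

    # Iterate over the quantized notes of one piece; rows are [start, end, pitch, velocity],
    # with start and end given as eighth-note indices.
    piece = dataset['some piece name']  # placeholder key; use a real piece name from the dictionary
    for start, end, pitch, velocity in piece['nmat']:
        duration_in_eighths = end - start
    print(piece['style'], piece['mode'], piece['tonic'])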

    Supplementary description

    Original Dataset: The Niko Chord Progression Dataset is a re-organized version of the original Niko Dataset. The original Niko Dataset had duplicate progressions and unnecessary labels; it was therefore processed and converted into this version.

    Style Mapping: The style labels were mapped from the original dataset to the new dataset. In the original dataset the style label is stored as folder names, so the style can be obtained from the file path. The following shows a detailed description of the style mapping.

    // Structure of the original dataset
    .
    ├─A Major - F# Minor          ---> progressions are sorted based on tonics and modes
    │ ├─1 - Best Melodies         ---> eliminated
    │ │ ├─Catchy
    │ │ ├─Dark_HipHop_Trap
    │ │ ├─EDM
    │ │ ├─Emotional
    │ │ ├─Pop
    │ │ └─R&B_Neosoul
    │ ├─2 - Best Chords
    │ │ ├─Dark_HipHop_Trap        ---> New style: Dark
    │ │ ├─EDM
    │ │ │ ├─Classy_7th_9th        ---> New style: Pop Complex
    │ │ │ ├─Emotional             ---> New style: Pop Complex
    │ │ │ └─Standard              ---> New style: Pop Standard
    │ │ ├─Emotional               ---> New style: Pop Complex
    │ │ ├─Pop
    │ │ │ ├─Classy_7th_9th        ---> New style: Pop Complex
    │ │ │ ├─Emotional             ---> New style: Pop Complex
    │ │ │ └─Standard              ---> New style: Pop Standard
    │ │ └─R&B_Neosoul             ---> New style: R&B
    │ └─3 - Rest Of Pack
    │   ├─A-Bm-D (I-ii-IV)        ---> progressions sorted based on root pattern
    │   │ ├─Arps                  ---> eliminated
    │   │ ├─Basslines             ---> eliminated
    │   │ ├─Chord Breakdown       ---> New style: Unknown
    │   │ ├─Chord Progression     ---> New style: Unknown
    │   │ ├─Epic Endings          ---> eliminated
    │   │ ├─Fast Chord Rhythm     ---> eliminated
    │   │ │ ├─Back & Forth
    │   │ │ └─Same Time
    │   │ ├─Melodies              ---> eliminated
    │   │ │ ├─115-130bpm
    │   │ │ ├─130-160bpm
    │   │ │ ├─160-180bpm
    │   │ │ └─90-115bpm
    │   │ └─Slow Chord Rhythm     ---> New style: Unknown
    ...

    Cite

    L. Yi, H. Hu, J. Zhao, and G. Xia, "AccoMontage2: A Complete Harmonization and Accompaniment Arrangement System", in Proceedings of the 23rd International Society for Music Information Retrieval Conference, Bengaluru, India, 2022.

    License

    MIT Licensed. Copyright © 2022 New York University Shanghai Music X Lab. All rights reserved.

  16. P

    PhysioNet Challenge 2020 Dataset

    • paperswithcode.com
    Updated Dec 30, 2020
    Cite
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna (2020). PhysioNet Challenge 2020 Dataset [Dataset]. https://paperswithcode.com/dataset/physionet-challenge-2020
    Explore at:
    Dataset updated
    Dec 30, 2020
    Authors
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna
    Description

    Data

    The data for this Challenge are from multiple sources:

    - CPSC Database and CPSC-Extra Database
    - INCART Database
    - PTB and PTB-XL Database
    - The Georgia 12-lead ECG Challenge (G12EC) Database
    - Undisclosed Database

    The first source is the public (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from the CPSC2018 is NOT the test data from the CPSC2018; the test data of the CPSC2018 is included in the final private database that has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) 12-lead ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.

    The second source set is the public dataset from St Petersburg INCART 12-lead Arrhythmia Database. This database consists of 74 annotated recordings extracted from 32 Holter records. Each record is 30 minutes long and contains 12 standard leads, each sampled at 257 Hz.

    The third source from the Physikalisch Technische Bundesanstalt (PTB) comprises two public databases: the PTB Diagnostic ECG Database and the PTB-XL, a large publicly available electrocardiography dataset. The first PTB database contains 516 records (male: 377, female: 139). Each recording was sampled at 1000 Hz. The PTB-XL contains 21,837 clinical 12-lead ECGs (male: 11,379 and female: 10,458) of 10 second length with a sampling frequency of 500 Hz.

    The fourth source is a Georgia database which represents a unique demographic of the Southeastern United States. This training set contains 10,344 12-lead ECGs (male: 5,551, female: 4,793) of 10 second length with a sampling frequency of 500 Hz.

    The fifth source is an undisclosed American database that is geographically distinct from the Georgia database. This source contains 10,000 ECGs (all retained as test data).

    All data is provided in WFDB format. Each ECG recording has a binary MATLAB v4 file (see page 27) for the ECG signal data and a text file in WFDB header format describing the recording and patient attributes, including the diagnosis (the labels for the recording). The binary files can be read using the load function in MATLAB and the scipy.io.loadmat function in Python; please see our baseline models for examples of loading the data. The first line of the header provides information about the total number of leads and the total number of samples or points per lead. The following lines describe how each lead was saved, and the last lines provide information on demographics and diagnosis. Below is an example header file A0001.hea:

    A0001 12 500 7500 05-Feb-2020 11:39:16
    A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
    A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
    A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
    A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
    A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
    A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
    A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
    A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
    A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
    A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
    A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
    A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
    
    Age: 74
    Sex: Male
    Dx: 426783006
    Rx: Unknown
    Hx: Unknown
    Sx: Unknown
    

    From the first line, we see that the recording number is A0001, and the recording file is A0001.mat. The recording has 12 leads, each recorded at 500 Hz sample frequency, and contains 7500 samples. From the next 12 lines, we see that each signal was written at 16 bits with an offset of 24 bits, the amplitude resolution is 1000 with units in mV, the resolution of the analog-to-digital converter (ADC) used to digitize the signal is 16 bits, and the baseline value corresponding to 0 physical units is 0. The first value of the signal, the checksum, and the lead name are included for each signal. From the final 6 lines, we see that the patient is a 74-year-old male with a diagnosis (Dx) of 426783006. The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.
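    A minimal sketch of reading one recording in Python, assuming (as in the Challenge baseline examples) that the MATLAB v4 file stores the signal matrix under the key 'val':

    import numpy as np
    from scipy.io import loadmat

    def load_recording(mat_path, hea_path):
        # One row per lead; values are the raw integers, to be scaled by the per-lead gain from the header.
        signal = np.asarray(loadmat(mat_path)["val"], dtype=float)
        with open(hea_path, "r") as f:
            header = f.read().splitlines()
        # First header line: record name, number of leads, sampling frequency, samples per lead.
        _, n_leads, fs, n_samples = header[0].split()[:4]
        return signal, int(n_leads), int(fs), int(n_samples), header

    # signal, n_leads, fs, n_samples, header = load_recording("A0001.mat", "A0001.hea")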

    Each ECG recording has one or more labels describing different types of abnormalities, given as SNOMED-CT codes. The full list of diagnoses for the challenge has been posted here as a 3-column CSV file: long-form description, corresponding SNOMED-CT code, abbreviation. Although these descriptions apply to all training data, there may be fewer classes in the test data, and in different proportions. However, every class in the test data will be represented in the training data.

  17. s

    Dataset for "Skyrmion states in thin confined polygonal nanostructures"

    • eprints.soton.ac.uk
    • data.niaid.nih.gov
    • +1more
    Updated Nov 27, 2017
    Cite
    Hovorka, Ondrej; Albert, Maximilian; Wang, Weiwei; Kluyver, Thomas; Carey, Rebecca; Fangohr, Hans; Pepper, Ryan Alexander; Vousden, Mark; Beg, Marijan; Cortes-Ortuno, David; Bisotti, Marc-Antonio (2017). Dataset for "Skyrmion states in thin confined polygonal nanostructures" [Dataset]. http://doi.org/10.5281/zenodo.1066792
    Explore at:
    Dataset updated
    Nov 27, 2017
    Dataset provided by
    Zenodo
    Authors
    Hovorka, Ondrej; Albert, Maximilian; Wang, Weiwei; Kluyver, Thomas; Carey, Rebecca; Fangohr, Hans; Pepper, Ryan Alexander; Vousden, Mark; Beg, Marijan; Cortes-Ortuno, David; Bisotti, Marc-Antonio
    Description

    This dataset provides micromagnetic simulation data collected from a series of computational experiments on the effects of polygonal system shape on the energy of different magnetic states in FeGe. The data here form the results of the study 'Skyrmion states in thin confined polygonal nanostructures'. The dataset is split into several directories.

    Data

    square-samples and triangle-samples: These directories contain final-state 'relaxed' magnetization fields for square and triangle samples respectively. The files within are organised into directories such that a sample of side length d = 40nm which was subjected to an applied field of 500mT is labelled d40b500. Within each directory are twelve VTK unstructured grid format files (with file extension ".vtu"). These can be viewed in a variety of programmes; as of the time of writing we recommend either ParaView or MayaVi. The twelve files correspond to twelve simulations for each sample, one for each of the twelve states from which the sample was relaxed. These are described in the paper which this dataset accompanies, but we note the labels are '0', '1', '2', '3', '4', 'h', 'u', 'r1', 'r2', 'r3', 'h2', 'h3', where:

    - 0-4 are incomplete to overcomplete skyrmions
    - h, h2 and h3 are helical states with different periodicities
    - r1-r3 are different random states
    - u is the uniform magnetisation

    The vtu files are labelled according to parameters used in the simulation. For example, a file labelled '160_10_3_0_u_wd000000.vtu' encodes that:

    - The simulation was of a sample with side length 160nm.
    - The simulation was of a sample of thickness 10nm.
    - The maximum length of an edge in the finite element mesh of the sample was 3nm.
    - The system was relaxed from the 'u' state.
    - 'wd' encodes that the simulation was performed with a full demagnetizing calculation.

    square-npys and triangle-npys: These directories contain computed information about each of the final states stored in square-samples and triangle-samples. This information is stored in NumPy npz files, and can be read in Python straightforwardly using the function numpy.load (a short reading sketch is given at the end of this description). Within each npz file there are 8 arrays, each with 12 elements:

    - 'E' - the total energy of the relaxed state.
    - 'E_exchange' - the exchange energy of the relaxed state.
    - 'E_demag' - the demagnetizing energy of the relaxed state.
    - 'E_dmi' - the Dzyaloshinskii-Moriya energy of the relaxed state.
    - 'E_zeeman' - the Zeeman energy of the relaxed state.
    - 'S' - the calculated skyrmion number of the relaxed state.
    - 'S_abs' - the calculated absolute skyrmion number; see the paper for calculation details.
    - 'm_av' - the computed normalised average magnetisation in the x, y and z directions for the relaxed state.

    The twelve elements correspond to the aforementioned twelve initial states, in the order given above.

    square-classified and triangle-classified: These directories contain a labelled dataset which gives details about the final state of each simulation. The files are stored as plain text and are labelled with the following structure (the meanings of which are defined in the paper which this dataset accompanies):

    - iSk - an incomplete skyrmion
    - Sk, or a number n followed by Sk - n skyrmions in the state
    - He - a helical state
    - Target - a target state

    The files contain the names of png files which are generated from the vtu files, in the format 'd_165b_350_2.png'. This example, if found in the 'Sk.txt' file, means that the sample which was 165nm in side length and which was relaxed under a field of 350mT from initial state 2 was found at equilibrium in a skyrmion state.

    Figures

    square-pngs and triangle-pngs: These directories contain pngs generated from the vtu files. These are included for convenience as they take several hours to generate. Each directory contains three subdirectories:

    - all-states: the simulation results from all samples, in the format 'd_165b_350_2.png', meaning the image is that of the 165nm side length sample relaxed under a 350mT field from initial state 2.
    - ground-state: the images which correspond to the lowest energy state found across all of the initial states. These are labelled as 'd_180b_50.png', such that the image in this file is the lowest energy state found from all twelve simulations of the 180nm side length sample under a 50mT field.
    - uniform-state: the images which correspond to the states relaxed only from the uniform state. These are labelled such that an image labelled 'd_55b_100.png' is the state found from relaxing a 55nm sample under a 100mT applied field.

    phase-diagrams: These are the generated phase diagrams which are found in the paper.

    scripts: This folder contains Python scripts which generate the png files mentioned above, and also the phase diagram figures for the paper this dataset accompanies. The scripts are labelled descriptively with what they do - for example, 'triangle-generate-png-all-states.py' loads vtu files and generates the png files. The exception is 'render.py', which provides functions used across multiple scripts. These scripts can be modified; for example, the function 'export_vector_field' has many options which can be adjusted to plot different components of the magnetization.

    In order to run the scripts reproducibly, we have provided a Makefile in the root directory which builds each component. To reproduce the figures yourself on a Linux system, ParaView must be installed. The Makefile has been tested on Ubuntu 16.04 with ParaView 5.0.1. In addition, a number of Python dependencies must be installed:

    - scipy >= 0.19.1
    - numpy >= 1.11.0
    - matplotlib == 1.5.2
    - pillow >= 3.1.2

    We have included a requirements.txt file which specifies these dependencies; they can be installed by running 'pip install -r requirements.txt' from the directory. Once all dependencies are installed, simply run the command 'make' from the shell to build the Docker image and generate the figures. Note the scripts take a long time to run; at the time of writing the runtime is on the order of several hours on a high-specification desktop machine. For convenience, we have therefore included the generated figures within the repository (as noted above). Note that for the versions used in the paper, adjustments were made after the generation of the figures (e.g. to add images of states within the metastability figure and to overlay boundaries in the phase diagrams). If you want to reproduce only the phase diagrams, and not the pngs, the command 'make phase-diagrams' will do so. This is the smallest part of the figure reproduction and takes around 5 minutes on a high-specification desktop.
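    As referenced above, a minimal sketch of reading one of the npz summary files (the exact file name inside square-npys is an assumption for illustration):

    import numpy as np

    data = np.load("square-npys/d40b500.npz")  # hypothetical name following the d{length}b{field} convention
    labels = ["0", "1", "2", "3", "4", "h", "u", "r1", "r2", "r3", "h2", "h3"]
    for i, label in enumerate(labels):
        # Total energy and skyrmion number of the state relaxed from each initial configuration.
        print(label, data["E"][i], data["S"][i])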

  18. 4

    Data from: Data and scripts underlying the publication: Quantifying the...

    • data.4tu.nl
    zip
    Updated May 28, 2025
    Cite
    Max van Mulken; J.A.J. (Jasper) Eikelboom (2025). Data and scripts underlying the publication: Quantifying the Spatial Scales of Animal Clusters Using Density Surfaces [Dataset]. http://doi.org/10.4121/61be5dd9-7880-48dc-bacf-36afbc3033ee.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Max van Mulken; J.A.J. (Jasper) Eikelboom
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    2014
    Area covered
    Description

    Supplementary scripts to the publication "Quantifying the spatial scales of animal clustering using Density Surfaces"


    We implement a method to quantify the degree of clustering of point-location data at different spatial scales, which uses Kernel Density Estimation to construct a density function from the underlying point-location data. We build upon this method to automatically detect cluster diameters using smoothing kernels that better represent the perception neighbourhood of animals. More details can be found in the manuscript.


    These scripts construct the artificial data sets and reproduce the results shown in the figures in the main text of the manuscript.


    data_generator.py

    This file contains the functions to construct the artificial data sets, as well as visualization tools to plot the point sets.

    Running the main() function:

    1. constructs all artificial data sets

    2. creates visualizations of all generated and real-life datasets, saves them as .pdf files, and shows them on-screen


    metric_calculator.py

    This file contains the functions to calculate the metric described in the manuscript, as well as to compute Ripley's K function and the Radial Distribution Function.

    Running the main() function:

    1. generates the metric functions for all artificial and real-life data sets

    2. creates visualizations of all generated metric functions, saves them as .pdf files, and shows them on-screen

    3. prints the found relevant spatial scales, and their metric values, in the terminal


    elephant.pickle

    This file contains the real-world dataset of elephant locations to be used in metric_calculator.py

    The original data was collected in March 2014 in the Tsavo National Parks, Kenya.

    We use a subset of the original data set, consisting of location data of 24 elephants obtained from an aerial image that was manually taken by human observers upon spotting the animals.

    The aerial image was manually processed into spatial data by placing a point on the approximate centre point of each animal in the image, and projected onto a 100x100 xy-plane.

    The data is serialized and de-serialized using the native Python package "pickle". The data format used by pickle is Python-specific.
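    A minimal sketch of inspecting the elephant data (the exact structure of the pickled object is an assumption; see metric_calculator.py for the intended usage):

    import pickle

    # Load the 24 elephant locations projected onto a 100x100 xy-plane.
    with open("elephant.pickle", "rb") as f:
        points = pickle.load(f)

    print(type(points), len(points))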


    To perform the experiments:

    1. Ensure you have a functioning Python3 installation.

    2. Install the required packages using pip:

    - numpy

    - matplotlib

    - scipy

    - scikit-learn

    3. Run the main() function in data_generator.py to generate the artificial datasets

    4. Run the main() function in metric_calculator.py to generate the metric functions and figures


  19. d

    CLM AWRA HRVs Uncertainty Analysis

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 19, 2019
    + more versions
    Cite
    Bioregional Assessment Program (2019). CLM AWRA HRVs Uncertainty Analysis [Dataset]. https://data.gov.au/data/dataset/e51a513d-fde7-44ba-830c-07563a7b2402
    Explore at:
    Dataset updated
    Nov 19, 2019
    Dataset provided by
    Bioregional Assessment Program
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).

    Dataset History

    File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts were written by the BA modelling team to read, combine and analyse the source datasets CLM AWRA model, CLM groundwater model V1 and CLM16swg Surface water gauging station data within the Clarence Moreton Basin, as detailed below, in order to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).

    R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:

    CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions

    CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)

    CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations

    Python script CLM_collate_DoE_Predictions.py collates that information into following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):

    CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)

    CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available

    CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)

    CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV

    R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).

    The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.
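    A minimal sketch of the kind of selection CLM-2615-001-top10.py performs; the column name and the assumption that lower objective-function values indicate better fits are illustrative, not taken from the script:

    import pandas as pd

    oo = pd.read_csv("CLM_AWRA_HRV_oo.csv")
    threshold = oo["objective_function"].quantile(0.10)      # hypothetical column name
    behavioural = oo[oo["objective_function"] <= threshold]  # keep the best-scoring 10% of simulations
    behavioural.to_csv("CLM_AWRA_HRV_behavioural.csv", index=False)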

    Dataset Citation

    Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.

    Dataset Ancestors

  20. Z

    Spiking Seizure Classification Dataset

    • data.niaid.nih.gov
    Updated Jan 13, 2025
    + more versions
    Cite
    Gallou, Olympia (2025). Spiking Seizure Classification Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10800793
    Explore at:
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Matthew, Cook
    Gallou, Olympia
    Bartels, Jim
    Sarnthein, Johannes
    Indiveri, Giacomo
    GHOSH, SAPTARSHI
    Ito, Hiroyuki
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset for event encoded analog EEG signals for detection of Epileptic seizures

    This dataset contains events that are encoded from the analog signals recorded during pre-surgical evaluations of patients at the Sleep-Wake-Epilepsy-Center (SWEC) of the University Department of Neurology at the Inselspital Bern. The analog signals are sourced from the SWEC-ETHZ iEEG Database

    This database contains event streams for 10 seizures recorded from 5 patients, generated by the DYnamic Neuromorphic Asynchronous Processor (DYNAP-SE2) to demonstrate a proof-of-concept of encoding seizures with network synchronization. The pipeline consists of two parts: (I) an Analog Front End (AFE) and (II) an SNN termed the "Non-Local Non-Global" (NLNG) network.

    In the first part of the pipeline, the digitally recorded signals from the SWEC-ETHZ iEEG Database are converted to analog signals via an 18-bit Digital-to-Analog Converter (DAC) and then amplified and encoded into events by an Asynchronous Delta Modulator (ADM). In the second part, the encoded event streams are fed into the SNN, which extracts features of the epileptic seizure by capturing the partially synchronous patterns intrinsic to the seizure dynamics.
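    As a rough illustration of the encoding idea only (the actual AFE is an analog circuit on the DYNAP-SE2, and the threshold here is an arbitrary assumption), a delta-modulator-style encoder can be sketched as:

    import numpy as np

    def adm_encode(signal, threshold):
        # Emit an UP (+1) or DOWN (-1) event whenever the signal moves more than
        # `threshold` away from the value at which the previous event was emitted.
        events = []
        ref = signal[0]
        for i, x in enumerate(signal[1:], start=1):
            if x - ref >= threshold:
                events.append((i, +1))
                ref = x
            elif ref - x >= threshold:
                events.append((i, -1))
                ref = x
        return events

    # Example: encode a sine wave with an arbitrary threshold of 0.1
    events = adm_encode(np.sin(np.linspace(0, 10, 1000)), threshold=0.1)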

    Details about the neuromorphic processing pipeline and the encoding process are included in a manuscript under review. The preprint is available in bioRxiv

    Installation

    The installation requires Python >= 3.x and conda (or py-venv). Users can then install the requirements inside a conda environment using

    conda env create -f requirements.txt -n sez

    Once created, the conda environment can be activated with 'conda activate sez'.

    The main files in the database are described in the hierarchy below.

    EventSezDataset/
    ├─ data/
    │ ├─ PxSx/
    │ │ ├─ Patx_Sz_x_CHx.csv
    ├─ LSVM_Params/
    │ ├─ opt_svm_params/
    │ ├─ pat_x_features_SYNCH/
    ├─ fig_gen.py
    ├─ sync_mat_gen.py
    ├─ SeizDetection_FR.py
    ├─ SeizDetection_SYNCH.py
    ├─ support.py
    ├─ run.sh
    ├─ requirements.txt

    where x represents the Patient ID and the Seizure ID respectively.

    requirements.txt: This file lists the requirements for the execution of the Python code.

    fig_gen.py: This file plots the analog signals and the associated AFE and NLNG event streams. The code is run with 'python fig_gen.py 1 1 13', which plots patient 1, seizure 1, channel 13 of the recording.

    sync_mat_gen.py: This file defines the function for plotting the synchronization matrices emerging from the ADM and the NLNG spikes with either a linear or a logarithmic colorbar. The code is run with 'python sync_mat_gen.py 1 1' or 'python sync_mat_gen.py 1 1 log'; this generates four figures for the pre-seizure, first half of seizure, second half of seizure, and post-seizure time periods, here for patient 1 and seizure 1. The third argument can either be left blank or set to lin or log, for the respective colorbar scales. The time is the signal time as described in the table below.

    run.sh: A simple Linux script to run the above code for all patients and seizures.

    SeizDetection_FR.py: This file runs the LSVM on the ADM and NLNG spikes, using the firing rate (FR) as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/opt_svm_params/ folder). Users can also use the code to train the LSVM with different parameters.

    SeizDetection_SYNCH.py: This file runs the LSVM on the kernelized ADM and NLNG spikes, using the flattened SYNCH matrices as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/pat_x_features_SYNCH/ folder). Users can also use the code to train the LSVM with different parameters.

    LSVM_Params: Folder containing LSVM features with different parameter combinations.

    support.py: This file contains the necessary functions.

    data/P1S1/: This folder, for example, contains the event streams for all channels for seizure 1 of patient 1.

    Pat1_Sz_1_CH1.csv: This file contains the spikes of the AFE and the NLNG layers with the following tabular format (which can be extracted by the fig_gen.py)

    Comments

    # SStart: 180      // Start of the seizure in signal time
    # SEnd: 276.0      // End of the seizure in signal time
    # Pid: 2           // The patient ID as per the SWEC-ETHZ iEEG Database
    # Sid: 1           // The Seizure ID as per the SWEC-ETHZ iEEG Database
    # Channel_No: 1    // The channel number

    SYS_time    - The time from the interface FPGA
    signal_time - The time of the signal as per the SWEC-ETHZ iEEG Database
    dac_value   - The value of the analog signal as recorded in the SWEC-ETHZ iEEG Database
    ADMspikes   - The event stream output by the AFE in boolean format; True represents a spike
    NLNGspikes  - The spike stream output by the SNN in boolean format; True represents a spike
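    A minimal sketch of loading one channel file with pandas (the comma-delimited layout and the '#' prefix on the comment lines are assumptions based on the description above):

    import pandas as pd

    df = pd.read_csv("data/P1S1/Pat1_Sz_1_CH1.csv", comment="#")
    adm_events = df[df["ADMspikes"].astype(str) == "True"]    # rows where the AFE emitted an event
    nlng_events = df[df["NLNGspikes"].astype(str) == "True"]  # rows where the NLNG network spiked
    print(len(adm_events), len(nlng_events))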
