58 datasets found
  1. SPP_30K_reasoning_tasks

    • huggingface.co
    Updated Aug 20, 2023
    + more versions
    Cite
    Farouk (2023). SPP_30K_reasoning_tasks [Dataset]. https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2023
    Authors
    Farouk
    Description

    Dataset Card for "SPP_30K_verified_tasks"

      Dataset Summary
    

    This is an augmented version of the Synthetic Python Problems (SPP) Dataset. It was generated from the subset of the data that was de-duplicated and verified using a Python interpreter (SPP_30k_verified.jsonl). The original dataset contains small Python functions that include a docstring with a short description of what the function does and some calling examples for the function. The current… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks.
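
    A minimal sketch for loading this dataset with the Hugging Face datasets library; the "train" split name is an assumption, so check the dataset page for the available splits.

    from datasets import load_dataset

    # split name assumed; see the dataset page for the actual splits
    spp = load_dataset("pharaouk/SPP_30K_reasoning_tasks", split="train")
    print(spp[0])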

  2. CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Dataset updated
    Dec 30, 2024
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes: * Six million methods overall * Two million of which have associated documentation (docstrings, JavaDoc, and more) * Metadata that indicates the original location (repository or line number, for example) where the data was found
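
    A minimal sketch for loading the Python subset through the Hugging Face datasets hub; the "code_search_net" dataset identifier and the "python" configuration name are assumptions not stated in this listing, since the corpus is also distributed as raw files.

    from datasets import load_dataset

    # dataset id and config assumed; adjust to however you obtain the corpus
    csn = load_dataset("code_search_net", "python", split="train")
    print(csn.column_names)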

  3. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see Step 2 detail levels)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  4. Python-codes

    • huggingface.co
    Updated Sep 13, 2023
    Cite
    Arjun G Ravi (2023). Python-codes [Dataset]. https://huggingface.co/datasets/Arjun-G-Ravi/Python-codes
    Explore at: Croissant
    Dataset updated
    Sep 13, 2023
    Authors
    Arjun G Ravi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    Please note that this dataset may not be perfect and may contain a very small quantity of non-Python code, but the quantity appears to be very small.

      Dataset Summary
    

    The dataset contains a collection of Python questions and their code. It is meant to be used for training models to be efficient in Python-specific coding. The dataset has two features - 'question' and 'code'. An example is: {'question': 'Create a function that takes in a string… See the full description on the dataset page: https://huggingface.co/datasets/Arjun-G-Ravi/Python-codes.
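
    A minimal sketch for loading this dataset and inspecting the 'question' and 'code' fields described above; the "train" split name is an assumption.

    from datasets import load_dataset

    # split name assumed; 'question' and 'code' are the documented features
    ds = load_dataset("Arjun-G-Ravi/Python-codes", split="train")
    print(ds[0]["question"])
    print(ds[0]["code"])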

  5. MNIST IDX Dataset- Fasion

    • kaggle.com
    Updated May 21, 2025
    Cite
    ShreyaSuresh (2025). MNIST IDX Dataset- Fasion [Dataset]. https://www.kaggle.com/datasets/shreyasuresh0407/mnist-idx-dataset-fasion
    Explore at: Croissant
    Dataset updated
    May 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShreyaSuresh
    Description

    📦 About the Dataset

    This project uses a classic machine learning dataset of handwritten digits — the MNIST dataset — stored in IDX format.

    🧠 Each image is a 28x28 pixel grayscale picture of a handwritten number from 0 to 9. Your task is to teach a simple neural network (your "brain") to recognize these digits.

    🔍 What’s Inside?

    • train-images-idx3-ubyte: 🖼️ 60,000 training images (28x28 pixels each)
    • train-labels-idx1-ubyte: 🔢 Labels (0–9) for each training image
    • t10k-images-idx3-ubyte: 🖼️ 10,000 test images
    • t10k-labels-idx1-ubyte: 🔢 Labels (0–9) for test images

    All files are in the IDX binary format, which is compact and fast for loading, but needs to be parsed using a small Python function (see below 👇).
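
    A minimal sketch of such a parser, assuming the standard IDX layout (a 4-byte magic number giving the data type and number of dimensions, one big-endian uint32 per dimension, then raw unsigned bytes); the resulting train_images/train_labels arrays feed the plotting cell further below.

    import struct
    import numpy as np

    def load_idx(path):
      # magic number: two zero bytes, dtype code (0x08 = unsigned byte), number of dims
      with open(path, "rb") as f:
        _, _, _, ndims = struct.unpack(">BBBB", f.read(4))
        shape = struct.unpack(">" + "I" * ndims, f.read(4 * ndims))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

    train_images = load_idx("train-images-idx3-ubyte")  # shape (60000, 28, 28)
    train_labels = load_idx("train-labels-idx1-ubyte")  # shape (60000,)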

    ✨ Why This Dataset Is Awesome

    • 🎯 It's the “Hello World” of machine learning — perfect for beginners
    • 📊 Ideal for testing image classification algorithms
    • 🧠 Helps you learn how neural networks "see" numbers
    • 💥 Small enough to train quickly, powerful enough to learn real skills

    🧩 Sample Image

    (Add this cell below in your notebook to visualize a few images)

    import matplotlib.pyplot as plt
    
    # Show the first 10 images (train_images/train_labels as loaded from the IDX files above)
    fig, axes = plt.subplots(1, 10, figsize=(15, 2))
    for i in range(10):
      axes[i].imshow(train_images[i], cmap="gray")
      axes[i].set_title(f"Label: {int(train_labels[i])}")
      axes[i].axis("off")
    plt.show()
    
  6. Python functions -- cross-validation methods from a data-driven perspective

    • zenodo.org
    • phys-techsciences.datastations.nl
    bin, txt, zip
    Updated Aug 14, 2024
    + more versions
    Cite
    Yanwen Wang; Yanwen Wang (2024). Python functions -- cross-validation methods from a data-driven perspective [Dataset]. http://doi.org/10.17026/pt/txau9w
    Available download formats: txt, bin, zip
    Dataset updated
    Aug 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yanwen Wang; Yanwen Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 28, 2024
    Description

    These are the organised Python functions of the methods proposed in Yanwen Wang's PhD research. Researchers can directly use these functions to conduct spatial+ cross-validation (SP-CV), dissimilarity quantification by adversarial validation (AVD), and dissimilarity-adaptive cross-validation (DA-CV). A description of how to run the code is in Readme.txt. The descriptions of the functions are in functions.docx.
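
    The dataset's own function signatures are not reproduced in this listing. As a rough illustration of the general idea of spatially grouped cross-validation (not the SP-CV/AVD/DA-CV functions shipped with this dataset), scikit-learn's GroupKFold can hold out whole spatial blocks:

    # Illustrative only: generic grouped CV, NOT the functions from this dataset.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))           # hypothetical predictors
    y = X[:, 0] + rng.normal(size=200)      # hypothetical response
    blocks = rng.integers(0, 10, size=200)  # hypothetical spatial block IDs

    scores = cross_val_score(RandomForestRegressor(n_estimators=100), X, y,
                 groups=blocks, cv=GroupKFold(n_splits=5))
    print(scores.mean())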

  7. Software Defects Dataset 1k

    • kaggle.com
    Updated Jun 16, 2025
    Cite
    Ravikumar R N (2025). Software Defects Dataset 1k [Dataset]. https://www.kaggle.com/datasets/ravikumarrn/software-defects-dataset-1k/versions/1
    Explore at: Croissant
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravikumar R N
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📦 Software Defects Multilingual Dataset with AST & Token Features

    This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.

    🙋 Citation

    If you use this dataset in your research or project, please cite it as:

    "Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."

    🧠 Dataset Highlights

    • Languages Included: Python, Java, JavaScript, C, C++, Go, Rust
    • Records: 1,000 code snippets
    • Labels: defect (1 = buggy, 0 = clean)
    • Features:

      • token_count: Total tokens (AST-based for Python)
      • num_ifs, num_returns, num_func_calls: Code structure features
      • ast_nodes: Number of nodes in the abstract syntax tree (Python only)
      • lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

      📊 Columns Description

    function_name: Unique identifier for the function
    code: The actual function source code
    language: Programming language used
    lines_of_code: Approximate number of lines in the function
    cyclomatic_complexity: Simulated measure of decision complexity
    defect: 1 = buggy, 0 = clean
    token_count: Total token count (Python uses AST tokens)
    num_ifs: Count of 'if' statements
    num_returns: Count of 'return' statements
    num_func_calls: Number of function calls
    ast_nodes: AST node count (Python only, fallback = token count)

    🛠️ Usage Examples

    This dataset is suitable for:

    • Training traditional ML models like Random Forests or XGBoost
    • Evaluating prompt-based or fine-tuned LLMs (e.g., CodeT5, GPT-4)
    • Feature importance studies using AST and static code metrics
    • Cross-lingual transfer learning in code understanding
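
    As an illustration of the first use case above, a minimal sketch for training a Random Forest on the numeric columns listed in the table; the CSV file name is a placeholder for whatever the Kaggle download provides.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("software_defects_1k.csv")  # placeholder file name
    features = ["token_count", "num_ifs", "num_returns", "num_func_calls",
          "ast_nodes", "lines_of_code", "cyclomatic_complexity"]
    X_train, X_test, y_train, y_test = train_test_split(
      df[features], df["defect"], test_size=0.2, random_state=42, stratify=df["defect"])

    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))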

    📎 License

    This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.

  8. HUN GW Uncertainty Analysis v01

    • cloud.csiss.gmu.edu
    • researchdata.edu.au
    • +2more
    zip
    Updated Dec 13, 2019
    Cite
    Australia (2019). HUN GW Uncertainty Analysis v01 [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a
    Available download formats: zip
    Dataset updated
    Dec 13, 2019
    Dataset provided by
    Australia
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.

    References:

    Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.

    Dataset History

    This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).

    A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).

    R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file contains, for each parameter, whether it is included in the sensitivity analysis, whether it is tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.

    The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv, HUN_GW_DoE_mean_BL_BF_hist.csv which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observed groundwater levels and SW-GW fluxes respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).

    Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction; the name of the prediction, transformation, min, max and median of design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.

    Spreadsheet HUN_GW_Observations.csv has additional information on each observation; the name of the observation, a boolean to indicate to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value and the number of observations at this location (from dataset HUN bores v01). Further it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux. The observed values are from dataset HUN Groundwater Flowrate Time Series v01

    These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py and HUN_GW_mean_BF_hist_SI.csv.

    Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function which is a weighted sum of the residuals between observations and predictions with weights based on the distance between observation and prediction. In addition to that there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.

    The latter files are used in scripts HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology as described in Herron et al (2016) by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These files are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.

    The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.

    The posterior parameter distributions are sampled with scripts HUN_GW_dmax_tmax_MCsampler.R and the associated .slurm batch file. The script creates and applies an emulator for each prediction. The emulator and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and associated output is included for illustrative purposes.

    Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. This spreadsheet contains for all predictions the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and the threshold of the objective function and the acceptance rate.

    The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are, for one prediction, different parameter distributions, in which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv.

    Dataset Citation

    Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.

    Dataset Ancestors

  9. DNP3 Intrusion Detection Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 15, 2024
    Cite
    Panagiotis (2024). DNP3 Intrusion Detection Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7348493
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Thomas
    Panagiotis
    Vasiliki
    Vasileios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Introduction

    In the digital era of the Industrial Internet of Things (IIoT), the conventional Critical Infrastructures (CIs) are transformed into smart environments with multiple benefits, such as pervasive control, self-monitoring and self-healing. However, this evolution is characterised by several cyberthreats due to the necessary presence of insecure technologies. DNP3 is an industrial communication protocol which is widely adopted in the CIs of the US. In particular, DNP3 allows the remote communication between Industrial Control Systems (ICS) and Supervisory Control and Data Acquisition (SCADA). It can support various topologies, such as Master-Slave, Multi-Drop, Hierarchical and Multiple-Server. Initially, the architectural model of DNP3 consisted of three layers: (a) Application Layer, (b) Transport Layer and (c) Data Link Layer. However, DNP3 can now be incorporated into the Transmission Control Protocol/Internet Protocol (TCP/IP) stack as an application-layer protocol. However, similarly to other industrial protocols (e.g., Modbus and IEC 60870-5-104), DNP3 is characterised by severe security issues since it does not include any authentication or authorisation mechanisms. More information about the DNP3 security issues is provided in [1-3]. This dataset contains labelled Transmission Control Protocol (TCP) / Internet Protocol (IP) network flow statistics (Comma-Separated Values - CSV format) and DNP3 flow statistics (CSV format) related to 9 DNP3 cyberattacks. These cyberattacks are focused on DNP3 unauthorised commands and Denial of Service (DoS). The network traffic data are provided through Packet Capture (PCAP) files. Consequently, this dataset can be used to implement Artificial Intelligence (AI)-powered Intrusion Detection and Prevention Systems (IDPS) that rely on Machine Learning (ML) and Deep Learning (DL) techniques.

    2. Instructions

    This DNP3 Intrusion Detection Dataset was implemented following the methodological frameworks of A. Gharib et al. in [4] and S. Dadkhah et al. in [5], including eleven features: (a) Complete Network Configuration, (b) Complete Traffic, (c) Labelled Dataset, (d) Complete Interaction, (e) Complete Capture, (f) Available Protocols, (g) Attack Diversity, (h) Heterogeneity, (i) Feature Set and (j) Metadata.

    A network topology consisting of (a) eight industrial entities, (b) one Human Machine Interface (HMI) and (c) three cyberattackers was used to implement this DNP3 Intrusion Detection Dataset. In particular, the following cyberattacks were implemented.

    On Thursday, May 14, 2020, the DNP3 Disable Unsolicited Messages Attack was executed for 4 hours.

    On Friday, May 15, 2020, the DNP3 Cold Restart Message Attack was executed for 4 hours.

    On Friday, May 15, 2020, the DNP3 Warm Restart Message Attack was executed for 4 hours.

    On Saturday, May 16, 2020, the DNP3 Enumerate Attack was executed for 4 hours.

    On Saturday, May 16, 2020, the DNP3 Info Attack was executed for 4 hours.

    On Monday, May 18, 2020, the DNP3 Initialisation Attack was executed for 4 hours.

    On Monday, May 18, 2020, the Man In The Middle (MITM)-DoS Attack was executed for 4 hours.

    On Monday, May 18, 2020, the DNP3 Replay Attack was executed for 4 hours.

    On Tuesday, May 19, 2020, the DNP3 Stop Application Attack was executed for 4 hours.

    The aforementioned DNP3 cyberattacks were executed, utilising penetration testing tools, such as Nmap and Scapy. For each attack, a relevant folder is provided, including the network traffic and the network flow statistics for each entity. In particular, for each cyberattack, a folder is given, providing (a) the pcap files for each entity, (b) the Transmission Control Protocol (TCP)/Internet Protocol (IP) network flow statistics for 120 seconds in a CSV format and (c) the DNP3 flow statistics for each entity (using different timeout values in terms of seconds, such as 45, 60, 75, 90, 120 and 240 seconds). The TCP/IP network flow statistics were produced by using the CICFlowMeter, while the DNP3 flow statistics were generated based on a Custom DNP3 Python Parser, taking full advantage of Scapy.

    3. Dataset Structure

    The dataset consists of the following folders:

    20200514_DNP3_Disable_Unsolicited_Messages_Attack: It includes the pcap and CSV files related to the DNP3 Disable Unsolicited Message attack.

    20200515_DNP3_Cold_Restart_Attack: It includes the pcap and CSV files related to the DNP3 Cold Restart attack.

    20200515_DNP3_Warm_Restart_Attack: It includes the pcap and CSV files related to DNP3 Warm Restart attack.

    20200516_DNP3_Enumerate: It includes the pcap and CSV files related to the DNP3 Enumerate attack.

    20200516_DNP3_Ιnfo: It includes the pcap and CSV files related to the DNP3 Info attack.

    20200518_DNP3_Initialize_Data_Attack: It includes the pcap and CSV files related to the DNP3 Data Initialisation attack.

    20200518_DNP3_MITM_DoS: It includes the pcap and CSV files related to the DNP3 MITM-DoS attack.

    20200518_DNP3_Replay_Attack: It includes the pcap and CSV files related to the DNP3 replay attack.

    20200519_DNP3_Stop_Application_Attack: It includes the pcap and CSV files related to the DNP3 Stop Application attack.

    Training_Testing_Balanced_CSV_Files: It includes balanced CSV files from CICFlowMeter and the Custom DNP3 Python Parser that could be utilised for training ML and DL methods. Each folder includes different sub-folders for the corresponding flow timeout values used by the DNP3 Python Custom Parser. For CICFlowMeter, only the timeout value of 120 seconds was used.

    Each folder includes respective subfolders related to the entities/devices (described in the following section) participating in each attack. In particular, for each entity/device, there is a folder including (a) the DNP3 network traffic (pcap file) related to this entity/device during each attack, (b) the TCP/IP network flow statistics (CSV file) generated by CICFlowMeter for the timeout value of 120 seconds and finally (c) the DNP3 flow statistics (CSV file) from the Custom DNP3 Python Parser. Finally, it is noteworthy that the network flows from both CICFlowMeter and Custom DNP3 Python Parser in each CSV file are labelled based on the DNP3 cyberattacks executed for the generation of this dataset. The description of these attacks is provided in the following section, while the various features from CICFlowMeter and Custom DNP3 Python Parser are presented in Section 5.
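
    A minimal sketch (assuming hypothetical file paths and a "Label" column name, since the exact CSV schema is not reproduced in this listing) for loading the labelled flow-statistics CSVs with pandas before training ML/DL models:

    import glob
    import pandas as pd

    # folder name taken from the dataset structure above; label column name assumed
    paths = glob.glob("Training_Testing_Balanced_CSV_Files/**/*.csv", recursive=True)
    flows = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    print(flows.shape)
    print(flows["Label"].value_counts())  # assumed label column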

    4. Testbed & DNP3 Attacks

    The following figure (not reproduced in this listing) shows the testbed utilised for the generation of this dataset. It is composed of eight industrial entities that play the role of the DNP3 outstations/slaves, such as Remote Terminal Units (RTUs) and Intelligent Electronic Devices (IEDs). Moreover, there is another workstation which plays the role of the master station, like a Master Terminal Unit (MTU). For the communication between the DNP3 outstations/slaves and the master station, opendnp3 was used.

    Table 1: DNP3 Attacks Description

    • DNP3 Disable Unsolicited Message Attack: This attack targets a DNP3 outstation/slave, establishing a connection with it while acting as a master station. The false master then transmits a packet with the DNP3 Function Code 21, which requests to disable all the unsolicited messages on the target. (Folder: 20200514_DNP3_Disable_Unsolicited_Messages_Attack)

    • DNP3 Cold Restart Attack: The malicious entity acts as a master station and sends a DNP3 packet that includes the “Cold Restart” function code. When the target receives this message, it initiates a complete restart and sends back a reply with the time window before the restart process. (Folder: 20200515_DNP3_Cold_Restart_Attack)

    • DNP3 Warm Restart Attack: This attack is quite similar to the “Cold Restart Message”, but aims to trigger a partial restart, re-initiating a DNP3 service on the target outstation. (Folder: 20200515_DNP3_Warm_Restart_Attack)

    • DNP3 Enumerate Attack: This reconnaissance attack aims to discover which DNP3 services and function codes are used by the target system. (Folder: 20200516_DNP3_Enumerate)

    • DNP3 Info Attack: This attack constitutes another reconnaissance attempt, aggregating various DNP3 diagnostic information related to the DNP3 usage. (Folder: 20200516_DNP3_Ιnfo)

    • Data Initialisation Attack: This cyberattack is related to Function Code 15 (Initialize Data). It is an unauthorised access attack, which demands that the slave re-initialise possible configurations to their initial values, thus changing potential values defined by legitimate masters. (Folder: 20200518_Initialize_Data_Attack)

    • MITM-DoS Attack: In this cyberattack, the cyberattacker is placed between a DNP3 master and a DNP3 slave device, dropping all the messages coming from the DNP3 master or the DNP3 slave. (Folder: 20200518_MITM_DoS)

    • DNP3 Replay Attack: This cyberattack replays DNP3 packets coming from a legitimate DNP3 master or DNP3 slave. (Folder: 20200518_DNP3_Replay_Attack)

    • DNP3 Stop Application Attack: This attack is related to Function Code 18 (Stop Application) and demands that the slave stop its function so that the slave cannot receive messages from the master. (Folder: 20200519_DNP3_Stop_Application_Attack)

    5. Features

    The TCP/IP network flow statistics generated by CICFlowMeter are summarised below. The TCP/IP network flows and their statistics generated by CICFlowMeter are labelled based on the DNP3 attacks described above, thus allowing the training of ML/DL models. Finally, it is worth mentioning that these statistics are generated when the flow timeout value is equal to 120 seconds.

    Table

  10. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Explore at: Croissant
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, and visualize data distributions, correlations and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original Github repo


    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)  # 'data' is your own raw data source
    
    # klib.describe functions for visualizing datasets
    klib.cat_plot(df)       # returns a visualization of the number and frequency of categorical features
    klib.corr_mat(df)       # returns a color-encoded correlation matrix
    klib.corr_plot(df)      # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)      # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)  # returns a figure containing information about missing values
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  11. In-vitro dataset for classification and regression of stenosis: dependence...

    • zenodo.org
    zip
    Updated Nov 12, 2022
    Cite
    Stefan Bernhard; Michelle Wisotzki; Alexander Mair; Stefan Bernhard; Michelle Wisotzki; Alexander Mair (2022). In-vitro dataset for classification and regression of stenosis: dependence on heart rate, waveform and location [Dataset]. http://doi.org/10.5281/zenodo.6421498
    Available download formats: zip
    Dataset updated
    Nov 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stefan Bernhard; Michelle Wisotzki; Alexander Mair; Stefan Bernhard; Michelle Wisotzki; Alexander Mair
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    This data supplements the paper "Classification and regression of stenosis using an in-vitro pulse wave dataset:
    dependence on heart rate, waveform and location". It was created at Technische Hochschule Mittelhessen (THM) in Germany and uploaded to Zenodo. Please cite the paper and the Zenodo doi when using this dataset.

    General description / Dataset structure

    Each mat-file describes a different measurement (details can be found in the paper). There are 17 pressure signals for different positions, one flow sensor close to the stenosis location and one monitor signal of the proportional valve used to control the input curve. The total duration of each signal is 60 s with a sampling rate of 1000 Hz. Each mat-file contains a header structure with metadata and a struct array with the signals of each sensor. Signals in each mat-file are aligned with respect to a common time axis, but this is not guaranteed between different measurements/files. We did our best to make the beginnings and endings align as closely as possible (by removing buffer artefacts and aligning the input signal of the monitor); however, algorithms should not rely on a global time axis. This is similar to patient measurements without an ECG, which also do not share a global time axis comparable among patients.

    The files can be loaded directly in Matlab, or in Python with scipy's loadmat function.
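
    A minimal sketch for Python, assuming a hypothetical file path taken from the folder structure shown below; the names of the header and signal structs inside the mat-file are not reproduced here, so only the keys are inspected.

    from scipy.io import loadmat

    # path assumed from the folder structure described below
    mat = loadmat("No Stenosis/HR 50/WaveForm1.mat", squeeze_me=True, struct_as_record=False)
    print(mat.keys())  # inspect the header structure and the signal struct array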

    The data is structured first by stenosis "state" (or location), then by heart rate and then by heart waveform. The stenosis "states" can be divided into a subset of 10 folders created for regression and 6 created for classification. Excerpt of the folder structure:

    • No Stenosis
      • HR 50
        • WaveForm1.mat
        • WaveForm2.mat
        • ...
      • HR 55
        • ...
      • ...
    • Regression - Stenosis at Pos01
      • HR 50
        • ...
      • ...
    • ...

    The tools also available at this page help with traversing this folder structure and are available for Python and Matlab.

    Data Fields of each file

    headerStruct
    • id: internal database id
    • name: stenosis location
    • rate: sampling rate in Hz
    • description: definition of automatic parameter sweep range
    • configuration: concrete parameters of the trapezoidal input curve (offset and amplitude in mmHg, ascend and descend times and smoothing window as a fraction of the time period (1.2 s))

    signalStruct
    • nodeId: corresponds to the numbered nodes at which the sensor is placed; the corresponding location can be found in the technical paper describing the MACSim simulator (node numbering, not sensor numbers) or in the software SISCA in the example database.
    • type: 'p' ... pressure or 'q' ... flow
    • data: double array, time series of each sensor, unit mmHg for type 'p' and ml/s for type 'q'
    • anatomicalPosition: name of the corresponding anatomical position

    Tools:

    These tools should make it easier to load the dataset. Their usage is documented in the respective code files.

    Code for the publication is available here:
    https://gitlab.com/agbernhard.lse.thm/publication_macsim_machinelearning

  12. Texas Synthetic Power System Test Case (TX-123BT).zip

    • figshare.com
    zip
    Updated Mar 8, 2024
    Cite
    Jin Lu; Xingpeng Li (2024). Texas Synthetic Power System Test Case (TX-123BT).zip [Dataset]. http://doi.org/10.6084/m9.figshare.22144616.v6
    Available download formats: zip
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    figshare
    Authors
    Jin Lu; Xingpeng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Texas
    Description

    The dataset of the synthetic Texas 123-bus backbone transmission (TX-123BT) system. The procedures and details to create the TX-123BT system are described in the paper below: Jin Lu, Xingpeng Li et al., “A Synthetic Texas Backbone Power System with Climate-Dependent Spatio-Temporal Correlated Profiles”. If you use this dataset in your work, please cite the paper above.

    *** Introduction:
    The TX-123BT system has similar temporal and spatial characteristics as the actual Electric Reliability Council of Texas (ERCOT) system. The TX-123BT system has a backbone network consisting of only high-voltage transmission lines distributed in the Texas territory. It includes time series profiles of renewable generation, electrical load, and transmission thermal limits for 5 years from 2017 to 2021. The North American Land Data Assimilation System (NLDAS) climate data is extracted and used to create the climate-dependent time series profiles mentioned above. Two sets of climate-dependent dynamic line rating (DLR) profiles are created: (i) daily DLR and (ii) hourly DLR.

    *** Power system configuration data:
    'Bus_data.csv': Bus data including bus name and location (longitude & latitude, weather zone).
    'Line_data.csv': Line capacity and terminal bus information.
    'Generator_data.xlsx':
      'Gen_data' sheet: Generator parameters including active/reactive capacity, fuel type, cost and ramping rate.
      'Solar Plant Number' sheet: Correspondence between the solar plant number and generator number.
      'Wind Plant Number' sheet: Correspondence between the wind plant number and generator number.

    *** Time series profiles:
    'Climate_5y' folder: Includes each day's climate data for solar radiation, air temperature, and wind speed near the surface at 10 meter height. Each file in the folder includes the hourly temperature, longwave & shortwave solar radiation, and zonal & meridional wind speed data of a day in 2019.
    'Hourly_line_rating_5y' folder: Includes the hourly dynamic line rating for each day in the year. Each file includes the hourly line rating (MW) of a line for all hours in the year. In each file, columns represent hours 1-24 in a day, rows represent days 1-365 in the year.
    'Daily_line_rating_5y' folder: The daily dynamic line rating (MW) for all lines and all days in the year.
    'solar_5y' folder: Solar production for all the solar farms in the TX-123BT and for all the days in the year. Each file includes the hourly solar production (MW) of all the solar plants for a day in the year. In each file, columns represent hours 1-24 in a day, rows represent solar plants 1-72.
    'wind_5y' folder: Wind production for all the wind farms in the case and for all the days in the year. Each file includes the hourly wind production (MW) of all the wind plants for a day in the year. In each file, columns represent hours 1-24 in a day, rows represent wind plants 1-82.
    'load_5y' folder: Includes each day's hourly load data on all the buses. Each file includes the hourly nodal loads (MW) of all the buses in a day in the year. In each file, columns represent buses 1-123, rows represent hours 1-24 in a day.

    *** Python codes to run security-constrained unit commitment (SCUC) for TX-123BT profiles
    Recommended Python version: Python 3.11. Required packages: Numpy, pyomo, pypower, pickle. Requires a solver that can be called by pyomo to solve the SCUC optimization problem.

    * 'Sample_Codes_SCUC' folder: A standard SCUC model. The load, solar generation, and wind generation profiles are provided by the 'load_annual', 'solar_annual', and 'wind_annual' folders. The daily line rating profiles are provided by 'Line_annual_Dmin.txt'.
    'power_mod.py': defines the python class for the power system.
    'UC_function.py': defines functions to build, solve, and save results for the pyomo SCUC model.
    'formpyomo_UC': defines the function to create the input file for the pyomo model.
    'Run_SCUC_annual': run this file to perform SCUC simulation on the selected days of the TX-123BT profiles.
    Steps to run SCUC simulation:
    1) Set up the python environment.
    2) Set the solver location: 'UC_function.py' => 'solve_UC' function => UC_solver=SolverFactory('solver_name',executable='solver_location')
    3) Set the days you want to run SCUC: 'Run_SCUC_annual.py' => last row: run_annual_UC(case_inst,start_day,end_day). For example, to run SCUC simulations for the 125th-146th days in 2019, the last row of the file is 'run_annual_UC(case_inst,125,146)'. You can also run a single day's SCUC simulation by using 'run_annual_UC(case_inst,single_day,single_day)'.

    * 'Sample_Codes_SCUC_HourlyDLR' folder: The SCUC model considering hourly dynamic line rating (DLR) profiles. The load, solar generation, and wind generation profiles are provided by the 'load_annual', 'solar_annual', and 'wind_annual' folders. The hourly line rating profiles in 2019 are provided by the 'dynamic_rating_result' folder.
    'power_mod.py': defines the python class for the power system.
    'UC_function_DLR.py': defines functions to build, solve, and save results for the pyomo SCUC model (with hourly DLR).
    'formpyomo_UC': defines the function to create the input file for the pyomo model.
    'RunUC_annual_dlr': run this file to perform SCUC simulation (with hourly DLR) on the selected days of the TX-123BT profiles.
    Steps to run SCUC simulation (with hourly DLR):
    1) Set up the python environment.
    2) Set the solver location: 'UC_function_DLR.py' => 'solve_UC' function => UC_solver=SolverFactory('solver_name',executable='solver_location')
    3) Set the daily profiles for SCUC simulation: 'RunUC_annual_dlr.py' => last row: run_annual_UC_dlr(case_inst,start_day,end_day). For example, to run SCUC simulations (with hourly DLR) for the 125th-146th days in 2019, the last row of the file is 'run_annual_UC_dlr(case_inst,125,146)'. You can also run a single day's SCUC simulation (with hourly DLR) by using 'run_annual_UC_dlr(case_inst,single_day,single_day)'.

    The SCUC / SCUC-with-DLR simulation results are saved in the 'UC_results' folders under the corresponding folder. Under the 'UC_results' folder:
    'UCcase_Opcost.txt': total operational cost ($).
    'UCcase_pf.txt': the power flow results (MW). Rows represent lines, columns represent hours.
    'UCcase_pfpct.txt': the percentage of the power flow to the line capacity (%). Rows represent lines, columns represent hours.
    'UCcase_pgt.txt': the generator output power (MW). Rows represent conventional generators, columns represent hours.
    'UCcase_lmp.txt': the locational marginal price ($/MWh). Rows represent buses, columns represent hours.

    *** Geographic information system (GIS) data:
    'Texas_GIS_Data' folder: includes the geographic information system (GIS) data of the TX-123BT system configurations and ERCOT weather zones. The GIS data can be viewed and edited using GIS software: ArcGIS. The subfolders are:
    'Bus' folder: the shapefile of bus data for the TX-123BT system.
    'Line' folder: the shapefile of line data for the TX-123BT system.
    'Weather Zone' folder: the shapefile of the weather zones in the Electric Reliability Council of Texas (ERCOT).

    *** Maps (pictures) of the TX-123BT & ERCOT weather zones
    'Maps_TX123BT_WeatherZone' folder:
    1) 'TX123BT_Noted.jpg': The map (picture) of the TX-123BT transmission network. Buses are in blue and lines are in green.
    2) 'Area_Houston_Noted.jpg', 'Area_Dallas_Noted.jpg', 'Area_Austin_SanAntonio_Noted.jpg': Maps for different areas including Houston, Dallas, and Austin-San Antonio are also provided.
    3) 'Weather_Zone.jpg': The map of ERCOT weather zones. It was plotted by the author and may be slightly different from the actual ERCOT weather zones.

    *** Funding
    This project is supported by the Alfred P. Sloan Foundation.

    *** License:
    This work is licensed under the terms of the Creative Commons Attribution 4.0 (CC BY 4.0) license.

    *** Disclaimer:
    The author doesn't make any warranty for the accuracy, completeness, or usefulness of any information disclosed, and the author assumes no liability or responsibility for any errors or omissions in the information (data/code/results etc.) disclosed.

    *** Contributions:
    Jin Lu created this dataset. Xingpeng Li supervised this work. Hongyi Li and Taher Chegini provided the raw historical climate data (extracted from an open-access dataset - NLDAS).
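
    A minimal sketch for inspecting the configuration CSVs named above with pandas; the column names are not reproduced in this listing, so only the shapes are printed.

    import pandas as pd

    buses = pd.read_csv("Bus_data.csv")   # bus name, longitude/latitude, weather zone
    lines = pd.read_csv("Line_data.csv")  # line capacity and terminal bus information
    print(buses.shape, lines.shape)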

  13. Namoi groundwater uncertainty analysis

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 20, 2019
    Cite
    Bioregional Assessment Program (2019). Namoi groundwater uncertainty analysis [Dataset]. https://data.gov.au/data/dataset/groups/36bd27e9-58d2-4bf2-8e4a-54b22ac98cfb
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    Area covered
    Namoi River
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    The dataset contains the predictions of maximum drawdown and time to maximum drawdown at all groundwater model nodes in the Namoi subregion, constrained by the observations of groundwater level, river flux and mine water production rates. The dataset also contains the scripts required for and the results of the sensitivity analysis. The dataset contains all the scripts to generate these results from the outputs of the groundwater model (Namoi groundwater model dataset) and all the spreadsheets with the results. The methodology and results are described in Janardhanan et al. (2017)

    References

    Janardhanan S, Crosbie R, Pickett T, Cui T, Peeters L, Slatter E, Northey J, Merrin LE, Davies P, Miotlinski K, Schmid W and Herr A (2017) Groundwater numerical modelling for the Namoi subregion. Product 2.6.2 for the Namoi subregion from the Northern Inland Catchments Bioregional Assessment. Department of the Environment and Energy, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia., http://data.bioregionalassessments.gov.au/product/NIC/NAM/2.6.2.

    Dataset History

    The workflow that underpins this dataset is captured in 'NAM_MF_UA_workflow.png'.

    Spreadsheet NAM_MF_dmax_Predictions_all.csv is sourced from dataset 'Namoi groundwater model' and contains the name, coordinates, Bore_ID in the model, layer number, the name of the objective function and the minimum, maximum, median, 5th percentile and 95th percentile of the design of experiment runs of maximum drawdown (dmax) for each groundwater model node. The individual results for each node for each run of the design of experiment is stored in spreadsheet 'NAM_MF_dmax_DoE_Predictions_all.csv' The equivalent files for time to maximum drawdown (tmax) are 'NAM_MF_tmax_Predictions_all.csv' and 'NAM_MF_tmax_DoE_Predictions_all.csv'.

    These files are combined with the file 'NAM_MF_Observations_all.csv', which contains the observed values for groundwater levels, mine dewatering rates and river flux, and the files NAM_MF_dist_hobs.csv, NAM_MF_dist_rivers.csv, NAM_MF_dist_mines.csv, which contain the distances of the predictions to each mine, groundwater level observation and river, in python script 'NAM_MF_datawranling.csv'. This script selects only those predictions where the 95th percentile of dmax is less than 1 cm for further analysis. The subset of predictions is stored in 'NAM_MF_dmax_Predictions.csv','NAM_MF_tmax_Predictions.csv', 'NAM_MF_dmax_DoE_Predictions.csv','NAM_MF_tmax_DoE_Predictions.csv'. The output spreadsheet 'NAM_MF_Observations.csv' has the observations and the distances to the selected predictions.

    As the simulated equivalents to the observations are part of the predictions dataset, these files are combined in python script NAM_MF_OFs.py to generate the objective function values for each run and each prediction. The objective function values are weighted sums of the residuals, stored in NAM_MF_DoE_hres.csv, NAM_MF_DoE_mres.csv, NAM_MF_DoE_rres.csv, according to the distance to the predictions and the results are stored in NAM_MF_DoE_OFh.csv, NAM_MF_DoE_OFm.csv, NAM_MF_DoE_OFr.csv. The threshold values for each objective function and prediction are stored in NAM_MF_OF_thresholds.csv. Python script NAM_MF_OF_wrangling.py further post-processes this information to generate the acceptance rates, saved in spreadsheet NAM_MF_dmax_Predictions_ARs.csv

    Python script NAM_MF_CreatePosterior.py selects the results from the design of experiment run that satisfy the acceptance criteria. The results form the posterior predictive distributions stored in NAM_MF_dmax_Posterior.csv and NAM_MF_tmax_Posterior.csv. These are further summarised in NAM_MF_Predictions_summary.csv.

    The sensitivity analysis is done with script NAM_MF_SI.py, which uses the results of the design of experiment together with the parameter values, stored in NAM_MF_DoE_Parameters.csv and their description (name, range, transform) in NAM_MF_Parameters.csv. The resulting sensitivity indices for dmax, tmax and river, head and minewater flow observations are stored in NAM_MF_SI_dmax.csv, NAM_MF_SI_tmax.csv, NAM_MF_SI_river.csv, NAM_MF_SI_mine.csv and NAM_MF_SI_head.csv. The intermediate files, ending in xxxx, are the results grouped per 100 predictions. The scripts NAM_MF_SI_collate.py and NAM_MF_SI_collate.slurm collate these.

    Dataset Citation

    Bioregional Assessment Programme (2017) Namoi groundwater uncertainty analysis. Bioregional Assessment Derived Dataset. Viewed 11 December 2018, http://data.bioregionalassessments.gov.au/dataset/36bd27e9-58d2-4bf2-8e4a-54b22ac98cfb.

    Dataset Ancestors

  14. Magicoder-Evol-Instruct-110K-python

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    pxy (2024). Magicoder-Evol-Instruct-110K-python [Dataset]. https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python
    Explore at: Croissant
    Dataset updated
    Nov 17, 2024
    Authors
    pxy
    Description

    Dataset Card for "Magicoder-Evol-Instruct-110K-python"

    from datasets import load_dataset

    # Load your dataset
    dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train")  # Replace with your dataset and split

    # Define a filter function
    def contains_python(entry):
      for c in entry["messages"]:
        if "python" in c["content"].lower():
          return True
      return False
      # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search

    … See the full description on the dataset page: https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python.

  15. Niko Chord Progression Dataset

    • paperswithcode.com
    Updated Aug 31, 2022
    Cite
    Li Yi; Haochen Hu; Jingwei Zhao; Gus Xia (2022). Niko Chord Progression Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/niko-chord-progression-dataset
    Dataset updated
    Aug 31, 2022
    Authors
    Li Yi; Haochen Hu; Jingwei Zhao; Gus Xia
    Description

    Introduction: The Niko Chord Progression Dataset is used in AccoMontage2. It contains 5k+ chord progression pieces, labeled with styles. There are four styles in total: Pop Standard, Pop Complex, Dark and R&B. Some progressions have an 'Unknown' style. Some statistics are provided below.

    • Note Pitch: mean 57, variance 167.70
    • Note Velocity: mean 79.05, variance 457.89
    • Note Duration (in seconds): mean 1.38, variance 1.62

    Data Formats: You can access the Niko Chord Progression Dataset in two formats: MIDI format and the quantized note matrix format.

    MIDI (dataset.zip) Each chord progression piece is stored as a single MIDI file.

    Quantized Note Matrix (dataset.pkl) A python dictionary with a format like the following. nmat is a 2-d matrix; each row represents a quantized note: [start, end, pitch, velocity]. Each note is quantized at the eighth-note level, e.g., start=2 means the note begins at the third eighth note. root is also a 2-d matrix. It labels the roots of the chords using an eighth-note sample rate. Each row of root represents a bar. Each element is an integer ranging from 0 (C note) to 11 (B note).

    {'piece name': 
      {'nmat': [[0, 3, 60, 60], ...],  # 2-d matrix: note matrix
       'root': [[0,0,0,0,0,0,0,0], ...], # 2-d matrix: root label
       'style': 'some style',      # pop_standard, pop_complex, dark, r&b, unknown
       'mode': 'some mode',       # M, m
       'tonic': 'some tonic'       # C, Db, ... B
      }, 
     ...
    }
    
    # load the dataset using pickle
    import pickle
    with open('dataset_path_and_name.pkl', 'rb') as file:
      dataset = pickle.load(file)
    
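    A short usage sketch, assuming the dictionary has been loaded as above (the piece name is a placeholder):

    # Iterate over the quantized notes of one piece; rows are [start, end, pitch, velocity],
    # with start and end given as eighth-note indices.
    piece = dataset['some piece name']  # placeholder key; use a real piece name from the dictionary
    for start, end, pitch, velocity in piece['nmat']:
        duration_in_eighths = end - start
    print(piece['style'], piece['mode'], piece['tonic'])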

    Supplementary description

    Original Dataset: The Niko Chord Progression Dataset is a re-organized version of the original Niko Dataset. The original Niko Dataset had duplicate progressions and unnecessary labels; it was therefore processed and converted into this version.

    Style Mapping: The style labels were mapped from the original dataset to the new dataset. In the original dataset the style label is stored as folder names, so the style can be obtained from the file path. The following shows a detailed description of the style mapping.

    // Structure of the original dataset
    .
    ├─A Major - F# Minor          ---> progressions are sorted based on tonics and modes
    │ ├─1 - Best Melodies         ---> eliminated
    │ │ ├─Catchy
    │ │ ├─Dark_HipHop_Trap
    │ │ ├─EDM
    │ │ ├─Emotional
    │ │ ├─Pop
    │ │ └─R&B_Neosoul
    │ ├─2 - Best Chords
    │ │ ├─Dark_HipHop_Trap        ---> New style: Dark
    │ │ ├─EDM
    │ │ │ ├─Classy_7th_9th        ---> New style: Pop Complex
    │ │ │ ├─Emotional             ---> New style: Pop Complex
    │ │ │ └─Standard              ---> New style: Pop Standard
    │ │ ├─Emotional               ---> New style: Pop Complex
    │ │ ├─Pop
    │ │ │ ├─Classy_7th_9th        ---> New style: Pop Complex
    │ │ │ ├─Emotional             ---> New style: Pop Complex
    │ │ │ └─Standard              ---> New style: Pop Standard
    │ │ └─R&B_Neosoul             ---> New style: R&B
    │ └─3 - Rest Of Pack
    │   ├─A-Bm-D (I-ii-IV)        ---> progressions sorted based on root pattern
    │   │ ├─Arps                  ---> eliminated
    │   │ ├─Basslines             ---> eliminated
    │   │ ├─Chord Breakdown       ---> New style: Unknown
    │   │ ├─Chord Progression     ---> New style: Unknown
    │   │ ├─Epic Endings          ---> eliminated
    │   │ ├─Fast Chord Rhythm     ---> eliminated
    │   │ │ ├─Back & Forth
    │   │ │ └─Same Time
    │   │ ├─Melodies              ---> eliminated
    │   │ │ ├─115-130bpm
    │   │ │ ├─130-160bpm
    │   │ │ ├─160-180bpm
    │   │ │ └─90-115bpm
    │   │ └─Slow Chord Rhythm     ---> New style: Unknown
    ...

    Cite

    L. Yi, H. Hu, J. Zhao, and G. Xia, "AccoMontage2: A Complete Harmonization and Accompaniment Arrangement System", in Proceedings of the 23rd International Society for Music Information Retrieval Conference, Bengaluru, India, 2022.

    License

    MIT Licensed. Copyright © 2022 New York University Shanghai Music X Lab. All rights reserved.

  16. P

    PhysioNet Challenge 2020 Dataset

    • paperswithcode.com
    Updated Dec 30, 2020
    Cite
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna (2020). PhysioNet Challenge 2020 Dataset [Dataset]. https://paperswithcode.com/dataset/physionet-challenge-2020
    Explore at:
    Dataset updated
    Dec 30, 2020
    Authors
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna
    Description

    Data

    The data for this Challenge are from multiple sources:

    - CPSC Database and CPSC-Extra Database
    - INCART Database
    - PTB and PTB-XL Database
    - The Georgia 12-lead ECG Challenge (G12EC) Database
    - Undisclosed Database

    The first source is the public (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from the CPSC2018 is NOT the test data from the CPSC2018; the test data of the CPSC2018 is included in the final private database that has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) 12-lead ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.

    The second source set is the public dataset from St Petersburg INCART 12-lead Arrhythmia Database. This database consists of 74 annotated recordings extracted from 32 Holter records. Each record is 30 minutes long and contains 12 standard leads, each sampled at 257 Hz.

    The third source from the Physikalisch Technische Bundesanstalt (PTB) comprises two public databases: the PTB Diagnostic ECG Database and the PTB-XL, a large publicly available electrocardiography dataset. The first PTB database contains 516 records (male: 377, female: 139). Each recording was sampled at 1000 Hz. The PTB-XL contains 21,837 clinical 12-lead ECGs (male: 11,379 and female: 10,458) of 10 second length with a sampling frequency of 500 Hz.

    The fourth source is a Georgia database which represents a unique demographic of the Southeastern United States. This training set contains 10,344 12-lead ECGs (male: 5,551, female: 4,793) of 10 second length with a sampling frequency of 500 Hz.

    The fifth source is an undisclosed American database that is geographically distinct from the Georgia database. This source contains 10,000 ECGs (all retained as test data).

    All data is provided in WFDB format. Each ECG recording has a binary MATLAB v4 file (see page 27) for the ECG signal data and a text file in WFDB header format describing the recording and patient attributes, including the diagnosis (the labels for the recording). The binary files can be read using the load function in MATLAB and the scipy.io.loadmat function in Python; please see our baseline models for examples of loading the data. The first line of the header provides information about the total number of leads and the total number of samples or points per lead. The following lines describe how each lead was saved, and the last lines provide information on demographics and diagnosis. Below is an example header file A0001.hea:

    A0001 12 500 7500 05-Feb-2020 11:39:16
    A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
    A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
    A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
    A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
    A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
    A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
    A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
    A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
    A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
    A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
    A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
    A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
    
    Age: 74
    Sex: Male
    Dx: 426783006
    Rx: Unknown
    Hx: Unknown
    Sx: Unknown
    

    From the first line, we see that the recording number is A0001, and the recording file is A0001.mat. The recording has 12 leads, each recorded at 500 Hz sample frequency, and contains 7500 samples. From the next 12 lines, we see that each signal was written at 16 bits with an offset of 24 bits, the amplitude resolution is 1000 with units in mV, the resolution of the analog-to-digital converter (ADC) used to digitize the signal is 16 bits, and the baseline value corresponding to 0 physical units is 0. The first value of the signal, the checksum, and the lead name are included for each signal. From the final 6 lines, we see that the patient is a 74-year-old male with a diagnosis (Dx) of 426783006. The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.
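    A minimal sketch of reading one recording in Python, assuming (as in the Challenge baseline examples) that the MATLAB v4 file stores the signal matrix under the key 'val':

    import numpy as np
    from scipy.io import loadmat

    def load_recording(mat_path, hea_path):
        # One row per lead; values are the raw integers, to be scaled by the per-lead gain from the header.
        signal = np.asarray(loadmat(mat_path)["val"], dtype=float)
        with open(hea_path, "r") as f:
            header = f.read().splitlines()
        # First header line: record name, number of leads, sampling frequency, samples per lead.
        _, n_leads, fs, n_samples = header[0].split()[:4]
        return signal, int(n_leads), int(fs), int(n_samples), header

    # signal, n_leads, fs, n_samples, header = load_recording("A0001.mat", "A0001.hea")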

    Each ECG recording has one or more labels describing different types of abnormalities, given as SNOMED-CT codes. The full list of diagnoses for the challenge has been posted here as a 3-column CSV file: long-form description, corresponding SNOMED-CT code, abbreviation. Although these descriptions apply to all training data, there may be fewer classes in the test data, and in different proportions. However, every class in the test data will be represented in the training data.

  17. s

    Dataset for "Skyrmion states in thin confined polygonal nanostructures"

    • eprints.soton.ac.uk
    • data.niaid.nih.gov
    • +1more
    Updated Nov 27, 2017
    Cite
    Hovorka, Ondrej; Albert, Maximilian; Wang, Weiwei; Kluyver, Thomas; Carey, Rebecca; Fangohr, Hans; Pepper, Ryan Alexander; Vousden, Mark; Beg, Marijan; Cortes-Ortuno, David; Bisotti, Marc-Antonio (2017). Dataset for "Skyrmion states in thin confined polygonal nanostructures" [Dataset]. http://doi.org/10.5281/zenodo.1066792
    Explore at:
    Dataset updated
    Nov 27, 2017
    Dataset provided by
    Zenodo
    Authors
    Hovorka, Ondrej; Albert, Maximilian; Wang, Weiwei; Kluyver, Thomas; Carey, Rebecca; Fangohr, Hans; Pepper, Ryan Alexander; Vousden, Mark; Beg, Marijan; Cortes-Ortuno, David; Bisotti, Marc-Antonio
    Description

    This dataset provides micromagnetic simulation data collected from a series of computational experiments on the effects of polygonal system shape on the energy of different magnetic states in FeGe. The data here form the results of the study 'Skyrmion states in thin confined polygonal nanostructures'. The dataset is split into several directories.

    Data

    square-samples and triangle-samples: These directories contain final-state 'relaxed' magnetization fields for square and triangle samples respectively. The files within are organised into directories such that a sample of side length d = 40nm which was subjected to an applied field of 500mT is labelled d40b500. Within each directory are twelve VTK unstructured grid format files (with file extension ".vtu"). These can be viewed in a variety of programmes; as of the time of writing we recommend either ParaView or MayaVi. The twelve files correspond to twelve simulations for each sample, one for each of the twelve states from which the sample was relaxed. These are described in the paper which this dataset accompanies, but we note the labels are '0', '1', '2', '3', '4', 'h', 'u', 'r1', 'r2', 'r3', 'h2', 'h3', where:

    - 0-4 are incomplete to overcomplete skyrmions
    - h, h2 and h3 are helical states with different periodicities
    - r1-r3 are different random states
    - u is the uniform magnetisation

    The vtu files are labelled according to parameters used in the simulation. For example, a file labelled '160_10_3_0_u_wd000000.vtu' encodes that:

    - The simulation was of a sample with side length 160nm.
    - The simulation was of a sample of thickness 10nm.
    - The maximum length of an edge in the finite element mesh of the sample was 3nm.
    - The system was relaxed from the 'u' state.
    - 'wd' encodes that the simulation was performed with a full demagnetizing calculation.

    square-npys and triangle-npys: These directories contain computed information about each of the final states stored in square-samples and triangle-samples. This information is stored in NumPy npz files, and can be read in Python straightforwardly using the function numpy.load (a short reading sketch is given at the end of this description). Within each npz file there are 8 arrays, each with 12 elements:

    - 'E' - the total energy of the relaxed state.
    - 'E_exchange' - the exchange energy of the relaxed state.
    - 'E_demag' - the demagnetizing energy of the relaxed state.
    - 'E_dmi' - the Dzyaloshinskii-Moriya energy of the relaxed state.
    - 'E_zeeman' - the Zeeman energy of the relaxed state.
    - 'S' - the calculated skyrmion number of the relaxed state.
    - 'S_abs' - the calculated absolute skyrmion number; see the paper for calculation details.
    - 'm_av' - the computed normalised average magnetisation in the x, y and z directions for the relaxed state.

    The twelve elements correspond to the aforementioned twelve initial states, in the order given above.

    square-classified and triangle-classified: These directories contain a labelled dataset which gives details about the final state of each simulation. The files are stored as plain text and are labelled with the following structure (the meanings of which are defined in the paper which this dataset accompanies):

    - iSk - an incomplete skyrmion
    - Sk, or a number n followed by Sk - n skyrmions in the state
    - He - a helical state
    - Target - a target state

    The files contain the names of png files which are generated from the vtu files, in the format 'd_165b_350_2.png'. This example, if found in the 'Sk.txt' file, means that the sample which was 165nm in side length and which was relaxed under a field of 350mT from initial state 2 was found at equilibrium in a skyrmion state.

    Figures

    square-pngs and triangle-pngs: These directories contain pngs generated from the vtu files. These are included for convenience as they take several hours to generate. Each directory contains three subdirectories:

    - all-states: the simulation results from all samples, in the format 'd_165b_350_2.png', meaning the image is that of the 165nm side length sample relaxed under a 350mT field from initial state 2.
    - ground-state: the images which correspond to the lowest energy state found across all of the initial states. These are labelled as 'd_180b_50.png', such that the image in this file is the lowest energy state found from all twelve simulations of the 180nm side length sample under a 50mT field.
    - uniform-state: the images which correspond to the states relaxed only from the uniform state. These are labelled such that an image labelled 'd_55b_100.png' is the state found from relaxing a 55nm sample under a 100mT applied field.

    phase-diagrams: These are the generated phase diagrams which are found in the paper.

    scripts: This folder contains Python scripts which generate the png files mentioned above, and also the phase diagram figures for the paper this dataset accompanies. The scripts are labelled descriptively with what they do - for example, 'triangle-generate-png-all-states.py' loads vtu files and generates the png files. The exception is 'render.py', which provides functions used across multiple scripts. These scripts can be modified; for example, the function 'export_vector_field' has many options which can be adjusted to plot different components of the magnetization.

    In order to run the scripts reproducibly, we have provided a Makefile in the root directory which builds each component. To reproduce the figures yourself on a Linux system, ParaView must be installed. The Makefile has been tested on Ubuntu 16.04 with ParaView 5.0.1. In addition, a number of Python dependencies must be installed:

    - scipy >= 0.19.1
    - numpy >= 1.11.0
    - matplotlib == 1.5.2
    - pillow >= 3.1.2

    We have included a requirements.txt file which specifies these dependencies; they can be installed by running 'pip install -r requirements.txt' from the directory. Once all dependencies are installed, simply run the command 'make' from the shell to build the Docker image and generate the figures. Note the scripts take a long time to run; at the time of writing the runtime is on the order of several hours on a high-specification desktop machine. For convenience, we have therefore included the generated figures within the repository (as noted above). Note that for the versions used in the paper, adjustments were made after the generation of the figures (e.g. to add images of states within the metastability figure and to overlay boundaries in the phase diagrams). If you want to reproduce only the phase diagrams, and not the pngs, the command 'make phase-diagrams' will do so. This is the smallest part of the figure reproduction and takes around 5 minutes on a high-specification desktop.
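    As referenced above, a minimal sketch of reading one of the npz summary files (the exact file name inside square-npys is an assumption for illustration):

    import numpy as np

    data = np.load("square-npys/d40b500.npz")  # hypothetical name following the d{length}b{field} convention
    labels = ["0", "1", "2", "3", "4", "h", "u", "r1", "r2", "r3", "h2", "h3"]
    for i, label in enumerate(labels):
        # Total energy and skyrmion number of the state relaxed from each initial configuration.
        print(label, data["E"][i], data["S"][i])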

  18. 4

    Data from: Data and scripts underlying the publication: Quantifying the...

    • data.4tu.nl
    zip
    Updated May 28, 2025
    Cite
    Max van Mulken; J.A.J. (Jasper) Eikelboom (2025). Data and scripts underlying the publication: Quantifying the Spatial Scales of Animal Clusters Using Density Surfaces [Dataset]. http://doi.org/10.4121/61be5dd9-7880-48dc-bacf-36afbc3033ee.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Max van Mulken; J.A.J. (Jasper) Eikelboom
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    2014
    Area covered
    Description

    Supplementary scripts to the publication "Quantifying the spatial scales of animal clustering using Density Surfaces"


    We implement a method to quantify the degree of clustering of point-location data at different spatial scales, which uses Kernel Density Estimation to construct a density function from the underlying point-location data. We build upon this method to automatically detect cluster diameters using smoothing kernels that better represent the perception neighbourhood of animals. More details can be found in the manuscript.


    These scripts construct the artificial data sets and reproduce the results shown in the figures in the main text of the manuscript.


    data_generator.py

    This file contains the functions to construct the artificial data sets, as well as visualization tools to plot the point sets.

    Running the main() function:

    1. constructs all artificial data sets

    2. creates visualizations of all generated and real-life datasets, saves them as .pdf files, and shows them on-screen


    metric_calculator.py

    This file contains the functions to calculate the metric described in the manuscript, as well as to compute Ripley's K function and the Radial Distribution Function.

    Running the main() function:

    1. generates the metric functions for all artificial and real-life data sets

    2. creates visualizations of all generated metric functions, saves them as .pdf files, and shows them on-screen

    3. prints the found relevant spatial scales, and their metric values, in the terminal


    elephant.pickle

    This file contains the real-world dataset of elephant locations to be used in metric_calculator.py

    The original data was collected in March 2014 in the Tsavo National Parks, Kenya.

    We use a subset of the original data set, consisting of location data of 24 elephants obtained from an aerial image that was manually taken by human observers upon spotting the animals.

    The aerial image was manually processed into spatial data by placing a point on the approximate centre point of each animal in the image, and projected onto a 100x100 xy-plane.

    The data is serialized and de-serialized using the native Python package "pickle". The data format used by pickle is Python-specific.
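    A minimal sketch of inspecting the elephant data (the exact structure of the pickled object is an assumption; see metric_calculator.py for the intended usage):

    import pickle

    # Load the 24 elephant locations projected onto a 100x100 xy-plane.
    with open("elephant.pickle", "rb") as f:
        points = pickle.load(f)

    print(type(points), len(points))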


    To perform the experiments:

    1. Ensure you have a functioning Python3 installation.

    2. Install the required packages using pip:

    - numpy

    - matplotlib

    - scipy

    - scikit-learn

    3. Run the main() function in data_generator.py to generate the artificial datasets

    4. Run the main() function in metric_calculator.py to generate the metric functions and figures


  19. d

    CLM AWRA HRVs Uncertainty Analysis

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 19, 2019
    + more versions
    Cite
    Bioregional Assessment Program (2019). CLM AWRA HRVs Uncertainty Analysis [Dataset]. https://data.gov.au/data/dataset/e51a513d-fde7-44ba-830c-07563a7b2402
    Explore at:
    Dataset updated
    Nov 19, 2019
    Dataset provided by
    Bioregional Assessment Program
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).

    Dataset History

    File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts were written by the BA modelling team to read, combine and analyse the source datasets CLM AWRA model, CLM groundwater model V1 and CLM16swg Surface water gauging station data within the Clarence Moreton Basin, as detailed below, in order to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).

    R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:

    CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions

    CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)

    CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations

    Python script CLM_collate_DoE_Predictions.py collates that information into following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):

    CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)

    CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available

    CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)

    CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV

    R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).

    The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.
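    A minimal sketch of the kind of selection CLM-2615-001-top10.py performs; the column name and the assumption that lower objective-function values indicate better fits are illustrative, not taken from the script:

    import pandas as pd

    oo = pd.read_csv("CLM_AWRA_HRV_oo.csv")
    threshold = oo["objective_function"].quantile(0.10)      # hypothetical column name
    behavioural = oo[oo["objective_function"] <= threshold]  # keep the best-scoring 10% of simulations
    behavioural.to_csv("CLM_AWRA_HRV_behavioural.csv", index=False)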

    Dataset Citation

    Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.

    Dataset Ancestors

  20. Z

    Spiking Seizure Classification Dataset

    • data.niaid.nih.gov
    Updated Jan 13, 2025
    + more versions
    Cite
    Gallou, Olympia (2025). Spiking Seizure Classification Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10800793
    Explore at:
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Matthew, Cook
    Gallou, Olympia
    Bartels, Jim
    Sarnthein, Johannes
    Indiveri, Giacomo
    GHOSH, SAPTARSHI
    Ito, Hiroyuki
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset for event encoded analog EEG signals for detection of Epileptic seizures

    This dataset contains events that are encoded from the analog signals recorded during pre-surgical evaluations of patients at the Sleep-Wake-Epilepsy-Center (SWEC) of the University Department of Neurology at the Inselspital Bern. The analog signals are sourced from the SWEC-ETHZ iEEG Database

    This database contains event streams for 10 seizures recorded from 5 patients, generated by the DYnamic Neuromorphic Asynchronous Processor (DYNAP-SE2) to demonstrate a proof-of-concept of encoding seizures with network synchronization. The pipeline consists of two parts: (I) an Analog Front End (AFE) and (II) an SNN termed the "Non-Local Non-Global" (NLNG) network.

    In the first part of the pipeline, the digitally recorded signals from the SWEC-ETHZ iEEG Database are converted to analog signals via an 18-bit Digital-to-Analog Converter (DAC) and then amplified and encoded into events by an Asynchronous Delta Modulator (ADM). In the second part, the encoded event streams are fed into the SNN, which extracts features of the epileptic seizure by capturing the partially synchronous patterns intrinsic to the seizure dynamics.
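    As a rough illustration of the encoding idea only (the actual AFE is an analog circuit on the DYNAP-SE2, and the threshold here is an arbitrary assumption), a delta-modulator-style encoder can be sketched as:

    import numpy as np

    def adm_encode(signal, threshold):
        # Emit an UP (+1) or DOWN (-1) event whenever the signal moves more than
        # `threshold` away from the value at which the previous event was emitted.
        events = []
        ref = signal[0]
        for i, x in enumerate(signal[1:], start=1):
            if x - ref >= threshold:
                events.append((i, +1))
                ref = x
            elif ref - x >= threshold:
                events.append((i, -1))
                ref = x
        return events

    # Example: encode a sine wave with an arbitrary threshold of 0.1
    events = adm_encode(np.sin(np.linspace(0, 10, 1000)), threshold=0.1)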

    Details about the neuromorphic processing pipeline and the encoding process are included in a manuscript under review. The preprint is available in bioRxiv

    Installation

    The installation requires Python >= 3.x and conda (or py-venv). Users can then install the requirements inside a conda environment using

    conda env create -f requirements.txt -n sez

    Once created, the conda environment can be activated with 'conda activate sez'.

    The main files in the database are described in the hierarchy below.

    EventSezDataset/
    ├─ data/
    │ ├─ PxSx/
    │ │ ├─ Patx_Sz_x_CHx.csv
    ├─ LSVM_Params/
    │ ├─ opt_svm_params/
    │ ├─ pat_x_features_SYNCH/
    ├─ fig_gen.py
    ├─ sync_mat_gen.py
    ├─ SeizDetection_FR.py
    ├─ SeizDetection_SYNCH.py
    ├─ support.py
    ├─ run.sh
    ├─ requirements.txt

    where x represents the Patient ID and the Seizure ID respectively.

    requirements.txt: This file lists the requirements for the execution of the Python code.

    fig_gen.py: This file plots the analog signals and the associated AFE and NLNG event streams. The code is run with 'python fig_gen.py 1 1 13', which plots patient 1, seizure 1, channel 13 of the recording.

    sync_mat_gen.py: This file defines the function for plotting the synchronization matrices emerging from the ADM and the NLNG spikes with either a linear or a logarithmic colorbar. The code is run with 'python sync_mat_gen.py 1 1' or 'python sync_mat_gen.py 1 1 log'; this generates four figures for the pre-seizure, first half of seizure, second half of seizure, and post-seizure time periods, here for patient 1 and seizure 1. The third argument can either be left blank or set to lin or log, for the respective colorbar scales. The time is the signal time as described in the table below.

    run.sh: A simple Linux script to run the above code for all patients and seizures.

    SeizDetection_FR.py: This file runs the LSVM on the ADM and NLNG spikes, using the firing rate (FR) as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/opt_svm_params/ folder). Users can also use the code to train the LSVM with different parameters.

    SeizDetection_SYNCH.py: This file runs the LSVM on the kernelized ADM and NLNG spikes, using the flattened SYNCH matrices as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/pat_x_features_SYNCH/ folder). Users can also use the code to train the LSVM with different parameters.

    LSVM_Params: Folder containing LSVM features with different parameter combinations.

    support.py: This file contains the necessary functions.

    data/P1S1/: This folder, for example, contains the event streams for all channels for seizure 1 of patient 1.

    Pat1_Sz_1_CH1.csv: This file contains the spikes of the AFE and the NLNG layers with the following tabular format (which can be extracted by the fig_gen.py)

    Comments

    # SStart: 180      // Start of the seizure in signal time
    # SEnd: 276.0      // End of the seizure in signal time
    # Pid: 2           // The patient ID as per the SWEC-ETHZ iEEG Database
    # Sid: 1           // The Seizure ID as per the SWEC-ETHZ iEEG Database
    # Channel_No: 1    // The channel number

    SYS_time    - The time from the interface FPGA
    signal_time - The time of the signal as per the SWEC-ETHZ iEEG Database
    dac_value   - The value of the analog signal as recorded in the SWEC-ETHZ iEEG Database
    ADMspikes   - The event stream output by the AFE in boolean format; True represents a spike
    NLNGspikes  - The spike stream output by the SNN in boolean format; True represents a spike
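    A minimal sketch of loading one channel file with pandas (the comma-delimited layout and the '#' prefix on the comment lines are assumptions based on the description above):

    import pandas as pd

    df = pd.read_csv("data/P1S1/Pat1_Sz_1_CH1.csv", comment="#")
    adm_events = df[df["ADMspikes"].astype(str) == "True"]    # rows where the AFE emitted an event
    nlng_events = df[df["NLNGspikes"].astype(str) == "True"]  # rows where the NLNG network spiked
    print(len(adm_events), len(nlng_events))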
