Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, which can be opened in Python with the `h5py` library.
The dataset includes the following files:
Each file is structured to support the training and evaluation of seismic denoising models.
The HDF5 files named noise contain two main datasets:
Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:
To load these files in a Python environment, use the following approach:
```python
import h5py
import numpy as np

# Open the HDF5 file in read mode
with h5py.File('train_noise.hdf5', 'r') as file:
    # Print all the main keys in the file
    print("Keys in the HDF5 file:", list(file.keys()))

    if 'traces' in file:
        # Access the dataset
        data = file['traces'][:10]  # Load the first 10 traces

    if 'metadata' in file:
        # Access the dataset
        trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
```
Ensure that the path to the file is correctly specified relative to your Python script.
To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:
```bash
pip install numpy
pip install h5py
```
https://spdx.org/licenses/CC0-1.0.html
Offline reinforcement learning (RL) is a promising direction that allows RL agents to be pre-trained from large datasets, avoiding the recurrence of expensive data collection. To advance the field, it is crucial to generate large-scale datasets. Compositional RL is particularly appealing for generating such datasets, since 1) it permits creating many tasks from few components, and 2) the task structure may enable trained agents to solve new tasks by combining relevant learned components. This submission provides four offline RL datasets for simulated robotic manipulation created using the 256 tasks from CompoSuite (Mendez et al., 2022). In every task in CompoSuite, a robot arm is used to manipulate an object to achieve an objective, all while trying to avoid an obstacle. There are four components for each of these four axes, which can be combined arbitrarily, leading to a total of 256 tasks. The component choices are:

* Robot: IIWA, Jaco, Kinova3, Panda
* Object: Hollow box, box, dumbbell, plate
* Objective: Push, pick and place, put in shelf, put in trashcan
* Obstacle: None, wall between robot and object, wall between goal and object, door between goal and object

The four included datasets are collected using separate agents, each trained to a different degree of performance, and each dataset consists of 256 million transitions. The degrees of performance are expert data, medium data, warmstart data and replay data:

* Expert dataset: Transitions from an expert agent that was trained to achieve 90% success on every task.
* Medium dataset: Transitions from a medium agent that was trained to achieve 30% success on every task.
* Warmstart dataset: Transitions from a soft actor-critic agent trained for a fixed duration of one million steps.
* Medium-replay-subsampled dataset: Transitions that were stored during the training of a medium agent up to 30% success.

These datasets are intended for the combined study of compositional generalization and offline reinforcement learning.

Methods

The datasets were collected using several deep reinforcement learning agents trained to the various degrees of performance described above on the CompoSuite benchmark (https://github.com/Lifelong-ML/CompoSuite), which builds on top of robosuite (https://github.com/ARISE-Initiative/robosuite) and uses the MuJoCo simulator (https://github.com/deepmind/mujoco). During reinforcement learning training, we stored the data collected by each agent in a separate buffer for post-processing. Then, after training, to collect the expert and medium datasets, we ran the trained agents for 2000 trajectories of length 500 online in the CompoSuite benchmark and stored the trajectories. These add up to a total of 1 million state-transition tuples per task, totalling 256 million datapoints per dataset. The warmstart and medium-replay-subsampled datasets contain trajectories from the stored training buffers of the SAC agent trained for a fixed duration and of the medium agent, respectively. For the medium-replay-subsampled data, we uniformly sample trajectories from the training buffer until we reach more than 1 million transitions. Since some of the tasks have termination conditions, some of these trajectories are truncated and not of length 500. This sometimes results in a number of sampled transitions larger than 1 million. Therefore, after sub-sampling, we artificially truncate the last trajectory and place a timeout at the final position.
This can in some rare cases lead to one incorrect trajectory if the datasets are used for finite-horizon experimentation. However, this truncation is required to ensure consistent dataset sizes, easy data readability, and compatibility with other standard code implementations. The four datasets are split into four tar.gz folders each, yielding a total of 16 compressed folders. Every sub-folder contains all the tasks for one of the four robot arms for that dataset. In other words, every tar.gz folder contains a total of 64 tasks using the same robot arm, and four tar.gz files form a full dataset. This is done to enable people to download only part of the dataset in case they do not need all 256 tasks. For every task, the data is stored in a separate hdf5 file, allowing for the usage of arbitrary task combinations and mixing of data qualities across the four datasets. Every task is contained in a folder that is named after the CompoSuite elements it uses. In other words, every task is represented as a folder named
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
4D-STEM data frequently requires a number of calibrations in order to make accurate measurements: for instance, it can be essential to measure and correct for diffraction shifts, account for ellipticity in the diffraction patterns, or determine the rotational offset between the real and diffraction planes.
We've prepared a simulated 4D-STEM dataset which includes diffraction shifting, elliptical distortion, and an r-space/k-space rotational offset. Two HDF5 files contain the simulated data for two different electron probes: a standard probe, using a circular probe-forming aperture, and a 'bullseye' probe, using a patterned aperture. Each HDF5 file contains the following data objects (a quick inspection sketch follows the list):
(a) the 'experimental' 4D-STEM scan of a strained single-crystal gold nanoparticle (size: (100,84,250,250) )
(b) a 4D-STEM scan of a calibration sample of polycrystalline gold (size: (100,84,250,250) )
(c) a stack of diffraction images of the electron probe over vacuum (size: (250,250,20) )
(d) a single image of the electron probe over the sample and far from focus, such that the CBED forms a shadow image (size: (512,512) )
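For a first look at the contents, the files can be inspected with h5py. A minimal sketch, assuming only that the files follow the structure described above; the file name and internal layout here are hypothetical:

```python
import h5py

# Hypothetical file name; substitute the actual HDF5 file for either probe.
with h5py.File('4dstem_bullseye.h5', 'r') as f:
    # Walk the file and print the name and shape of every group/dataset
    f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))
```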
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All raw data and metadata of ptychography scans are assembled into HDF5 files. These include acquired frames of X-ray pixel array detectors, parameters, component positions, and settings of the instruments. The data and metadata curation follows the NeXus-NXsas convention as closely as practical.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a curated HDF5 file for a subset of the SPICE 1 OpenFF dataset (Open Force Field Initiative default level of theory) designed to be compatible with modelforge, an infrastructure to implement and train neural network potentials (NNPs). This subset is limited to molecules containing any of the following 7 elements: H, C, N, O, F, Cl, and S. This datafile includes 1000 total conformers for 100 unique molecules.
Changes: In this version, for each record `total_charge` is stored as an array of shape (N_conformers, 1), i.e., one value per conformer; previously this was a single value per record, since the charge state does not change between conformers.
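As an illustration of the change described above, a minimal sketch for checking the per-conformer charge array of one record; the file name is hypothetical, and the assumption that records are top-level HDF5 groups is ours:

```python
import h5py

with h5py.File('spice_openff_subset.hdf5', 'r') as f:
    record = list(f.keys())[0]                    # first of the 100 molecules
    total_charge = f[record]['total_charge'][()]  # expected shape: (N_conformers, 1)
    print(record, total_charge.shape)
```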
When applicable, the units of properties are provided in the datafile, encoded as strings compatible with the openff-units package. For more information about the structure of the data file, please see the following:
This curated dataset was generated using the modelforge software at commit
Small-molecule/Protein Interaction Chemical Energies (SPICE).
The SPICE dataset contains 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the B3LYP-D3BJ/DZVP level of theory using Psi4 1.4.1.
This is the default theory used for force field development by the Open Force Field Initiative.
This includes the following collections from the MolSSI qcarchive (these are also included in the standard SPICE 1 dataset):
This does not include the following collections (which are part of the standard SPICE 1 dataset):
Original SPICE 1 publication:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents, and medical students. It is used as the test set in the paper: "Automatic diagnosis of the 12-lead ECG using a deep neural network". https://www.nature.com/articles/s41467-020-15432-4.
It contains annotations for 6 different ECG abnormalities: 1st degree AV block (1dAVb); right bundle branch block (RBBB); left bundle branch block (LBBB); sinus bradycardia (SB); atrial fibrillation (AF); and sinus tachycardia (ST).
Companion python scripts are available in: https://github.com/antonior92/automatic-ecg-diagnosis
Citation
Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
Bibtex:
```
@article{ribeiro_automatic_2020,
  title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
  author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
  year = {2020},
  volume = {11},
  pages = {1760},
  doi = {https://doi.org/10.1038/s41467-020-15432-4},
  journal = {Nature Communications},
  number = {1}
}
```
ecg_tracings.hdf5: The HDF5 file containing a single dataset named tracings. This dataset is a (827, 4096, 12) tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension corresponds to the 12 different leads of the ECG exams, in the following order: {DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}. The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all the same size (4096 samples), we pad them with zeros on both sides. For instance, a 7-second ECG signal with 2800 samples receives 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset. All signals are represented as floating-point numbers at the scale 1e-4V: so they should be multiplied by 1000 in order to obtain the signals in V.
In Python, one can read this file using the following sequence:
```python
import h5py
import numpy as np

# Open the HDF5 file and load the full (827, 4096, 12) tensor of tracings
with h5py.File("ecg_tracings.hdf5", "r") as f:
    x = np.array(f['tracings'])
```
attributes.csv: contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th line corresponds to the i-th tracing in ecg_tracings.hdf5.
annotations/: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header), and the i-th line corresponds to the i-th tracing in ecg_tracings.hdf5 in all csv files. The csv files all have 6 columns, 1dAVb, RBBB, LBBB, SB, AF, ST, corresponding to whether the annotator detected the abnormality in the ECG (=1) or not (=0). A comparison sketch using these files is given after the list.
cardiologist[1,2].csv: contain annotations from two different cardiologists.
gold_standard.csv: gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agreed, the common diagnosis was considered the gold standard. In cases of disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
dnn.csv: predictions from the deep neural network described in the paper. The threshold is set in such a way that it maximizes the F1 score.
cardiology_residents.csv: annotations from two 4th year cardiology residents (each annotated half of the dataset).
emergency_residents.csv: annotations from two 3rd year emergency residents (each annotated half of the dataset).
medical_students.csv: annotations from two 5th year medical students (each annotated half of the dataset).
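As a usage sketch, the annotation files can be compared directly with pandas; the column names follow the description above, and paths are assumed relative to the dataset root:

```python
import pandas as pd

# Fraction of exams where the DNN prediction matches the gold standard, per abnormality
gold = pd.read_csv('annotations/gold_standard.csv')
dnn = pd.read_csv('annotations/dnn.csv')
labels = ['1dAVb', 'RBBB', 'LBBB', 'SB', 'AF', 'ST']
print((gold[labels] == dnn[labels]).mean())
```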
The Custom Silicone Mask Attack Dataset (CSMAD) contains presentation attacks made using six custom-made silicone masks. Each mask cost about USD 4000. The dataset is designed for face presentation attack detection experiments.
The Custom Silicone Mask Attack Dataset (CSMAD) has been collected at the Idiap Research Institute. It is intended for face presentation attack detection experiments, where the presentation attacks have been mounted using a custom-made silicone mask of the person (or identity) being attacked.
The dataset contains videos of face-presentations, as well as a set of files specifying the experimental protocol corresponding to the experiments presented in the publication below.
Reference
If you publish results using this dataset, please cite the following publication.
Sushil Bhattacharjee, Amir Mohammadi and Sebastien Marcel: "Spoofing Deep Face Recognition With Custom Silicone Masks." in Proceedings of International Conference on Biometrics: Theory, Applications, and Systems (BTAS), 2018.
10.1109/BTAS.2018.8698550
http://publications.idiap.ch/index.php/publications/show/3887
Data Collection
Face-biometric data has been collected from 14 subjects to create this dataset. Subjects participating in this data-collection have played three roles: targets, attackers, and bona-fide clients. The subjects represented in the dataset are referred to here with letter-codes: A .. N. The subjects A..F have also been targets. That is, face-data for these six subjects has been used to construct their corresponding flexible masks (made of silicone). These masks have been made by Nimba Creations Ltd., a special effects company.
Bona fide presentations have been recorded for all subjects A..N. Attack presentations (presentations where the subject wears one of 6 masks) have been recorded for all six targets, made by different subjects. That is, each target has been attacked several times, each time by a different attacker wearing the mask in question. This is one way of increasing the variability in the dataset. Another way we have augmented the variability of the dataset is by capturing presentations under different illumination conditions. Presentations have been captured in four different lighting conditions:
All presentations have been captured with a green uniform background. See the paper mentioned above for more details of the data-collection process.
Dataset Structure
The dataset is organized in three subdirectories: ‘attack’, ‘bonafide’, ‘protocols’. The two directories: ‘attack’ and ‘bonafide’ contain presentation-videos and still images for attacks and bona fide presentations, respectively. The folder ‘protocols’ contains text files specifying the experimental protocol for vulnerability analysis of face-recognition (FR) systems.
The number of data-files per category is as follows:
The folder ‘attack/WEAR’ contains videos where the attack has been made by a person (attacker) wearing the mask of the target being attacked. The ‘attack/STAND’ folder contains videos where the attack has been made using the target’s mask mounted on an appropriate stand.
Video File Format
The video files for the face-presentations are in HDF5 format (with file-extension ‘.h5’). The folder structure of the HDF5 file is shown in Figure 1. Each file contains data collected using two cameras:
As shown in Figure 1, frames from the different channels (color, infrared, depth, thermal) from the two cameras are stored in separate directory-hierarchies in the HDF5 file. Each file represents a video of approximately 10 seconds, or roughly 300 frames.
In the HDF5 file, the directory for SR300 also contains a subdirectory named ‘aligned_color_to_depth’. This folder contains post-processed data, where the frames of the depth channel have been aligned with those of the color channel based on the time-stamps of the frames.
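A minimal sketch for inspecting one presentation file, assuming only the structure described above; the file name is hypothetical, and the exact group paths vary per camera:

```python
import h5py

def summarize(name, obj):
    # Report each dataset (e.g. a frame or frame stack) and its shape
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape)

with h5py.File('presentation_video.h5', 'r') as f:
    f.visititems(summarize)
```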
Experimental Protocol
The ‘protocols’ folder contains text files that specify the protocols for vulnerability analysis experiments reported in the paper mentioned above. Please see the README file in the protocols folder for details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 1.0, April 2018.
Vincent Lostanlen (1, 2, 3), Justin Salamon (2, 3), Andrew Farnsworth (1), Steve Kelling (1), and Juan Pablo Bello (2, 3).
(1): Cornell Lab of Ornithology (CLO) (2): Center for Urban Science and Progress, New York University (3): Music and Audio Research Lab, New York University
The BirdVox-70k dataset contains 70k half-second clips from the 6 audio recordings in the BirdVox-full-night dataset, each about ten hours in duration. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA during fall 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10.
Andrew Farnsworth used the Raven software to pinpoint every avian flight call in time and frequency. He found 35402 flight calls in total. He estimates that about 25 different species of passerines (thrushes, warblers, and sparrows) are present in these recordings. Species are not labeled in BirdVox-70k, but it is possible to tell apart thrushes from warblers and sparrows by looking at the center frequencies of their calls. The annotation process took 102 hours.
The dataset can be used, among other things, for the research, development, and testing of bioacoustic classification models, including the reproduction of the results reported in [1].
For details on the hardware of ROBIN recording units, we refer the reader to [2].
[1] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection. Proc. IEEE ICASSP, 2018.
[2] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.
@inproceedings{lostanlen2018icassp, title = {BirdVox-full-night: a dataset and benchmark for avian flight call detection}, author = {Lostanlen, Vincent and Salamon, Justin and Farnsworth, Andrew and Kelling, Steve and Bello, Juan Pablo}, booktitle = {Proc. IEEE ICASSP}, year = {2018}, published = {IEEE}, venue = {Calgary, Canada}, month = {April}, }
BirdVox-70k contains the recordings as HDF5 files, sampled at 24 kHz, with a single channel (mono). Each HDF5 file corresponds to a different sensor. The name of the HDF5 dataset in each file is "waveforms".
Contrary to BirdVox-full-night, BirdVox-70k is not shipped with a metadata file. Rather, the metadata is included in the keys of the elements in the HDF5 files themselves, whose values are the waveforms.
An example of BirdVox-70k key is:
unitID_TIMESTAMP_FREQ_LABEL
where
ID is the identifier of the unit (01, 02, 03, 05, 07, or 10)
TIMESTAMP is the timestamp of the center of the clip in the BirdVox-full-night recording. This timestamp is measured in samples at 24 kHz. It is accurate to about 10 ms.
FREQ is the center frequency of the flight call, measured in Hertz. It is accurate to about 1 kHz. When the clip is negative, i.e. does not contain any flight call, it is set equal to zero by convention.
LABEL is the label of the clip, positive (1) or negative (0).
Example:
unit01_085256784_03636_1
is a positive clip in unit 01, with timestamp 085256784 (3552.37 seconds after dividing by the sample rate 24000), center frequency 3636 Hz.
Another example:
unit05_284775340_00000_0
is a negative clip in unit 05, with timestamp 284775340 (11865.64 seconds).
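A minimal sketch for reading clips and parsing their keys, under the assumptions that each sensor's HDF5 file is named after its unit and that "waveforms" is a group keyed by clip name (the file name is hypothetical):

```python
import h5py

with h5py.File('BirdVox-70k_unit01.hdf5', 'r') as f:
    for key in list(f['waveforms'])[:5]:
        clip = f['waveforms'][key][()]          # half-second mono waveform at 24 kHz
        unit, timestamp, freq, label = key.split('_')
        print(unit, int(timestamp) / 24000.0, int(freq), int(label), clip.shape)
```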
The approximate GPS coordinates of the sensors (latitudes and longitudes rounded to 2 decimal places) and UTC timestamps corresponding to the start of the recording for each sensor are included as CSV files in the main directory.
When BirdVox-70k is used for academic research, we would highly appreciate it if scientific publications of works partly based on this dataset cite the following publication:
V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
The creation of this dataset was supported by NSF grants 1125098 (BIRDCAST) and 1633259 (BIRDVOX), a Google Faculty Award, the Leon Levy Foundation, and two anonymous donors.
Dataset created by Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello.
The BirdVox-70k dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/
The dataset and its contents are made available on an "as is" basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, Cornell Lab of Ornithology is not liable for, and expressly excludes all liability for, loss or damage however and whenever caused to anyone by any use of the BirdVox-70k dataset or any part of it.
Please help us improve BirdVox-70k by sending your feedback to: vincent.lostanlen@gmail.com and af27@cornell.edu
In case of a problem, please include as many details as possible.
Jessie Barry, Ian Davies, Tom Fredericks, Jeff Gerbracht, Sara Keen, Holger Klinck, Anne Klingensmith, Ray Mack, Peter Marchetto, Ed Moore, Matt Robbins, Ken Rosenberg, and Chris Tessaglia-Hymes.
We acknowledge that the land on which the data was collected is the unceded territory of the Cayuga nation, which is part of the Haudenosaunee (Iroquois) confederacy.
This data set consists of 59 wideband magnetotelluric (MT) stations collected by the U.S. Geological Survey in July and August of 2020 as part of a 1-year project funded by the Energy Resources Program of the U.S. Geological Survey to demonstrate full crustal control on geothermal systems in the Great Basin. Each station had 5 components: 3 orthogonal magnetic induction coils and 2 horizontal orthogonal electric dipoles. Data were collected for an average of 18 hours on a repeating schedule of alternating sampling rates of 256 samples/second for 7 hours and 50 minutes and 4096 samples/second for 10 minutes. The schedules were set such that each station was recording the same schedule to allow for remote reference processing. Data were processed with a bounded-influence robust remote reference processing scheme (BIRRP v5.2.1, Chave and Thomson, 2004). Data quality is good for periods of 0.007-2048 s, with some noise in the higher periods and less robust estimates at the longer periods. Files included in this publication include measured electric- and magnetic-field time series (.h5 files) as well as estimated impedance and vertical-magnetic-field transfer functions (.edi files). An image of the MT response is supplied (.png file), where the impedance tensor is plotted on the top two panels, the induction vectors in the middle panel (up is geographic North), and the phase tensor in the bottom panel (up is geographic North). The real induction vectors point towards strong conductors. Phase tensor ellipses align in the direction of electrical current flow, and warmer colors represent the subsurface becoming more conductive and cooler colors more resistive.
This dataset comprises Distributed Acoustic Sensing (DAS) data collected from the Utah FORGE monitoring well 16B(78)-32 (the producer well) during hydraulic fracture stimulation operations conducted in April 2024. The data were acquired continuously over the stimulation period at a temporal sampling rate of 10,000 Hz (10 kS/s) and a spatial resolution of approximately 3.35 feet (1.02109 meters). The measurements were captured using a Neubrex NBX-S4100 Time Gated Digital DAS interrogator unit connected to a single-mode fiber optic cable, which was permanently installed within the casing string. All recorded channels correspond to downhole segments of the fiber optic cable, from a measured depth (MD) of 5,369.35 feet to 10,352.11 feet. The DAS data reflect raw acoustic energy generated by physical processes within and surrounding the well during stimulation activities at wells 16A(78)-32 and 16B(78)-32. These data have potential applications in analyzing cross-well strain, far-field strain rates (including microseismic activity), induced seismicity, and seismic imaging. Metadata embedded in the attributes of the HDF5 files include detailed information on the measured depths of the channels, interrogation parameters, and other acquisition details. The dataset also includes a recording of a seminar held on September 19, 2024, where Neubrex's Chief Operating Officer presented insights into the data collection, analysis, and preliminary findings. The raw data files, stored in HDF5 format, are organized chronologically according to the recording intervals from April 9 to April 24, 2024, with each file corresponding to a 12-second recording interval.
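As an illustration, the acquisition metadata embedded in the attributes of one raw file can be listed with h5py; the file name here is hypothetical, and the attribute and dataset names depend on the interrogator's output convention:

```python
import h5py

with h5py.File('FORGE_16B_DAS_example.h5', 'r') as f:
    # Print the file-level attributes (measured depths, interrogation parameters, etc.)
    for key, value in f.attrs.items():
        print(key, value)
    # List the top-level objects holding the raw acoustic records
    print(list(f.keys()))
```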
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the MESWA (Middle East and Southwest Asia) seismic model and auxiliary data used in the creation of the model (Rodgers, 2023). MESWA is a three-dimensional model of the seismic properties of the crust and upper mantle of the Middle East and Southwest Asia. The MESWA model is provided in NetCDF format (readable by, for example, xarray; Hoyer & Hamman, 2017) and HDF5 format for viewing with ParaView (Ahrens et al., 2005) and interaction with Salvus (Afanasiev et al., 2019).
Also included are the earthquake source parameters for all 327 Global Centroid Moment Tensor events considered in this study in ASCII text format. Also included are lists of the selected 192 inversion events and 66 validation events in ASCII text format. Lastly, we include a list of all receivers used in the creation and validation of MESWA. This is a simple ASCII file with the event name and receiver name (composed of the network_code and station_code).
The following table provides a listing of the files in the dataset:
File | Description
---- | -----------
MESWA.nc | MESWA model in NetCDF format
MESWA.h5 | MESWA model in HDF5 format, used by Salvus
MESWA.xmdf | Auxiliary file for MESWA.h5, used to import the model into ParaView
events_project.csv | Table of event source parameters for all 327 events considered in the project
inversion_events_192.csv | Table of 192 inversion events (ASCII comma separated value)
validation_events_66.csv | Table of 66 validation events (ASCII comma separated value)
events_receivers_inversion.csv | Table of waveform (event-receiver-channel) data used in the inversion (ASCII comma separated value)
events_receivers_validation.csv | Table of waveform (event-receiver-channel) data used in the validation (ASCII comma separated value)
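As a usage sketch, the NetCDF model file named in the table above can be opened directly with xarray (assuming only the file listing above):

```python
import xarray as xr

# Open the MESWA model and list its coordinates and seismic-property variables
ds = xr.open_dataset('MESWA.nc')
print(ds)
```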
References
Afanasiev, M, C Boehm, M van Driel, L Krischer, M Rietmann, DA May, MG Knepley, and A Fichtner (2019). Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophys. J. Int., 216(3), 1675–1692, doi: 10.1093/gji/ggy469
Ahrens, J., Geveci, B., & Law, C. (2005). Paraview: An end-user tool for large data visualization. The Visualization Handbook, 717(8). https://doi.org/10.1016/b978-012387582-2/50038-1
Hoyer, S., & Hamman, J. (2017). Xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1). https://doi.org/10.5334/jors.148
Rodgers, A. (2023). Adjoint Waveform Tomography for Crustal and Upper Mantle Structure of the Middle East and Southwest Asia for Improved Waveform Simulations Using Openly Available Broadband Data, technical report, LLNL-TR-851939.
Acknowledgements
This project was supported by Lawrence Livermore National Laboratory’s Laboratory Directed Research and Development project 20-ERD-008 and the National Nuclear Security Administration. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-MI-852402
SACLA XFEL experiment 2021/05, Proposal Number: 2021A8026, FeRh magnetism/lattice. All raw data and metadata are assembled into HDF5 files. These include acquired frames of X-ray pixel array detectors, parameters, component positions, and settings of the instruments. The data and metadata were obtained from the SACLA database (for the run numbers relevant to our experiment) with the SACLA data converter.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576
This entry contains the data used to implement the bachelor thesis. It investigates how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.

In the Jupyter Notebook 1_data_retrival.ipynb, the download process of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de) can be found. The downloaded .gml files can also be found in graph_files.zip. These form graphs that represent the relationships of supersecondary structures in the proteins and are the data basis for further analyses. These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; they are added to the database subsequently. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.

For the installation of all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands: pip install h5py==3.12.1, pip install seaborn==0.13.2, and pip install umap-learn==0.5.7.
ML2DGM is the EOS Aura Microwave Limb Sounder (MLS) product containing the minor frame diagnostic quantities on a miscellaneous grid. These include items such as tangent pressure, chi-square describing various fits to the measured radiances, number of radiances used in various retrieval phases, etc. This product contains a second auxiliary file which includes cloud-induced radiances inferred for selected spectral channels. The data version is 5.0. Data coverage is from August 8, 2004 to current. Spatial coverage is near-global (-82 degrees to +82 degrees latitude), with each profile spaced 1.5 degrees or ~165 km along the orbit track (roughly 15 orbits per day). Vertical resolution varies between species and typically ranges from 3 - 6 km. Users of the ML2DGM data product should read the EOS MLS Level 2 Version 5 Quality Document for more information. The data are stored in the version 5 Hierarchical Data Format, or HDF5. Each file contains sets of HDF5 dataset objects (n-dimensional arrays) for each diagnostics measurement. The dataset objects represent data and geolocation fields; included in the file are file attributes and metadata. There are two files per day (MLS-Aura_L2AUX-DGM and MLS-Aura_L2AUX-Cloud).
This dataset comprises the input files and other files required for Advanced Terrestrial Simulator (ATS) simulations at 7 catchments across the continental United States. ATS is an integrated surface-subsurface hydrology model. We include Jupyter notebooks (within the scripts folder) for individual catchments showing information (including data sources, river network, soil, geology, landuse types, etc.) on preparing the machine-readable input files. ATS observation output files are provided in the output folder. Figures and analyses (.xlsx sheets) are also provided. The catchments include: (a) Taylor River Upstream (Colorado); (b) Cossatot River (Arkansas); (c) Panther Creek (Alabama); (d) Little Tennessee River (North Carolina and Georgia); (e) Mayo River (Virginia); (f) Flat Brook (New Jersey); (g) Neversink River headwaters (New York). Readme files inside the directories provide more details. File types include: .xml, .h5, .xlsx, .png, .ipynb, .py, .nc, .txt. All of the file types can be accessed with open source software; details on software requirements are as follows: .xml (any text editor, including Notepad and TextEdit), .h5 (in Python using HDF libraries), .xlsx (WPS Office Spreadsheets, OpenOffice Calc, LibreOffice Calc, Microsoft Office, etc.), .png (any image viewer), .ipynb (Jupyter notebook), .py (any text editor, including Notepad and TextEdit), .nc (using Python or other open source software).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset description
This repository contains the PPMLES (Perturbed-Parameter ensemble of MUST Large-Eddy Simulations) dataset, which corresponds to the main outputs of 200 large-eddy simulations (LES) of microscale pollutant dispersion that replicate the MUST field experiment [Biltoft. 2001, Yee and Biltoft. 2004] for varying meteorological forcing parameters.
The goal of the PPMLES dataset is to provide a comprehensive dataset to better understand the complex interactions between the atmospheric boundary layer (ABL), the urban environment, and pollutant dispersion. It was originally used to assess the impact of the meteorological uncertainty on microscale pollutant prediction and to build a surrogate model that can replace the costly LES model [Lumet et al. 2024b]. The total computational cost of the PPMLES dataset is estimated to be about 6 million core hours.
For each sample of meteorological forcing parameters (inlet wind direction and friction velocity), the AVBP solver code [Schonfeld and Rudgyard. 1999, Gicquel et al. 2011] was used to perform LES at very high spatio-temporal resolution (1e-3 s time step, 30 cm discretization length) to provide a fine representation of the pollutant concentration and wind velocity statistics within the urban-like canopy.
File list
The data is stored in HDF5 files, which can be efficiently processed in Python using the h5py module.
input_parameters.h5: list of the 200 input parameter samples (alpha_inlet, ustar) obtained using the Halton sequence that defines the PPMLES ensemble.
ave_fields.h5: lists of the main field statistics predicted by each of the 200 LES samples over the 200-s reference window [Yee and Biltoft. 2004], including:
c: the time-averaged pollutant concentration in ppmv (dim = (n_samples, n_nodes) = (200, 1878585)),
(u, v, w): the time-averaged wind velocity components in m/s,
crms: the root mean square concentration fluctuations in ppmv,
tke: the turbulent kinetic energy in m^2/s^2,
(uprim_cprim, vprim_cprim, wprim_cprim): the pollutant turbulent transport components
uncertainty.h5: lists of the estimated aleatory uncertainty induced by the internal variability of the LES (variability_#) [Lumet et al. 2024a] for each of the fields in ave_fields.h5. Also includes the stationary bootstrap [Politis and Romano. 1994] parameters (n_replicates, block_length) used to estimate the uncertainty for each field and each sample.
mesh.h5: the tetrahedral mesh on which the fields are discretized, composed of about 1.8 million nodes.
time_series.h5: HDF5 file consisting of 200 groups (Sample_NNN) each containing the time series of the pollutant concentration (c) and wind velocity components (u, v, w) predicted by the LES sample #NNN at 93 locations.
probe_network.dat: provides the location of each of the 93 probes corresponding to the positions of the experimental campaign sensors [Biltoft. 2001].
Code examples
A) Dataset reading
```python
import h5py
import numpy as np

# Load the 200 input parameter samples (alpha_inlet, ustar)
inputf = h5py.File('PPMLES/input_parameters.h5', 'r')
input_parameters = np.array((inputf['alpha_inlet'], inputf['friction_velocity'])).T

# Load the domain mesh node coordinates
meshf = h5py.File('PPMLES/mesh.h5', 'r')
mesh_nodes = np.array((meshf['Nodes']['x'], meshf['Nodes']['y'], meshf['Nodes']['z'])).T

# Load one field statistic and its estimated uncertainty for all 200 samples
var = 'c'  # Can be: 'c', 'u', 'v', 'w', 'crms', 'tke', 'uprim_cprim', 'vprim_cprim', or 'wprim_cprim'
fieldsf = h5py.File('PPMLES/ave_fields.h5', 'r')
fields_list = fieldsf[var]
uncertaintyf = h5py.File('PPMLES/uncertainty_ave_fields.h5', 'r')
uncertainty_list = uncertaintyf[var]

# Load the time series recorded at one probe for every sample
timeseriesf = h5py.File('PPMLES/time_series.h5', 'r')
var = 'c'   # Can be: 'c', 'u', 'v', or 'w'
probe = 32  # Integer between 0 and 92, see probe_network.dat
time_list = []
time_series_list = []
for i in range(200):
    time_list.append(np.array(timeseriesf[f'Sample_{i+1:03}']['time']))
    time_series_list.append(np.array(timeseriesf[f'Sample_{i+1:03}'][var][probe]))
```
B) Interpolation of one-field from the unstructured grid to a new structured grid
```python
import h5py
import numpy as np
from scipy.interpolate import griddata

# Load the time-averaged concentration field of sample #27
fieldsf = h5py.File('PPMLES/ave_fields.h5', 'r')
c = fieldsf['c'][27]

# Load the unstructured mesh node coordinates
meshf = h5py.File('PPMLES/mesh.h5', 'r')
unstructured_nodes = np.array((meshf['Nodes']['x'], meshf['Nodes']['y'], meshf['Nodes']['z'])).T

# Define the target structured grid
x0, y0, z0 = -16.9, -115.7, 0.
lx, ly, lz = 205.5, 232.1, 20.
resolution = 0.75
x_grid, y_grid, z_grid = np.meshgrid(np.linspace(x0, x0 + lx, int(lx/resolution)),
                                     np.linspace(y0, y0 + ly, int(ly/resolution)),
                                     np.linspace(z0, z0 + lz, int(lz/resolution)),
                                     indexing='ij')

# Nearest-neighbour interpolation from the unstructured mesh to the structured grid
c_interpolated = griddata(unstructured_nodes, c,
                          (x_grid.flatten(), y_grid.flatten(), z_grid.flatten()),
                          method='nearest')
```
C) Expression of all time series over the same time window with the same time discretization
```python
import h5py
import numpy as np
from scipy.interpolate import griddata

# Common 200-s time window discretized with a 0.05-s step
common_time = np.arange(0., 200., 0.05)
u_series_list = np.zeros((200, np.shape(common_time)[0]))

timeseriesf = h5py.File('PPMLES/time_series.h5', 'r')

for i in range(200):
    # Offset the spinup time
    sample_time = (np.array(timeseriesf[f'Sample_{i+1:03}']['time'])
                   - np.array(timeseriesf[f'Sample_{i+1:03}']['Parameters']['t_spinup']))
    # Linearly interpolate the u velocity at probe #9 onto the common time vector
    u_series_list[i] = griddata(sample_time, timeseriesf[f'Sample_{i+1:03}']['u'][9],
                                common_time, method='linear')
```
D) Surrogate model construction example
The training and validation of a POD-GPR surrogate model [Marrel et al. 2015] learning from the PPMLES dataset is given in the following GitHub repository. This surrogate model was successfully used by Lumet et al. 2024b to emulate the LES mean concentration prediction for varying meteorological forcing parameters.
Acknowledgments
This work was granted access to the HPC resources from GENCI-TGCC/CINES (A0062A10822, project 2020-2022). The authors would like to thank Olivier Vermorel for the preliminary development of the LES model, and Simon Lacroix for his proofreading.
This data set contains radar echograms acquired by the University of Alaska Fairbanks High-Frequency Radar Sounder over select glaciers in Alaska. The data are provided in HDF5 formatted files, which include important metadata for interpreting the data. Browse images are also available.
This dataset consists of baseband in-phase/quadrature (I/Q) radio frequency recordings of Wi-Fi and Bluetooth radiated emissions in the 2.4 GHz and 5 GHz unlicensed bands, collected with low-cost software defined radios. A NIST technical note describing the data collection methods is pending publication. All I/Q captures are one second in duration, with a sampling rate of 30 megasamples per second (MS/s) and a center frequency of 2437 MHz for the 2.4 GHz band captures and 5825 MHz for the 5 GHz band captures. In total, the data consist of 900 one-second captures, organized into five Hierarchical Data Format 5 (HDF5) files, where each HDF5 file has a size of 20.1 GB and consists of 180 one-second captures. There is a metadata file associated with each data file in comma-separated values (CSV) format that contains relevant parameters such as center frequency, bandwidth, sampling rate, bit depth, receive gain, and antenna and hardware information. There are two additional CSV files containing estimated gain calibration and noise floor values.
This dataset contains turbine- and plant-level power outputs for 252,500 cases of diverse wind plant layouts operating under a wide range of yawing and atmospheric conditions. The power outputs were computed using the Gaussian wake model in NREL's FLOw Redirection and Induction in Steady State (FLORIS) model, version 2.3.0. The 252,500 cases include 500 unique wind plants generated randomly by a specialized Plant Layout Generator (PLayGen) that samples randomized realizations of wind plant layouts from one of four canonical configurations: (i) cluster, (ii) single string, (iii) multiple string, (iv) parallel string. Other wind plant layout parameters were also randomly sampled, including the number of turbines (25-200) and the mean turbine spacing (3D-10D, where D denotes the turbine rotor diameter). For each layout, 500 different sets of atmospheric conditions were randomly sampled. These include wind speed in 0-25 m/s, wind direction in 0 deg.-360 deg., and turbulence intensity chosen from low (6%), medium (8%), and high (10%). For each atmospheric inflow scenario, the individual turbine yaw angles were randomly sampled from a one-sided truncated Gaussian on the interval 0 deg.-30 deg. oriented relative to the wind inflow direction. This random data is supplemented with a collection of yaw-optimized samples where FLORIS was used to determine turbine yaw angles that maximize power production for the entire plant. To generate this data, a subset of cases was selected (50 atmospheric conditions from each of 50 layouts, for a total of 2,500 additional cases) for which FLORIS was re-run with wake steering control optimization. The IEA onshore reference turbine, which has a 130 m rotor diameter, a 110 m hub height, and a rated power capacity of 3.4 MW, was used as the turbine for all simulations. The simulations were performed using NREL's Eagle high performance computing system in February 2021 as part of the Spatial Analysis for Wind Technology Development project funded by the U.S. Department of Energy Wind Energy Technologies Office. The data was collected, reformatted, and preprocessed for this OEDI submission in May 2023 under the Foundational AI for Wind Energy project funded by the U.S. Department of Energy Wind Energy Technologies Office. This dataset is intended to serve as a benchmark against which new artificial intelligence (AI) or machine learning (ML) tools may be tested. Baseline AI/ML methods for analyzing this dataset have been implemented, and a link to the repository containing those models has been provided. The .h5 data file structure can be found in the GitHub repository under explore_wind_plant_data_h5.ipynb.