100+ datasets found
  1. SELTO Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 23, 2023
    Cite
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
    Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    • E - Young's modulus [Pa]
    • ν - Poisson's ratio [-]
    • σ_ys - a yield stress [Pa]
    • h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    • x, y, z - the indices that state the location of the voxel within the voxel mesh
    • Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
    • Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
    • F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
    • density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:

    from dl4to.datasets import SELTODataset
    
    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd
    
    i = 0 # index of the sample to load
    root = ... # path to the unzipped dataset directory
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_columns = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_columns)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch
    
    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
      shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
      voxels = [df['x'].values, df['y'].values, df['z'].values]
    
      Ω_design = torch.zeros(1, *shape, dtype=int)
      Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))
    
      Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
      Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
      Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
      Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)
    
      F = torch.zeros(3, *shape, dtype=dtype)
      F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
      F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
      F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)
    
      density = torch.zeros(1, *shape, dtype=dtype)
      density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)
    
      return Ω_design, Ω_Dirichlet, F, density
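
    As a quick usage sketch (continuing with the df loaded above):

    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
    print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)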

  2. Stage Two Experiments - Datasets

    • figshare.com
    bin
    Updated Jan 21, 2025
    Cite
    Luke Yerbury (2025). Stage Two Experiments - Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27427629.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Luke Yerbury
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data used in the various stage two experiments in: "Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance". This includes datasets with varied characteristics. All datasets are stored in a dict with tuples of (time series array, class labels). To access the data in Python:

    import pickle

    filename = "dataset.txt"
    with open(filename, 'rb') as f:
      data = pickle.load(f)

  3. A demo fluorescence dataset in different formats

    • zenodo.org
    bin, json +1
    Updated Jan 16, 2024
    Cite
    Christophe Pouzat; Andreas Pippow; Peter Kloppenburg (2024). A demo fluorescence dataset in different formats [Dataset]. http://doi.org/10.5281/zenodo.10518962
    Explore at:
    Available download formats: json, text/x-python, bin
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christophe Pouzat; Andreas Pippow; Peter Kloppenburg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The 4 files contain the same dataset in 4 different formats:

    • Data_POMC.fits (FITS format).
    • Data_POMC.json (JSON format with data as nested arrays suitable for direct import in Python).
    • Data_POMC2.json (JSON format with "flattened" array).
    • Data_POMC.py (a Python module containing a description and a single variable 'stack' with the data as a 3D NumPy array).

    The data are a POMC neuron image stack. The CCD chip size (after binning) is 60x80 and 168 fluorescence images were taken. The fluorophore used was Fura-2. Fluorescence images were acquired at 340 nm every 150 ms (exposure time: 12 ms). The imaging setup consisted of an Imago SensiCam CCD camera with a 640x480 chip (Till Photonics, Graefelfing, Germany) and a Polychromator IV (Till Photonics) that was coupled via an optical fiber into the upright microscope. Emitted fluorescence was detected through a 440 nm long-pass filter (LP440). Data were acquired as 80x60 frames using 8x8 on-chip binning. Images were recorded in analog-to-digital units (ADUs) and stored as 12-bit grayscale images. A depolarizing current pulse was applied just before frame 13, provoking calcium entry. The data were acquired by Andreas Pippow.
    Reference: Joucla et al. (2013) Cell Calcium 54(2):71-85.

    To read Data_POMC.fits into a Python session do:

    import fitsio
    import numpy as np
    fits = fitsio.FITS('Data_POMC.fits','r')
    fits

    To read Data_POMC.py into a Python session do:

    import Data_POMC
    help(Data_POMC)

    To read Data_POMC.json into a Python session do:

    import json
    import numpy as np
    with open("Data_POMC.json","r") as f:
    pomc = json.load(f) # pomc is a dictionary
    pomc_stack = np.array(pomc['stack'])
    print(pomc['metadata'])

    To read Data_POMC2.json into a Python session do:

    import json
    import numpy as np
    with open("Data_POMC2.json","r") as f:
    pomc = json.load(f) # pomc is a dictionary
    pomc_stack = np.reshape(pomc['stack'],(60,80,168),order='f')
    print(pomc['metadata'])

  4. asset

    • tensorflow.org
    • opendatalab.com
    Updated May 7, 2020
    Cite
    (2020). asset [Dataset]. https://www.tensorflow.org/datasets/catalog/asset
    Explore at:
    Dataset updated
    May 7, 2020
    Description

    ASSET is a dataset for evaluating Sentence Simplification systems with multiple rewriting transformations, as described in "ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations." The corpus is composed of 2000 validation and 359 test original sentences that were each simplified 10 times by different annotators. The corpus also contains human judgments of meaning preservation, fluency and simplicity for the outputs of several automatic text simplification systems.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('asset', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  5. Annotated 12 lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). Annotated 12 lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3625007
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students.
    It is used as the test set in the paper:
    "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network".
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a 
    `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different 
    patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12
    different leads of the ECG exam. 
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 
    10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples).
    In order to make them all have the same size (4096 samples) we pad them with zeros
    on both sides. For instance, for a 7 second ECG signal with 2800 samples we include 648
    samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved
    in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4 V, so they should
    be multiplied by 1e-4 in order to obtain the signals in V.
    
    In python, one can read this file using the following sequence:
    ```python
    import h5py
    import numpy as np
    
    with h5py.File("ecg_tracings.hdf5", "r") as f:
      x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It
    contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header).
    The i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`, in all csv files.
    The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contains annotations from two different cardiologists.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When the cardiologist 1 and cardiologist 2
     agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a 
     third senior specialist, aware of the annotations from the other two, decided the diagnosis. 
     3. `dnn.csv` predictions from the deep neural network described in 
     "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such a way 
     that it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
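    
    A short sketch (assuming pandas, numpy and h5py are installed) of pairing the attributes and gold-standard annotations with the tracings by row index:
    ```python
    import h5py
    import numpy as np
    import pandas as pd
    
    attributes = pd.read_csv('attributes.csv')
    gold = pd.read_csv('annotations/gold_standard.csv')
    with h5py.File('ecg_tracings.hdf5', 'r') as f:
      tracings = np.array(f['tracings'])
    
    # row i of both csv files describes the i-th tracing
    print(tracings.shape, attributes.shape, gold.shape)
    ```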
    
  6. unified_qa

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). unified_qa [Dataset]. https://www.tensorflow.org/datasets/catalog/unified_qa
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The UnifiedQA benchmark consists of 20 main question answering (QA) datasets (each may have multiple versions) that target different formats as well as various complex linguistic phenomena. These datasets are grouped into several formats/categories, including: extractive QA, abstractive QA, multiple-choice QA, and yes/no QA. Additionally, contrast sets are used for several datasets (denoted with "contrast_sets_"). These evaluation sets are expert-generated perturbations that deviate from the patterns common in the original dataset. For several datasets that do not come with evidence paragraphs, two variants are included: one where the datasets are used as-is and another that uses paragraphs fetched via an information retrieval system as additional evidence, indicated with "_ir" tags.

    More information can be found at: https://github.com/allenai/unifiedqa.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('unified_qa', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  7. CIFAR-10 Python

    • kaggle.com
    zip
    Updated Jan 27, 2018
    Cite
    Kris (2018). CIFAR-10 Python [Dataset]. https://www.kaggle.com/pankrzysiu/cifar10-python
    Explore at:
    Available download formats: zip (340613496 bytes)
    Dataset updated
    Jan 27, 2018
    Authors
    Kris
    Description

    Context

    CIFAR-10 is an excellent dataset for many image processing experiments.

    Content

    Usage instructions

    in Keras

    from os import listdir, makedirs
    from os.path import join, exists, expanduser
    
    cache_dir = expanduser(join('~', '.keras'))
    if not exists(cache_dir):
      makedirs(cache_dir)
    datasets_dir = join(cache_dir, 'datasets') # /cifar-10-batches-py
    if not exists(datasets_dir):
      makedirs(datasets_dir)
    
    # If you have multiple input datasets, change the below cp command accordingly, typically:
    # !cp ../input/cifar10-python/cifar-10-python.tar.gz ~/.keras/datasets/
    !cp ../input/cifar-10-python.tar.gz ~/.keras/datasets/
    !ln -s ~/.keras/datasets/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
    !tar xzvf ~/.keras/datasets/cifar-10-python.tar.gz -C ~/.keras/datasets/
    

    general Python 3

    def unpickle(file):
      import pickle
      with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
      return dict
    
    !tar xzvf ../input/cifar-10-python.tar.gz
    

    then see section "Dataset layout" in https://www.cs.toronto.edu/~kriz/cifar.html for details
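
    As a quick sketch of that layout (keys per the CIFAR-10 page linked above): each unpickled batch maps b'data' to a 10000x3072 uint8 array, where each row stores the red, green and blue planes of one 32x32 image, and b'labels' to a list of 10000 class indices:

    batch = unpickle('cifar-10-batches-py/data_batch_1')
    images = batch[b'data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1) # to NHWC uint8
    labels = batch[b'labels']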

    Acknowledgements

    Downloaded directly from here:

    https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

    See description: https://www.cs.toronto.edu/~kriz/cifar.html

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  8. imagenet2012

    • tensorflow.org
    Updated Jun 1, 2024
    + more versions
    Cite
    (2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file that is then uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file consists of 100,000 lines corresponding to each image in the test split. Each line of integers corresponds to the rank-ordered, top 5 predictions for each test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.
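
    A minimal sketch of producing such a file (the predictions array here is a random placeholder purely to illustrate the format of 1-indexed, rank-ordered top-5 labels):

    import numpy as np
    
    predictions = np.random.randint(1, 1001, size=(100000, 5)) # placeholder: one row per test image
    with open('classification_submission.txt', 'w') as f:
      for row in predictions:
        f.write(' '.join(map(str, row)) + '\n')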

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png

  9. Advancing Open and Reproducible Water Data Science by Integrating Data...

    • beta.hydroshare.org
    • hydroshare.org
    zip
    Updated Jan 9, 2024
    Cite
    Jeffery S. Horsburgh (2024). Advancing Open and Reproducible Water Data Science by Integrating Data Analytics with an Online Data Repository [Dataset]. https://beta.hydroshare.org/resource/45d3427e794543cfbee129c604d7e865/
    Explore at:
    Available download formats: zip (50.9 MB)
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    HydroShare
    Authors
    Jeffery S. Horsburgh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific and related management challenges in the water domain require synthesis of data from multiple domains. Many data analysis tasks are difficult because datasets are large and complex; standard formats for data types are not always agreed upon nor mapped to an efficient structure for analysis; water scientists may lack training in methods needed to efficiently tackle large and complex datasets; and available tools can make it difficult to share, collaborate around, and reproduce scientific work. Overcoming these barriers to accessing, organizing, and preparing datasets for analyses will be an enabler for transforming scientific inquiries.

    Building on the HydroShare repository’s established cyberinfrastructure, we have advanced two packages for the Python language that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS’s National Water Information System (NWIS), loading of data into performant structures keyed to specific scientific data types that integrate with existing visualization, analysis, and data science capabilities available in Python, and then writing analysis results back to HydroShare for sharing and eventual publication. These capabilities reduce the technical burden for scientists associated with creating a computational environment for executing analyses by installing and maintaining the packages within CUAHSI’s HydroShare-linked JupyterHub server. HydroShare users can leverage these tools to build, share, and publish more reproducible scientific workflows.

    The HydroShare Python Client and USGS NWIS Data Retrieval packages can be installed within a Python environment on any computer running Microsoft Windows, Apple MacOS, or Linux from the Python Package Index using the pip utility. They can also be used online via the CUAHSI JupyterHub server (https://jupyterhub.cuahsi.org/) or other Python notebook environments like Google Colaboratory (https://colab.research.google.com/). Source code, documentation, and examples for the software are freely available in GitHub at https://github.com/hydroshare/hsclient/ and https://github.com/USGS-python/dataretrieval.
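
    A minimal sketch of the two packages in action (the NWIS site number and date range are placeholders, the resource identifier is the one cited above, and the calls follow the hsclient and dataretrieval documentation):

    import dataretrieval.nwis as nwis
    from hsclient import HydroShare
    
    # Retrieve daily-value streamflow records from NWIS for a placeholder gauge
    df = nwis.get_record(sites='03339000', service='dv', start='2020-01-01', end='2020-12-31')
    
    # Sign in to HydroShare interactively and download a resource by its identifier
    hs = HydroShare()
    hs.sign_in()
    resource = hs.resource('45d3427e794543cfbee129c604d7e865')
    resource.download()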

    This presentation was delivered as part of the Hawai'i Data Science Institute's regular seminar series: https://datascience.hawaii.edu/event/data-science-and-analytics-for-water/

  10. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have preprocessed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into the required format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    


    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except:
      wandb.login(anonymous='must')
      print('To use your W&B account,
    Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. 
    Get your W&B access token from here: https://wandb.ai/authorize')
      
      
      
    wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset


    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine

  11. Munsell soil color chart: A hyperspectral dataset

    • zenodo.org
    zip
    Updated Aug 20, 2023
    Cite
    Riestiya Zain Fadillah; Hilda Deborah (2023). Munsell soil color chart: A hyperspectral dataset [Dataset]. http://doi.org/10.5281/zenodo.8143355
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Riestiya Zain Fadillah; Hilda Deborah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains hyperspectral images obtained using SPECIM IQ for the Munsell soil color chart (MSC).

    The hyperspectral images are stored in ENVI format. For those who are only interested in the endmember spectra for the MSC, we also provided the spectral library .sli and .hdr inside the endmembers folder.

    The acquisition details for each image can be found in the .hdr file and metadata folder inside the whole folder. For the whole image, the acquisition details are:

    Table 1. Acquisition details

    samples: 512
    lines: 512
    bands: 204
    default bands: 70, 53, 19
    binning: 1,1
    tint (integration time): 10 ms
    fps: 100
    wavelength range: 397.32 - 1003.58 nm

    The dataset is organized into several folders, each containing different types of datasets.

    • whole folder contains the entire scene hyperspectral image. This folder contains capture, metadata, and results subfolder.
      • the .png inside the folder is a natural-color rendering (RGB from the default bands in Table 1) of the captured hyperspectral image.
      • capture folder contains dark reference, white reference and radiance data
      • metadata folder contains the metadata of the acquisition and device settings.
      • results contains the reflectance calculated by the device (in .dat and .hdr, plus a rendered natural-color .png) from the hyperspectral camera, and .png images of the scene, background, and viewfinder from the device's RGB camera.
    • chips folder contains only the cropped 20×20-voxel regions for each color chip's reflectances. Each page has its own folder and each folder contains .hdr and .img files for each color chip.

    • endmembers folder contains the spectral library (.sli and .hdr). Each page in the MSC has its own .sli and .hdr.

    Some code snippets that might help to read the dataset:

    Using the Python spectral library to load the dataset:

    from spectral import *
    import matplotlib.pyplot as plt
    
    # load the hyperspectral image .hdr and store it to a variable
    hsi = open_image(PATH)
    
    # get the natural RGB plotting of the hyperspectral image using the SPECIM main band
    hsi_rgb = hsi[:,:,[70,53,19]]
    
    # read the spectral library .sli and store it to a variable
    sli = open_image(PATH)
    # plot the first endmember 
    plt.plot(sli.spectra[0])
    # get the endmembers name
    sli.names

    If you have any questions, kindly reach me at riestiyf@stud.ntnu.no.

  12. Italian Coronavirus Cases by Age group and Sex

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    janluke (2025). Italian Coronavirus Cases by Age group and Sex [Dataset]. https://www.kaggle.com/giangip/iccas/code
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    janluke
    Description

    Italy Coronavirus Cases by Age group and Sex (ICCAS)

    This repository contains datasets about the number of Italian Sars-CoV-2 confirmed cases and deaths disaggregated by age group and sex. The data is (automatically) extracted from pdf reports (like this) published by Istituto Superiore di Sanità (ISS) twice a week. A link to the most recent report can be found on this page under the section "Documento esteso".

    PDF reports are usually published on Tuesday and Friday and contain data updated to 4 p.m. of the day before their release.

    I wrote a script that is run periodically in order to automatically update this repository when a new report is published. The code is hosted in a separate repository.

    For feedback and issues, refer to the GitHub repository.

    Data folder structure

    The data folder is structured as follows:

    data
    ├── by-date
    │   └── iccas_{date}.csv   Dataset with cases/deaths updated to 4 p.m. of {date}
    └── iccas_full.csv         Dataset with data from all reports (by date)

    The full dataset is obtained by concatenating all datasets in by-date and has an additional date column. If you use pandas, I suggest you read this dataset using a multi-index on the first two columns:

    import pandas as pd

    df = pd.read_csv('iccas_full.csv', index_col=(0, 1)) # ('date', 'age_group')

    NOTE: {date} is the date the data refers to, NOT the release date of the report it was extracted from: as written above, a report is usually released with a day of delay. For example, iccas_2020-03-19.csv contains data relative to 2020-03-19 which was extracted from the report published in 2020-03-20.

    Dataset details

    Each dataset in the by-date folder contains the same data you can find in "Table 1" of the corresponding ISS report. This table contains the number of confirmed cases, deaths and other derived information disaggregated by age group (0-9, 10-19, ..., 80-89, >=90) and sex.

    WARNING: the sum of male and female cases is not equal to the total number of cases, since the sex of some cases is unknown. The same applies to deaths.

    Below, {sex} can be male or female.

    Column: Description

    date: (Only in iccas_full.csv) Date in the format YYYY-MM-DD; numbers are updated to 4 p.m. of this date
    age_group: Values: "0-9", "10-19", ..., "80-89", ">=90"
    cases: Number of confirmed cases (both sexes + unknown-sex; active + closed)
    deaths: Number of deaths (both sexes + unknown-sex)
    {sex}_cases: Number of cases of sex {sex}
    {sex}_deaths: Number of cases of sex {sex} that ended in death
    cases_percentage: 100 * cases / cases_of_all_ages
    deaths_percentage: 100 * deaths / deaths_of_all_ages
    fatality_rate: 100 * deaths / cases
    {sex}_cases_percentage: 100 * {sex}_cases / (male_cases + female_cases) (cases of unknown sex excluded)
    {sex}_deaths_percentage: 100 * {sex}_deaths / (male_deaths + female_deaths) (cases of unknown sex excluded)
    {sex}_fatality_rate: 100 * {sex}_deaths / {sex}_cases

    All columns that can be computed from the absolute counts of cases and deaths (bottom half of the table above) were re-computed to increase precision.
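
    As a quick sketch of working with the multi-index suggested above (column names from the table):

    import pandas as pd

    df = pd.read_csv('iccas_full.csv', index_col=(0, 1)) # ('date', 'age_group')
    latest_date = df.index.get_level_values('date').max()
    print(df.loc[latest_date, 'fatality_rate']) # fatality rate by age group on the latest date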

  13. census-income

    • huggingface.co
    Updated Jul 21, 2025
    Cite
    WC (2025). census-income [Dataset]. https://huggingface.co/datasets/cestwc/census-income
    Explore at:
    Dataset updated
    Jul 21, 2025
    Authors
    WC
    Description

    Dataset Card for Census Income (Adult)

    This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset:

    import pandas as pd
    from datasets import Dataset, DatasetDict, Features, Value, ClassLabel

    URLs

    url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    url2 = …

    See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
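
    Alternatively, a minimal sketch of loading the published dataset directly with the Hugging Face datasets library:

    from datasets import load_dataset

    ds = load_dataset("cestwc/census-income")
    print(ds)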

  14. Li-ion Battery Aging Dataset

    • kaggle.com
    Updated May 12, 2024
    Cite
    GIRITHARAN MANI (2024). Li-ion Battery Aging Dataset [Dataset]. https://www.kaggle.com/datasets/mystifoe77/nasa-battery-data-cleaned/suggestions
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GIRITHARAN MANI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Overview

    This dataset provides a comprehensive view of the aging process of lithium-ion batteries, facilitating the estimation of their Remaining Useful Life (RUL). Originally sourced from NASA's open repository, the dataset has undergone meticulous preprocessing to enhance its analytical utility. The data is presented in a user-friendly CSV format after extracting relevant features from the original .mat files.

    Key Features of the Dataset

    1. Battery Performance Metrics:

      • Capacity: Measured over time to assess degradation.
      • Internal Resistance (Re): Represents the electrical resistance of the battery.
      • Charge Transfer Resistance (Rct): Indicates charge movement efficiency.
    2. Environmental Conditions:

      • Ambient Temperature: External temperature affecting battery performance.
    3. Identification Attributes:

      • Battery ID: Unique identifier for each battery tested.
      • Test ID: Links specific test conditions to outcomes.
      • UID & Filename: Traceable dataset references.
    4. Processed Data:

      • Missing values have been addressed.
      • Columns irrelevant to RUL estimation have been removed.
      • Skewness in the data has been corrected for statistical accuracy.
    5. Labels:

      • Degradation States: Categorized into intervals for easier interpretation.
      • Ranges include operational and failure states.

    Potential Applications

    1. Battery Health Monitoring:

      • Predict battery failure timelines.
      • Enhance battery maintenance strategies.
    2. Data Science and Machine Learning:

      • Model development for RUL prediction.
      • Feature engineering for predictive analysis.
    3. Research and Development:

      • Improve battery design.
      • Study the impact of environmental and operational conditions on battery life.

    Technical Details

    • File Format: CSV
    • Size: ~625.02 kB
    • Columns: 9
    • Data Points: Multiple observations across various tests.

    Tags

    • Keywords: Lithium-ion batteries, RUL, Battery Aging, Machine Learning, Data Analysis, Predictive Maintenance.

    License

    • Apache 2.0: Permits academic and commercial use.

    Usage Instructions

    1. Import the dataset into your data analysis tools (e.g., Python, R, MATLAB).
    2. Explore features to understand correlations and dependencies.
    3. Use machine learning models for RUL prediction.
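
    For step 1, a minimal pandas sketch (the CSV filename here is hypothetical; substitute the name of the downloaded file):

    import pandas as pd

    df = pd.read_csv('nasa_battery_aging.csv') # hypothetical filename
    print(df.info()) # inspect the 9 columns and confirm missing values were handled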

    Provenance

    The dataset was retrieved from NASA's publicly available data repositories. It has been preprocessed to align with research and industrial standards for usability in analytical tasks.

    Call to Action

    Leverage this dataset to enhance your understanding of lithium-ion battery degradation and build models that could revolutionize energy storage solutions.

  15. wmt22_african

    • huggingface.co
    Updated May 15, 2007
    Cite
    Ai2 (2007). wmt22_african [Dataset]. https://huggingface.co/datasets/allenai/wmt22_african
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 15, 2007
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    Description

    Dataset Card for allenai/wmt22_african

      Dataset Summary
    

    This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 248 pairs for the African languages that are part of the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.

      How to use the data
    

    There are two ways to access the data:

    Via the Hugging Face Python datasets library:

    from datasets import load_dataset
    dataset = …

    See the full description on the dataset page: https://huggingface.co/datasets/allenai/wmt22_african.
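
    As a hedged sketch of the datasets-library route (the configuration name afr-eng is an assumption, following the shared task's language-pair codes):

    from datasets import load_dataset

    dataset = load_dataset("allenai/wmt22_african", "afr-eng") # assumed config name
    print(dataset)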

  16. quality

    • tensorflow.org
    Updated Dec 20, 2022
    Cite
    (2022). quality [Dataset]. https://www.tensorflow.org/datasets/catalog/quality
    Explore at:
    Dataset updated
    Dec 20, 2022
    Description

    QuALITY, a multiple-choice, long-reading comprehension dataset.

    We provide only the raw version.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('quality', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  17. eurosat_dataset

    • huggingface.co
    Updated Oct 13, 2023
    Cite
    Chris Honaker (2023). eurosat_dataset [Dataset]. https://huggingface.co/datasets/Honaker/eurosat_dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 13, 2023
    Authors
    Chris Honaker
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for EuroSat

      How to Use
    

    Install datasets:

    pip install datasets

    How to use in Python

    from datasets import load_dataset

    train_data = load_dataset("Honaker/eurosat_dataset", split="train")

      Dataset Summary
    

    EuroSat is an image classification dataset with 10 different classes on satellite imagery. There are over 27,000 labeled images.

      Dataset Structure
    

    The dataset is structured as follows:

    DatasetDict({
      train: Dataset({…

    See the full description on the dataset page: https://huggingface.co/datasets/Honaker/eurosat_dataset.

  18. [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image...

    • data.niaid.nih.gov
    Updated Nov 28, 2024
    Cite
    Rui Shi (2024). [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification with Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5208229
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Bilian Ke
    Jiancheng Yang
    Bingbing Ni
    Hanspeter Pfister
    Zequan Liu
    Rui Shi
    Donglai Wei
    Lin Zhao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]

    Abstract

    We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

    Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.

    Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!

    Python Usage

    We recommend our official code to download, parse and use the MedMNIST dataset:

    % pip install medmnist
    % python

    To use the standard 28-size (MNIST-like) version utilizing the downloaded files:

    from medmnist import PathMNIST

    train_dataset = PathMNIST(split="train")

    To enable automatic downloading by setting download=True:

    from medmnist import NoduleMNIST3D

    val_dataset = NoduleMNIST3D(split="val", download=True)

    Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:

    from medmnist import ChestMNIST

    test_dataset = ChestMNIST(split="test", download=True, size=224)
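
    A small training-side sketch (assuming PyTorch and torchvision are installed; the ToTensor transform converts the PIL images yielded by MedMNIST into tensors so they can be batched):

    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader
    from medmnist import PathMNIST

    train_dataset = PathMNIST(split="train", download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    images, labels = next(iter(train_loader)) # one batch of 28x28 pathology images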

    Citation

    If you find this project useful, please cite both the v1 and v2 papers as:

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.

    Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

    or using bibtex:

    @article{medmnistv2,
      title={MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
      author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
      journal={Scientific Data},
      volume={10},
      number={1},
      pages={41},
      year={2023},
      publisher={Nature Publishing Group UK London}
    }

    @inproceedings{medmnistv1,
      title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
      author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
      booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
      pages={191--195},
      year={2021}
    }

    Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

    License

    The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

    The code is under Apache-2.0 License.

    Changelog

    v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.

    v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.

    v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.

    v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.

    v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.

    Note: This dataset is NOT intended for clinical use.

  19. STEAD subsample 4 CDiffSD

    • zenodo.org
    bin
    Updated Apr 30, 2024
    + more versions
    Cite
    Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele Trappolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    STEAD Subsample Dataset for CDiffSD Training

    Overview

    This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` library to open.

    Dataset Files

    The dataset includes the following files:

    • train: Used for both training and validation phases (with a train/validation split). Contains earthquake ground truth traces.
    • noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.
    • test: Used for the testing phase, structured similarly to train.
    • noise_test: Used for the testing phase, contains noise data for testing.

    Each file is structured to support the training and evaluation of seismic denoising models.

    Data

    The HDF5 files named noise contain two main datasets:

    • traces: This dataset includes N events, each 6000 samples long (the length of the traces). Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).
    • metadata: This dataset contains the names of the traces for each event.

    Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

    • p_arrival: Contains the arrival indices of P-waves, expressed in counts.
    • s_arrival: Contains the arrival indices of S-waves, also expressed in counts.


    Usage

    To load these files in a Python environment, use the following approach:

    ```python
    import h5py
    import numpy as np

    # Open the HDF5 file in read mode
    with h5py.File('train_noise.hdf5', 'r') as file:
      # Print all the main keys in the file
      print("Keys in the HDF5 file:", list(file.keys()))

      if 'traces' in file:
        # Access the dataset
        data = file['traces'][:10] # Load the first 10 traces

      if 'metadata' in file:
        # Access the dataset
        trace_name = file['metadata'][:10] # Load the first 10 metadata entries
    ```

    Ensure that the path to the file is correctly specified relative to your Python script.

    Requirements

    To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:

    ```bash
    pip install numpy
    pip install h5py
    ```

  20. MD17 data for graph2mat

    • data.dtu.dk
    txt
    Updated Aug 6, 2024
    + more versions
    Cite
    Arghya Bhowmik (2024). MD17 data for graph2mat [Dataset]. http://doi.org/10.11583/DTU.26195285.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Creators

    Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234)
    Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276)
    Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)

    Related publication

    The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21) https://github.com/BIG-MAP/graph2mat

    Short description

    This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of a subset of the MD17 aspirin dataset. The subset is taken from the third split in (https://doi.org/10.6084/m9.figshare.12672038.v3).

    SIESTA 5.0.0 was used to compute the dataset.

    Contents

    The dataset has two directories:

    • pseudos: Contains the pseudopotentials used for the calculation (obtained from http://www.pseudo-dojo.org/, type NC SR (ONCVPSP v0.5), PBE, standard accuracy)
    • splits: The data splits used in the published paper. Each file "splits_X.json" contains the splits for training size X.

    And then, three directories containing the calculations with different basis sets:

    • matrix_dataset_defsplit: Uses the default split-valence DZP basis in SIESTA.
    • matrix_dataset_optimsplit: Uses a split-valence DZP basis optimized for aspirin.
    • matrix_dataset_defnodes: Uses the default nodes DZP basis in SIESTA.

    Each of the basis directories has two subdirectories:

    • basis: Contains the files specifying the basis used for each atom.
    • runs: The results of running the SIESTA simulations. Contents are discussed next.

    The "runs" directory contains one directory for each run, named with the index of the run. Each run directory contains:

    • RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
    • RUN.out: The log of the SIESTA run.
    • siesta.TSDE: Contains the Density and Energy Density matrices.
    • siesta.TSHS: Contains the Hamiltonian and Overlap matrices.

    Each matrix can be read using the sisl Python package (https://github.com/zerothi/sisl) like:

    import sisl
    
    matrix = sisl.get_sile("RUN.fdf").read_X()
    

    where X is hamiltonian, overlap, density_matrix or energy_density_matrix.
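    For example, to read all four matrices from a single run (a sketch; the run path is hypothetical and follows the directory layout described above):

    import sisl

    # Hypothetical run path following the layout described above
    fdf = sisl.get_sile("matrix_dataset_defsplit/runs/0/RUN.fdf")

    H = fdf.read_hamiltonian()               # from siesta.TSHS
    S = fdf.read_overlap()                   # from siesta.TSHS
    DM = fdf.read_density_matrix()           # from siesta.TSDE
    EDM = fdf.read_energy_density_matrix()   # from siesta.TSDE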

    To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).

    Cite this data

    https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark

    License

    This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

Cite
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392

SELTO Dataset

Explore at:
Available download formats: application/gzip
Dataset updated
May 23, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

  • E - Young's modulus [Pa]
  • ν - Poisson's ratio [-]
  • σ_ys - a yield stress [Pa]
  • h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

  • x, y, z - the indices that state the location of the voxel within the voxel mesh
  • Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
  • Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
  • F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
  • density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

How to Import the Dataset

With DL4TO: Using the Python library DL4TO (https://github.com/dl4to/dl4to), it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:

from dl4to.datasets import SELTODataset

dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved, name is the name of the SELTO subset (one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex"), and train is a boolean that indicates whether the training or validation subset should be loaded. See the DL4TO documentation for further details on the SELTODataset class.
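For instance, loading the disc simple training subset could look as follows (a sketch; the arguments mirror the call above, while the root path and the indexing behavior are assumptions based on the standard torch.utils.data.Dataset interface):

from dl4to.datasets import SELTODataset

# Download (if necessary) and load the training split of disc_simple
dataset = SELTODataset(root='./SELTO', name='disc_simple', train=True)

print(len(dataset))  # number of problem-solution pairs
sample = dataset[0]  # indexing per the torch Dataset interface (assumed)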

Without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe:

import pandas as pd

root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)

Similarly, we can import an i_info.csv file via:

file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
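Assuming each i_info.csv stores a single row of scalar values, they can then be unpacked directly (a sketch):

# One row per sample is assumed; the columns follow info_column_names
E, ν, σ_ys, h = df_info.iloc[0]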

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
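  # Infer the voxel grid shape from the last row (assumes rows are ordered by voxel index)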
  shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
  voxels = [df['x'].values, df['y'].values, df['z'].values]

  Ω_design = torch.zeros(1, *shape, dtype=int)
  Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

  Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
  Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
  Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
  Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

  F = torch.zeros(3, *shape, dtype=dtype)
  F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
  F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
  F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

  density = torch.zeros(1, *shape, dtype=dtype)
  density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

  return Ω_design, Ω_Dirichlet, F, density
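
A minimal usage sketch chaining the steps above (with root and i defined as before):

df = pd.read_csv(f'{root}/{i}.csv', names=columns)
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)

print(Ω_design.shape)  # torch.Size([1, X, Y, Z])
print(F.shape)         # torch.Size([3, X, Y, Z])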
