11 datasets found
  1. PHMSA Pipeline Safety Regions

    • catalog.data.gov
    Updated May 14, 2025
    Cite
    Pipeline and Hazardous Materials Safety Administration (PHMSA) (Point of Contact) (2025). PHMSA Pipeline Safety Regions [Dataset]. https://catalog.data.gov/dataset/phmsa-pipeline-safety-regions1
    Explore at:
    Dataset updated
    May 14, 2025
    Description

    The Pipeline and Hazardous Materials Safety Administration (PHMSA) Pipeline Safety Regions dataset was compiled on October 04, 2022 from the Pipeline and Hazardous Materials Safety Administration (PHMSA) and is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). PHMSA's Office of Pipeline Safety (OPS) is responsible for protecting people and the environment from pipeline failures by analyzing pipeline safety and accident data; evaluating which safety standards need improvement and where new rulemakings are needed; setting and enforcing regulations and standards for the design, construction, operation, maintenance, or abandonment of pipelines by pipeline companies; educating operators, states, and communities on how to keep pipelines safe; facilitating research and development into better pipeline technologies; training state and federal pipeline inspectors; and administering grants to states and localities for pipeline inspections, damage prevention, and emergency response. The PHMSA Pipeline Safety Regions layer contains polygon features representing each of the five regions (Central, Eastern, Southern, Southwest, and Western) that make up PHMSA's Office of Pipeline Safety. Each region office is charged with ensuring the safe, reliable, and environmentally sound operation of the nation's pipeline infrastructure. Despite regional divisions, the jurisdiction of PHMSA staff is nationwide and not limited to their regional area of responsibility. A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1530287
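    As a quick illustration of working with the layer, a hedged sketch (GeoPandas is one common way to read such polygon layers, not an official usage example; the file name below is hypothetical):

    ```python
    # Hedged sketch: load and inspect the PHMSA regions polygon layer.
    # Assumes the layer was downloaded locally; the file name is hypothetical.
    import geopandas as gpd

    regions = gpd.read_file("phmsa_pipeline_safety_regions.geojson")
    print(regions.shape)       # one row per region polygon (five expected)
    print(regions.columns)     # attribute fields; see the data dictionary at the DOI above
    regions.boundary.plot()    # quick look at region boundaries
    ```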

  2. Data from: Packing provenance using CPM RO-Crate profile

    • data.niaid.nih.gov
    Updated Jun 29, 2023
    Cite
    Wittner, Rudolf (2023). Packing provenance using CPM RO-Crate profile [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7676923
    Explore at:
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    Soiland-Reyes, Stian
    Wittner, Rudolf
    Leo, Simone
    Gallo, Matej
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example application of the CPM RO-Crate profile, which integrates the Common Provenance Model (CPM) with the Process Run Crate profile.

    As the CPM is the groundwork for the development of the ISO 23494 Biotechnology — Provenance information model for biological material and data series of provenance standards, the resulting profile and this example are intended to be presented at one of the regular ISO TC275 WG5 meetings and will become an input to the development of the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard.

    Description of the AI pipeline

    The goal of the AI pipeline whose execution is described in this dataset is to train an AI model to detect the presence of carcinoma cells in high-resolution human prostate images. The pipeline is implemented as a set of Python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. The AI pipeline consists of the following three general parts:

    • Image data preprocessing. The goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. Because the model cannot process the entire high-resolution images, this step splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. Background patches are filtered out, and the remaining tissue patches are labeled according to the provided pathologists' annotations.

    • AI model training. The goal of this step is to train the AI model using the training dataset generated in the previous step. The result of this step is a trained AI model.

    • AI model evaluation. The goal of this step is to evaluate the trained model's performance on a dataset that was not shown to the model during training. The results of this step are statistics describing the AI model's performance.

    In addition to the above, execution of the steps generates log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, and timestamps. As suggested by the CPM, the log files and additional metadata present on the filesystem are then used by a provenance generation step that transforms the available information into CPM-compliant data structures and serializes them into files.

    Finally, all these artifacts are packed together in an RO-Crate.

    For the purposes of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In a real-world execution, the input dataset would consist of terabytes of data; for this example, we have selected a representative image for each part of the input dataset. The only difference between a real-world crate and this example is that the real-world crate would contain more input files.

    Description of the RO-Crate

    Process Run Crate related aspects

    The Process Run Crate profile can be used to pack artifacts of a computational workflow whose individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and its execution is not managed centrally by a workflow engine, the Process Run Crate profile can be applied.

    Each of the computational steps is expressed within the crate's ro-crate-metadata.json file as a pair of elements: 1) the software used to create files; 2) a specific execution of that software. In particular, we use the SoftwareSourceCode type to describe the executed Python scripts and the CreateAction type to describe their actual executions.
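    For illustration, a hedged sketch of how such a pair of entities might look, built as Python dicts and serialized to JSON. This is not the crate's actual metadata; the @id values and file names are hypothetical:

    ```python
    # Hedged sketch: a script (SoftwareSourceCode) and one run of it (CreateAction)
    # as such entities might appear inside ro-crate-metadata.json.
    import json

    script = {
        "@id": "https://github.com/example/pipeline/blob/main/train.py",  # hypothetical
        "@type": "SoftwareSourceCode",
        "name": "AI model training script",
        "programmingLanguage": "Python",
    }

    run = {
        "@id": "#training-run-1",                        # hypothetical local identifier
        "@type": "CreateAction",
        "instrument": {"@id": script["@id"]},            # the software that was executed
        "object": [{"@id": "data/training_patches/"}],   # inputs (hypothetical paths)
        "result": [{"@id": "models/trained_model.pt"}],  # outputs (hypothetical paths)
    }

    print(json.dumps([script, run], indent=2))
    ```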

    As a result, the crate contains the following seven “executables”:

    • Three Python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.

    • Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM-compliant provenance files. The fourth is a meta-provenance generation script.

    For each of the executables, its execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type; seven create-actions are therefore present in the resulting crate.

    The input dataset, intermediate results, configuration files, and resulting provenance files are expressed according to the underlying RO-Crate specification.

    CPM RO-Crate related aspects

    The main purpose of the CPM RO-Crate profile is to enable identification of the CPM-compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific types for such files: CPMProvenanceFile and CPMMetaProvenanceFile.

    In this case, the RO-Crate contains three CPM-compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated by the three provenance generation scripts, which use the available log files and additional information to generate the CPM-compliant files. In terms of the CPM, the provenance generation scripts implement the concept of a provenance finalization event. The three provenance generation scripts are assigned the SoftwareSourceCode type and have corresponding executions expressed in the crate using the CreateAction type.
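    Continuing the sketch above (again with a hypothetical file name), such a provenance file might be typed as follows so that tools can identify it:

    ```python
    # Hedged sketch: typing a CPM-compliant provenance file in the crate's metadata.
    provenance_file = {
        "@id": "provenance/training_provenance.provn",  # hypothetical path
        "@type": ["File", "CPMProvenanceFile"],
        "name": "CPM provenance of the training step",
    }
    print(provenance_file["@type"])
    ```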

    Remarks

    The resulting RO-Crate packs artifacts of an execution of the AI pipeline. The scripts that implement the individual steps of the pipeline and the provenance generation are not included in the crate directly; they are hosted on GitHub and referenced from the crate's ro-crate-metadata.json at their remote location.

    The input image files included in this RO-Crate come from the Camelyon16 dataset.

  3. PHMSA Pipeline Safety Regions

    • data-usdot.opendata.arcgis.com
    • geodata.bts.gov
    • +2more
    Updated Oct 17, 2022
    Cite
    U.S. Department of Transportation: ArcGIS Online (2022). PHMSA Pipeline Safety Regions [Dataset]. https://data-usdot.opendata.arcgis.com/datasets/phmsa-pipeline-safety-regions
    Explore at:
    Dataset updated
    Oct 17, 2022
    Dataset authored and provided by
    U.S. Department of Transportation: ArcGIS Online
    Description

    Identical to the description of entry 1 above (the same NTAD dataset; this version omits the data dictionary link).

  4. DataSheet1_ML-GAP: machine learning-enhanced genomic analysis pipeline using...

    • frontiersin.figshare.com
    • figshare.com
    pdf
    Updated Sep 27, 2024
    Cite
    Melih Agraz; Dincer Goksuluk; Peng Zhang; Bum-Rak Choi; Richard T. Clements; Gaurav Choudhary; George Em Karniadakis (2024). DataSheet1_ML-GAP: machine learning-enhanced genomic analysis pipeline using autoencoders and data augmentation.PDF [Dataset]. http://doi.org/10.3389/fgene.2024.1442759.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 27, 2024
    Dataset provided by
    Frontiers
    Authors
    Melih Agraz; Dincer Goksuluk; Peng Zhang; Bum-Rak Choi; Richard T. Clements; Gaurav Choudhary; George Em Karniadakis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The advent of RNA sequencing (RNA-Seq) has significantly advanced our understanding of the transcriptomic landscape, revealing intricate gene expression patterns across biological states and conditions. However, the complexity and volume of RNA-Seq data pose challenges in identifying differentially expressed genes (DEGs), which are critical for understanding the molecular basis of diseases like cancer.

    Methods: We introduce a novel Machine Learning-Enhanced Genomic Data Analysis Pipeline (ML-GAP) that incorporates autoencoders and innovative data augmentation strategies, notably the MixUp method, to overcome these challenges. By creating synthetic training examples through a linear combination of input pairs and their labels, MixUp significantly enhances the model's ability to generalize from the training data to unseen examples.

    Results: Our results demonstrate ML-GAP's superiority in accuracy, efficiency, and insights, particularly crediting the MixUp method for its substantial contribution to the pipeline's effectiveness, greatly advancing genomic data analysis and setting a new standard in the field.

    Discussion: This, in turn, suggests that ML-GAP not only has the potential to detect DEGs more accurately but also offers new avenues for therapeutic intervention and research. By integrating explainable artificial intelligence (XAI) techniques, ML-GAP ensures a transparent and interpretable analysis, highlighting the significance of the identified genetic markers.
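    For readers unfamiliar with MixUp, a minimal NumPy sketch of the augmentation idea described above (the generic technique, not ML-GAP's exact implementation; array names and sizes are illustrative):

    ```python
    # Minimal MixUp sketch: synthetic examples as convex combinations of
    # input pairs and their labels (generic technique, not ML-GAP's code).
    import numpy as np

    def mixup(X, y, alpha=0.2, rng=np.random.default_rng(0)):
        """X: (n, d) expression matrix; y: (n, k) one-hot labels."""
        lam = rng.beta(alpha, alpha)           # mixing coefficient in (0, 1)
        perm = rng.permutation(len(X))         # random pairing of examples
        X_mix = lam * X + (1 - lam) * X[perm]  # combine inputs
        y_mix = lam * y + (1 - lam) * y[perm]  # combine labels the same way
        return X_mix, y_mix

    X = np.random.default_rng(1).normal(size=(8, 100))  # toy RNA-Seq-like matrix
    y = np.eye(2)[np.array([0, 1, 0, 1, 0, 1, 0, 1])]   # toy one-hot labels
    X_mix, y_mix = mixup(X, y)
    ```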

  5. Packing provenance using CPM RO-Crate profile

    • zenodo.org
    zip
    Updated Jun 29, 2023
    Cite
    Rudolf Wittner; Matej Gallo; Simone Leo; Stian Soiland-Reyes (2023). Packing provenance using CPM RO-Crate profile [Dataset]. http://doi.org/10.5281/zenodo.7676924
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Rudolf Wittner; Matej Gallo; Simone Leo; Stian Soiland-Reyes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identical to the description of entry 2 above; this Zenodo record is the original deposit of the same RO-Crate.

  6. AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and...

    • datarade.ai
    Updated Dec 18, 2024
    Cite
    MealMe (2024). AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites [Dataset]. https://datarade.ai/data-products/ai-training-data-annotated-checkout-flows-for-retail-resta-mealme
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset authored and provided by
    MealMe
    Area covered
    United States of America
    Description

    AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites

    Overview

    Unlock the next generation of agentic commerce and automated shopping experiences with this comprehensive dataset of meticulously annotated checkout flows, sourced directly from leading retail, restaurant, and marketplace websites. Designed for developers, researchers, and AI labs building large language models (LLMs) and agentic systems capable of online purchasing, this dataset captures the real-world complexity of digital transactions—from cart initiation to final payment.

    Key Features

    Breadth of Coverage: Over 10,000 unique checkout journeys across hundreds of top e-commerce, food delivery, and service platforms, including but not limited to Walmart, Target, Kroger, Whole Foods, Uber Eats, Instacart, Shopify-powered sites, and more.

    Actionable Annotation: Every flow is broken down into granular, step-by-step actions, complete with timestamped events, UI context, form field details, validation logic, and response feedback. Each step includes:

    Page state (URL, DOM snapshot, and metadata)

    User actions (clicks, taps, text input, dropdown selection, checkbox/radio interactions)

    System responses (AJAX calls, error/success messages, cart/price updates)

    Authentication and account linking steps where applicable

    Payment entry (card, wallet, alternative methods)

    Order review and confirmation

    Multi-Vertical, Real-World Data: Flows sourced from a wide variety of verticals and real consumer environments, not just demo stores or test accounts. Includes complex cases such as multi-item carts, promo codes, loyalty integration, and split payments.

    Structured for Machine Learning: Delivered in standard formats (JSONL, CSV, or your preferred schema), with every event mapped to action types, page features, and expected outcomes. Optional HAR files and raw network request logs provide an extra layer of technical fidelity for action modeling and RLHF pipelines. (A parsing sketch follows this feature list.)

    Rich Context for LLMs and Agents: Every annotation includes both human-readable and model-consumable descriptions:

    “What the user did” (natural language)

    “What the system did in response”

    “What a successful action should look like”

    Error/edge case coverage (invalid forms, out-of-stock (OOS) items, address/payment errors)

    Privacy-Safe & Compliant: All flows are depersonalized and scrubbed of PII. Sensitive fields (like credit card numbers, user addresses, and login credentials) are replaced with realistic but synthetic data, ensuring compliance with privacy regulations.

    Each flow tracks the user journey from cart to payment to confirmation, including:

    Adding/removing items

    Applying coupons or promo codes

    Selecting shipping/delivery options

    Account creation, login, or guest checkout

    Inputting payment details (card, wallet, Buy Now Pay Later)

    Handling validation errors or OOS scenarios

    Order review and final placement

    Confirmation page capture (including order summary details)
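    As referenced above, a hedged sketch of iterating over such flows, assuming JSONL delivery; every field name here (flow_id, action_type, page_url, outcome) is hypothetical, since the real schema comes from the provider:

    ```python
    # Hedged sketch: reading annotated checkout events from a JSONL export.
    # Field names are hypothetical; the actual schema is provider-defined.
    import json

    def load_flows(path):
        """Group step-level events by their flow identifier."""
        flows = {}
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                flows.setdefault(event["flow_id"], []).append(event)
        return flows

    flows = load_flows("checkout_flows.jsonl")  # hypothetical file name
    for flow_id, events in flows.items():
        for e in events:
            print(flow_id, e["action_type"], e["page_url"], e["outcome"])
    ```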

    Why This Dataset?

    Building LLMs, agentic shopping bots, or e-commerce automation tools demands more than just page screenshots or API logs. You need deeply contextualized, action-oriented data that reflects how real users interact with the complex, ever-changing UIs of digital commerce. Our dataset uniquely captures:

    The full intent-action-outcome loop

    Dynamic UI changes, modals, validation, and error handling

    Nuances of cart modification, bundle pricing, delivery constraints, and multi-vendor checkouts

    Mobile vs. desktop variations

    Diverse merchant tech stacks (custom, Shopify, Magento, BigCommerce, native apps, etc.)

    Use Cases

    LLM Fine-Tuning: Teach models to reason through step-by-step transaction flows, infer next-best-actions, and generate robust, context-sensitive prompts for real-world ordering.

    Agentic Shopping Bots: Train agents to navigate web/mobile checkouts autonomously, handle edge cases, and complete real purchases on behalf of users.

    Action Model & RLHF Training: Provide reinforcement learning pipelines with ground truth “what happens if I do X?” data across hundreds of real merchants.

    UI/UX Research & Synthetic User Studies: Identify friction points, bottlenecks, and drop-offs in modern checkout design by replaying flows and testing interventions.

    Automated QA & Regression Testing: Use realistic flows as test cases for new features or third-party integrations.

    What’s Included

    10,000+ annotated checkout flows (retail, restaurant, marketplace)

    Step-by-step event logs with metadata, DOM, and network context

    Natural language explanations for each step and transition

    All flows are depersonalized and privacy-compliant

    Example scripts for ingesting, parsing, and analyzing the dataset

    Flexible licensing for research or commercial use

    Sample Categories Covered

    Grocery delivery (Instacart, Walmart, Kroger, Target, etc.)

    Restaurant takeout/delivery (Ub...

  7. Data and script pipeline for: Common to rare transfer learning (CORAL)...

    • zenodo.org
    bin, html, tsv
    Updated May 27, 2025
    Cite
    Otso Ovaskainen (2025). Data and script pipeline for: Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods [Dataset]. http://doi.org/10.5281/zenodo.15524215
    Explore at:
    Available download formats: bin, html, tsv
    Dataset updated
    May 27, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Otso Ovaskainen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The scripts and data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are intended as user-friendly starting points for understanding and testing how to implement CORAL; Demo 3 is included mainly for reproducibility.

    System requirements

    · The software can be used in any operating system where R can be installed.

    · We have developed and tested the software in a Windows environment with R version 4.3.1.

    · Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

    · Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

    · Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).

    · The use of the software does not require any non-standard hardware.

    Installation guide

    · The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus requires no installation other than installing R.

    Demo 1: Software demo with simulated data

    The software demonstration consists of two R-markdown files:

    · D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all data needed for the analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree), and studyDesign (list of sampling units). Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to the environmental predictors).

    · D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.

    Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.

    Demo 2: Software demo with a small subset of the data used in the paper

    The software demonstration consists of one R-markdown file:

    MA_small_demo. This script uses the CORAL functions in Hmsc to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with a prevalence of at least 40 and less than 50, and common species as those with a prevalence of at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.

    Scripts and data for reproducing the results presented in the paper (Demo 3)

    The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:

    · S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.

    · S02_export_Hmsc_model - prepares the initial model for fitting with Hmsc-HPC. Fitting of the model can then be done in an HPC environment with the bash file generated by the script. Computationally intensive.

    · S03_import_posterior – imports the posterior distributions sampled by the initial model.

    · S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.

    · S05_visualize_backbone_model – checks backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.

    · S06_construct_coral_priors – calculates the CORAL prior parameters.

    The remaining scripts evaluate the model:

    · S07_evaluate_prior_predictionss – uses the CORAL prior to predict rare species presence/absences and evaluates the predictions in terms of AUC. Generates Fig. 3 of the paper.

    · S08_make_training_test_split – generates train/test splits for cross-validation, ensuring at least 40% of positive samples are in each partition.

    · S09_cross-validate – fits CORAL and the baseline model to the train/test splits and calculates performance summaries. Note: we ran this once with the initial train/test split and then again with the inverse split (i.e., training = ! training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.

    · S10_show_cross-validation_results – makes plots visualizing the AUC/Tjur's R2 produced by cross-validation. Generates Fig. 4 of the paper.

    · S11a_fit_coral_models – fits the CORAL model to all 250k rare species. Computationally intensive.

    · S11b_fit_baseline_models – fits the baseline model to all 250k rare species. Computationally intensive.

    · S12_compare_posterior_inference – compares posterior climate predictions using the CORAL and baseline models on selected species, as well as the variance reduction for all species. Generates Fig. 5 of the paper.

    Pre-processing scripts:

    · P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.

    · P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.

    · P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).

    Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.

    ENA Accession numbers

    All raw sequence data are archived on mBRAVE and are publicly available in the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena; project accession number PRJEB86111; run accession numbers ERR15018787-ERR15009869; sample IDs for each accession and download URLs are provided in the file ENA_read_accessions.tsv).

  8. ImDrug Dataset

    • paperswithcode.com
    Cite
    Lanqing Li; Liang Zeng; Ziqi Gao; Shen Yuan; Yatao Bian; Bingzhe Wu; Hengtong Zhang; Yang Yu; Chan Lu; Zhipeng Zhou; Hongteng Xu; Jia Li; Peilin Zhao; Pheng-Ann Heng, ImDrug Dataset [Dataset]. https://paperswithcode.com/dataset/imdrug
    Explore at:
    Authors
    Lanqing Li; Liang Zeng; Ziqi Gao; Shen Yuan; Yatao Bian; Bingzhe Wu; Hengtong Zhang; Yang Yu; Chan Lu; Zhipeng Zhou; Hongteng Xu; Jia Li; Peilin Zhao; Pheng-Ann Heng
    Description

    ImDrug is a comprehensive benchmark, with an open-source Python library, that consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks, and 16 baseline algorithms tailored for imbalanced learning. It features modularized components, including formulation of learning settings and tasks, dataset curation, standardized evaluation, and baseline algorithms. It also provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline, such as molecular modeling, drug-target interaction, and retrosynthesis.

  9. Dataset related to the article "CAD-RADS scoring of coronary CT angiography...

    • zenodo.org
    bin
    Updated Feb 16, 2024
    Cite
    Mattia Chiesa; Daniele Andreini; Andrea Baggiano; Saima Mushtaq; Gianluca Pontone; Gualtiero Colombo (2024). Dataset related to the article "CAD-RADS scoring of coronary CT angiography with Multi-Axis Vision Transformer: A clinically-inspired deep learning pipeline" [Dataset]. http://doi.org/10.5281/zenodo.10667324
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Mattia Chiesa; Daniele Andreini; Andrea Baggiano; Saima Mushtaq; Gianluca Pontone; Gualtiero Colombo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains raw data related to the article "CAD-RADS scoring of coronary CT angiography with Multi-Axis Vision Transformer: A clinically-inspired deep learning pipeline".

    Background and objective: The standard non-invasive imaging technique used to assess the severity and extent of Coronary Artery Disease (CAD) is Coronary Computed Tomography Angiography (CCTA). However, manual grading of each patient's CCTA according to the CAD-Reporting and Data System (CAD-RADS) score is time-consuming and operator-dependent, especially in borderline cases. This work proposes a fully automated and visually explainable deep learning pipeline to be used as a decision support system for the CAD screening procedure. The pipeline performs two classification tasks: first, identifying patients who require further clinical investigation and, second, classifying patients into subgroups based on the degree of stenosis, according to commonly used CAD-RADS thresholds.
    Methods: The pipeline pre-processes multiplanar projections of the coronary arteries, extracted from the original CCTAs, and classifies them using a fine-tuned Multi-Axis Vision Transformer architecture. To emulate current clinical practice, the model is trained to assign a per-patient score by stacking the two-dimensional longitudinal cross-sections of the three main coronary arteries along the channel dimension (see the sketch below). Furthermore, it generates visually interpretable maps to assess the reliability of the predictions.
    Results: When run on a database of 1873 three-channel images of 253 patients collected at the Monzino Cardiology Center in Milan, the pipeline obtained an AUC of 0.87 and 0.93 for the two classification tasks, respectively.
    Conclusion: To our knowledge, this is the first model trained to assign CAD-RADS scores by learning solely from patient scores, without requiring finer imaging annotation steps that are not part of the clinical routine.
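    To picture the per-patient input construction described in Methods, a hedged NumPy sketch of stacking one cross-section per artery along the channel axis (shapes and artery names are illustrative assumptions, not the paper's exact dimensions):

    ```python
    # Hedged sketch: stacking cross-sections of the three main coronary arteries
    # along the channel dimension to form one three-channel input sample.
    import numpy as np

    H, W = 224, 224              # illustrative image size
    artery_1 = np.random.rand(H, W)  # longitudinal cross-section, artery 1
    artery_2 = np.random.rand(H, W)  # artery 2
    artery_3 = np.random.rand(H, W)  # artery 3

    sample = np.stack([artery_1, artery_2, artery_3], axis=0)  # (3, H, W), channels-first
    print(sample.shape)
    ```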

  10. Data from: Modeling short visual events through the BOLD Moments video fMRI...

    • openneuro.org
    Updated Jul 21, 2024
    Cite
    Benjamin Lahner; Kshitij Dwivedi; Polina Iamshchinina; Monika Graumann; Alex Lascelles; Gemma Roig; Alessandro Thomas Gifford; Bowen Pan; SouYoung Jin; N.Apurva Ratan Murty; Kendrick Kay; Radoslaw Cichy*; Aude Oliva* (2024). Modeling short visual events through the BOLD Moments video fMRI dataset and metadata. [Dataset]. http://doi.org/10.18112/openneuro.ds005165.v1.0.4
    Explore at:
    Dataset updated
    Jul 21, 2024
    Dataset provided by
    OpenNeuro, https://openneuro.org/
    Authors
    Benjamin Lahner; Kshitij Dwivedi; Polina Iamshchinina; Monika Graumann; Alex Lascelles; Gemma Roig; Alessandro Thomas Gifford; Bowen Pan; SouYoung Jin; N.Apurva Ratan Murty; Kendrick Kay; Radoslaw Cichy*; Aude Oliva*
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is the data repository for the BOLD Moments Dataset. This dataset contains brain responses to 1,102 3-second videos across 10 subjects. Each subject saw the 1,000 video training set 3 times and the 102 video testing set 10 times. Each video is additionally human-annotated with 15 object labels, 5 scene labels, 5 action labels, 5 sentence text descriptions, 1 spoken transcription, 1 memorability score, and 1 memorability decay rate.

    Overview of contents:

    The home folder (everything except the derivatives/ folder) contains the raw data in BIDS format before any preprocessing. Download this folder if you want to run your own preprocessing pipeline (e.g., fMRIPrep, HCP pipeline).

    To comply with licensing requirements, the stimulus set is not available here on OpenNeuro (hence the invalid BIDS validation). See the GitHub repository (https://github.com/blahner/BOLDMomentsDataset) to download the stimulus set and stimulus set derivatives (like frames). To make this dataset perfectly BIDS compliant for use with other BIDS-apps, you may need to copy the 'stimuli' folder from the downloaded stimulus set into the parent directory.

    The derivatives folder contains all data derivatives, including the stimulus annotations (./derivatives/stimuli_metadata/annotations.json), model weight checkpoints for a TSM ResNet50 model trained on a subset of Multi-Moments in Time, and prepared beta estimates from two different fMRIPrep preprocessing pipelines (./derivatives/versionA and ./derivatives/versionB).

    VersionA was used in the main manuscript, and versionB is detailed in the manuscript's supplementary. If you are starting a new project, we highly recommend you use the prepared data in ./derivatives/versionB/ because of its better registration, use of GLMsingle, and availability in more standard/non-standard output spaces. Code used in the manuscript is located at the derivatives version level. For example, the code used in the main manuscript is located under ./derivatives/versionA/scripts. Note that versionA prepared data is very large due to beta estimates for 9 TRs per video. See this GitHub repo for starter code demonstrating basic usage and dataset download scripts: https://github.com/blahner/BOLDMomentsDataset. See this GitHub repo for the TSM ResNet50 model training and inference code: https://github.com/pbw-Berwin/M4-pretrained

    Data collection notes: The notes below are detailed here for full transparency and should be of no concern to researchers using the data; these inconsistencies have been attended to and integrated into the BIDS format as if the exceptions had not occurred. The correct pairings between field maps and functional runs are detailed in the .json sidecars accompanying each field map scan (a sketch of reading these pairings programmatically follows the notes below).

    Subject 2: Session 1: Subject repositioned head for comfort after the third resting state scan, approximately 1 hour into the session. New scout and field map scans were taken. In the case of applying a susceptibility distortion correction analysis, session 1 therefore has two sets of field maps, denoted by “run-1” and “run-2” in the filename. The “IntendedFor” field in the field map’s identically named .json sidecar file specifies which functional scans correspond to which field map.

    Session 4: Completed over two separate days due to subject feeling sleepy. All 3 testing runs and 6/10 training runs were completed on the first day, and the last 4 training runs were completed on the second day. Each of the two days for session 4 had its own field map. This did not interfere with session 5. All scans across both days belonging to session 4 were analyzed as if they were collected on the same day. In the case of applying a susceptibility distortion correction analysis, session 4 therefore has two sets of field maps, denoted by “run-1” and “run-2” in the filename. The “IntendedFor” field in the field map’s identically named .json sidecar file specifies which functional scans correspond to which field map.

    Subject 4: Sessions 1 and 2: The fifth (out of 5) localizer run from session 1 was completed at the end of session 2 due to a technical error. This localizer run therefore used the field map from session 2. In the case of applying a susceptibility distortion correction analysis, session 1 therefore has two sets of field maps, denoted by “run-1” and “run-2” in the filename. The “IntendedFor” field in the field map’s identically named .json sidecar file specifies which functional scans correspond to which field map.

    Subject 10: Session 5: Subject moved a lot to readjust earplug after the third functional run (1 test and 2 training runs completed). New field map scans were collected. In the case of applying a susceptibility distortion correction analysis, session 5 therefore has two sets of field maps, denoted by “run-1” and “run-2” in the filename. The “IntendedFor” field in the field map’s identically named .json sidecar file specifies which functional scans correspond to which field map.
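    As referenced in the notes above, a hedged sketch of reading a field map sidecar's IntendedFor field, assuming a standard BIDS layout (the sidecar path below is hypothetical):

    ```python
    # Hedged sketch: list the functional runs a given field map applies to,
    # using the BIDS "IntendedFor" sidecar field. The path is hypothetical.
    import json
    from pathlib import Path

    sidecar = Path("sub-02/ses-01/fmap/sub-02_ses-01_run-2_phasediff.json")
    with open(sidecar) as f:
        meta = json.load(f)

    # "IntendedFor" lists functional scans relative to the subject directory.
    for func_path in meta.get("IntendedFor", []):
        print("field map", sidecar.name, "->", func_path)
    ```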

  11. FinRL-Meta Dataset

    • paperswithcode.com
    Updated May 15, 2025
    Cite
    Xiao-Yang Liu; Ziyi Xia; Jingyang Rui; Jiechao Gao; Hongyang Yang; Ming Zhu; Christina Dan Wang; Zhaoran Wang; Jian Guo (2025). FinRL-Meta Dataset [Dataset]. https://paperswithcode.com/dataset/finrl-meta
    Explore at:
    Dataset updated
    May 15, 2025
    Authors
    Xiao-Yang Liu; Ziyi Xia; Jingyang Rui; Jiechao Gao; Hongyang Yang; Ming Zhu; Christina Dan Wang; Zhaoran Wang; Jian Guo
    Description

    FinRL-Meta is a universe of market environments for data-driven financial reinforcement learning. It follows the de facto standard of OpenAI Gym and the lean principle of software development. Its unique features include a layered structure with extensibility, a training-testing-trading pipeline, and a plug-and-play mode.
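    Since the environments follow the OpenAI Gym convention, interaction reduces to the familiar reset/step loop. A hedged sketch is below; make_trading_env is a placeholder rather than FinRL-Meta's actual API, and the classic Gym 4-tuple step signature is assumed:

    ```python
    # Hedged sketch of the standard OpenAI Gym interaction loop that such
    # environments follow. `make_trading_env` is a placeholder, not the real API.
    def run_episode(make_trading_env):
        env = make_trading_env()
        state = env.reset()                  # initial market observation
        done, total_reward = False, 0.0
        while not done:
            action = env.action_space.sample()             # random policy, for illustration
            state, reward, done, info = env.step(action)   # classic Gym 4-tuple
            total_reward += reward
        return total_reward
    ```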

